This page aims to help you find out why your QCG-OMPI job failed: at which stage of the job the failure occurred, and how to solve the problem.

We describe a method for investigating the cause of the failure, and what to do in each situation you may encounter.

Use the diagnostic tools

The very first thing to do is to switch to debug mode. Your QCG-OMPI job will then be much more verbose, which should help you locate the problem.

To display the debug information, add -d anywhere in your command line before the -- (double dash).
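As an illustration of where the flag goes, here is a minimal Python sketch that places -d before the double dash in an argument list. The launcher name "your-launcher" and the application "./my_app" are placeholders, not actual QCG-OMPI names; only the -d flag and the -- separator come from this page.

```python
def add_debug_flag(argv, flag="-d"):
    """Return a copy of argv with `flag` placed before the first "--".

    Only the part of the command line before "--" is considered, so a
    "-d" that belongs to the application itself is left alone.
    """
    args = list(argv)
    if "--" not in args:
        raise ValueError('command line has no "--" separator')
    sep = args.index("--")
    if flag in args[:sep]:
        return args  # debug mode is already enabled
    return args[:sep] + [flag] + args[sep:]


# Placeholder command line: the launcher name is hypothetical.
command = ["your-launcher", "--", "./my_app"]
print(" ".join(add_debug_flag(command)))  # prints: your-launcher -d -- ./my_app
```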

Read the execution trace

From the execution trace, you should be able to see which component had a problem and crashed. A component that dies may set off a chain reaction: other components may crash in turn, and the trace will display "process ... died unexpectantly". You therefore have to find which component died first.

Each component produces its own execution trace, and debug mode displays them one after another. You can also look into the individual components' traces to find an error message.
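If you have saved the trace to a text file, the search for the first failure can be sketched in Python as below. This is only a sketch under assumptions: it supposes that the lines appear in chronological order, and it searches only for messages quoted on this page. The sample lines are invented for illustration and are not real QCG-OMPI output.

```python
import re

# Fragments of the messages quoted on this page, matched case-insensitively.
PATTERNS = [
    r"died unexpect",
    r"unable to bind socket",
    r"permission denied",
]
ERROR_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def first_error(lines):
    """Return (line_number, text) of the first matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_RE.search(line):
            return number, line.rstrip("\n")
    return None


# Invented sample trace: the earliest match is the likely root cause,
# later ones may only be part of the chain reaction.
sample = [
    "component A: starting\n",
    "component B: Unable to bind socket: Address already in use\n",
    "process ... died unexpectantly\n",
]
print(first_error(sample))
```

To scan a real trace, pass an open file object instead of the sample list, since iterating over a file yields its lines.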

Common error messages caused by misconfigurations

One very common problem is a component trying to use a port that is already in use by another program. The corresponding error message is "Unable to bind socket: Address already in use". It can be caused by two situations:

The latter can happen especially if your cleanup script is incorrectly configured and misses some of the machines you are actually using. In that case, components from a previous execution may still be running and keeping ports busy.
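To check whether a given port is busy on a machine, one option is to try binding it yourself. The sketch below uses Python's standard socket module and assumes a TCP port. This page does not say which ports the components use, so the port number is something you have to take from your own configuration, and the check has to be run on the machine concerned.

```python
import errno
import socket


def port_is_busy(port, host=""):
    """Return True if binding a TCP socket to (host, port) fails
    with "Address already in use"; other errors are re-raised."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise
    return False
```

For example, `port_is_busy(12345)` tests port 12345 on all local interfaces (12345 is an arbitrary placeholder). If the port is busy and no other program is supposed to use it, look for components left over from a previous run.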

Another common error is a malformed configuration file. Most often, a cluster has the wrong number of machines: the number of machines listed does not match the number declared in the cluster's description.
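The consistency check itself is simple, as the following Python sketch shows. The configuration file format is not described on this page, so the dictionaries below (with the hypothetical keys "name", "declared" and "machines") are only a stand-in for whatever you obtain after reading your own file.

```python
def count_mismatches(clusters):
    """Return (name, declared, actual) for each cluster whose declared
    machine count differs from the number of machines listed."""
    problems = []
    for cluster in clusters:
        actual = len(cluster["machines"])
        if actual != cluster["declared"]:
            problems.append((cluster["name"], cluster["declared"], actual))
    return problems


# Hypothetical data: cluster "b" declares 3 machines but lists only 2.
clusters = [
    {"name": "a", "declared": 2, "machines": ["a1", "a2"]},
    {"name": "b", "declared": 3, "machines": ["b1", "b2"]},
]
for name, declared, actual in count_mismatches(clusters):
    print(f"cluster {name}: {declared} machines declared, {actual} listed")
```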

If there is a read/write permission problem in the directory you specified for temporary files, you will get an error message about "permission denied" or "wrong permissions". If you did not specify a temporary directory, the default one (/tmp) is used; it may contain files that do not belong to you, which you are therefore not allowed to delete or overwrite.
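Both situations can be checked before launching a job. The Python sketch below assumes a Unix-like system (it relies on os.getuid) and uses only the standard library. The function names are illustrative, and which files QCG-OMPI actually creates in the temporary directory is not stated here, so the second function simply lists every entry owned by another user.

```python
import os
import tempfile


def can_use_tmp_dir(directory):
    """True if `directory` exists and the current user may read,
    write and traverse it."""
    return os.path.isdir(directory) and os.access(
        directory, os.R_OK | os.W_OK | os.X_OK
    )


def foreign_files(directory):
    """Names of entries in `directory` owned by another user."""
    me = os.getuid()
    names = []
    for entry in os.scandir(directory):
        try:
            owner = entry.stat(follow_symlinks=False).st_uid
        except OSError:
            continue  # entry vanished or cannot be inspected
        if owner != me:
            names.append(entry.name)
    return sorted(names)


if __name__ == "__main__":
    tmp = tempfile.gettempdir()  # usually /tmp on Unix-like systems
    print(tmp, "usable:", can_use_tmp_dir(tmp))
    print("entries owned by other users:", len(foreign_files(tmp)))
```

If the default directory contains many entries owned by other users, specifying a temporary directory of your own avoids the conflict.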
