This page helps you find out why your QCG-OMPI job failed: at
which stage of the job the failure occurred and how to solve
the problem.
It describes a method for investigating the cause of the
failure and what to do in each situation you may encounter.
The very first thing to do is to switch to debug mode. This
makes your QCG-OMPI job much more verbose, and the extra
output will hopefully tell you where the problem is.
To display the debug information, add -d somewhere in your
command line, before the -- (double dash).
From the execution trace, you should be able to see which
component had a problem and crashed. A component that dies
may trigger a chain reaction: other components may crash in
turn, and the trace will display "process ... died
unexpectantly". You therefore have to find which component
died first.
Each component produces its own execution trace, and debug
mode displays them one after the other. You can also look
into the individual components' traces to find an error
message.
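
The search for the first failure can be partly automated. The
following is a minimal sketch, assuming the debug output has
been saved as plain text, one message per line; the function
name first_failure is hypothetical, and the patterns are only
the messages quoted on this page, not an exhaustive list of
what QCG-OMPI may print.

```python
import re

# Messages quoted on this page. A "died ..." line is treated as a
# symptom of the chain reaction rather than as a root cause.
ROOT_CAUSE = re.compile(
    r"unable to bind socket|address already in use|permission denied",
    re.IGNORECASE,
)
CHAIN_REACTION = re.compile(r"died unexpect", re.IGNORECASE)


def first_failure(lines):
    """Return (line_number, text) for the most likely first failure.

    Prefers the earliest root-cause message, falls back to the
    earliest "died ..." message, and returns None if neither occurs.
    """
    fallback = None
    for number, line in enumerate(lines, start=1):
        if ROOT_CAUSE.search(line):
            return number, line.rstrip()
        if fallback is None and CHAIN_REACTION.search(line):
            fallback = (number, line.rstrip())
    return fallback
```

Any iterable of strings works as input, for example an open
file object for the saved trace. The result is only a hint
about where to start reading; the surrounding lines of the
trace still have to be checked by hand.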
One very common problem is a component trying to use a port
that is already in use by another program. The corresponding
error message is "Unable to bind socket: Address already in
use". It can have two causes:
- You reused the ports of the previous execution and launched
this one shortly after the other. Even after a socket has
been closed in a "clean" way and the program has ended, the
connection is kept for a moment in the TIME_WAIT state (the
duration is set by the OS kernel, most often 1 minute). This
state makes sure that all the data has gone through and
catches stray packets that arrive for that connection after
it is closed. This is why you are advised to use a different
port range for two successive executions.
- Another application is already using this port. You can
check which ports are in use with the netstat tool:
netstat -lapte should tell you which ports are used for TCP
connections (QCG-OMPI uses TCP sockets).
The latter can happen in particular if your cleanup script is
incorrectly configured and misses some of the machines that
you are actually using. In that case, some components from a
previous execution may still be running and keeping ports
busy.
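
Before launching a job, you can test whether the ports you
plan to use are free on a given machine. The following is a
minimal sketch, not part of QCG-OMPI: the helper names
port_is_free and busy_ports are hypothetical. It simply tries
to bind a TCP socket to each port without setting
SO_REUSEADDR, on the assumption that a port that cannot be
bound here would also make a component fail with "Address
already in use".

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # No SO_REUSEADDR on purpose: a port that is held by another
            # program (or otherwise unavailable) makes bind() fail.
            sock.bind((host, port))
        except OSError:
            return False
        return True


def busy_ports(first, last, host=""):
    """Return the ports in [first, last] that cannot be bound on this machine."""
    return [p for p in range(first, last + 1) if not port_is_free(p, host)]
```

The check only covers the machine it runs on, so it has to be
run on every machine of the job. It does not say which program
holds a port; netstat remains the tool for that.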
Another common error is a malformed configuration file: most
often, a wrong number of machines in a cluster (the number of
machines listed does not match the number declared in the
description of the cluster).
If there is a read/write permission problem in the directory
you specified for temporary files, you will get an error
message about "permission denied" or "wrong permissions". If
you did not specify any tmp directory, the default one (/tmp)
may contain files that do not belong to you, and that you are
therefore not allowed to delete or overwrite.
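
Both situations can be checked before launching the job. The
following is a minimal sketch for a POSIX system; the helper
names directory_is_usable and foreign_entries are
hypothetical, and the prefix parameter is only an
illustration, since the names of the temporary files created
by QCG-OMPI are not given here.

```python
import os


def directory_is_usable(directory="/tmp"):
    """Return True if the current user can read, write and enter `directory`."""
    return os.path.isdir(directory) and os.access(
        directory, os.R_OK | os.W_OK | os.X_OK
    )


def foreign_entries(directory="/tmp", prefix=""):
    """List entries of `directory` starting with `prefix` that are
    not owned by the current user (and so may not be overwritable)."""
    me = os.getuid()
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            try:
                owner = entry.stat(follow_symlinks=False).st_uid
            except OSError:
                continue  # entry vanished or cannot be inspected
            if owner != me:
                found.append(entry.path)
    return sorted(found)
```

Files owned by other users in /tmp are normal on a shared
machine, so the result of foreign_entries is only useful once
it is narrowed down to the files your job actually tries to
write.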