MPI job aborts for unexplained reasons

  • Question

  • Hi there--

    We're upgrading from CCS Server 2003 to HPC Pack 2008, moving our existing C++ and C# MPI programs over.  Generally things seem to be progressing, but one C++ MPI program is failing.

    We fire off a job, and things work for a while, then we get a crash.

    In the task properties, the Error message is:  "Task failed during execution with exit code -4. Please check task's output for error details."

    The "Output:" box in HPC Job Manager is empty.  However, if we look in the log file that captured stdout/stderr from the job, we see:

       Aborting: smpd on MT-CCS-01 failed to communicate with smpd on MT-CCS-11
       Other MPI error, error stack:
       ReadFailed(1317): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)
    

    There's no indication of failures in the logs at mt-ccs-11.  Under 2003, I would look for %windir%\pchealth\ErrorRep\UserDumps, but that directory isn't here.  Should I enable error reporting?  Any suggestions on how to diagnose?
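    Would something like the following be the right way to get user dumps back on 2008?  Just a sketch of what I had in mind (assuming Server 2008 SP1 or later on the nodes; the dump folder is an arbitrary example path):

       rem Ask Windows Error Reporting to keep full user-mode crash dumps locally.
       rem DumpType 2 = full dump; D:\dumps is only an example location.
       reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d "D:\dumps" /f
       reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 2 /f
       reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpCount /t REG_DWORD /d 10 /f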

    Thanks for your help!

    Tuesday, May 12, 2009 10:12 PM

Answers

  • Hi Chris,

    From the error message, it seems to be related to a firewall blocking MPI communication. Can you try the following?

    1) Turn off the firewall on all compute nodes.
    If this fixes the issue, it is firewall-related.

    2) If 1) works, you can turn the firewalls back on, then add your MPI program to the firewall exceptions (for all active profiles) on all the compute nodes; see the sketch below.
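
    From the head node, something along these lines should do it. This is just a sketch: the rule name and program path are placeholders for your app, and clusrun here targets the compute nodes (add /nodes: or /nodegroup: if you only want a subset).

       rem Step 1 (diagnosis only): turn Windows Firewall off on the compute nodes.
       clusrun netsh advfirewall set allprofiles state off

       rem Step 2: turn it back on and allow the MPI executable through.
       rem ("MyMpiApp" and C:\apps\MyMpiApp.exe are placeholders for your program.)
       clusrun netsh advfirewall set allprofiles state on
       clusrun netsh advfirewall firewall add rule name="MyMpiApp" dir=in action=allow program="C:\apps\MyMpiApp.exe" enable=yes profile=any

    (If I remember right, HPC Pack also ships an hpcfwutil utility that can register the exception for you, if you prefer that route.)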

    Hope this helps.

    Liwei
    Wednesday, January 20, 2010 3:15 AM

All replies

  • Chris, it doesn't look like anyone ever responded to your post. Sorry about that.

    Were you able to get your application working on our 2008 version? You shouldn't have had to do much for it to work...
    Wednesday, December 9, 2009 6:32 AM
    Moderator
  • Chris, I have seen this kind of error before. In that case the cluster network configuration had gotten messed up, and reconfiguring the network fixed the problem. Could you try reconfiguring the network and see whether that solves your problem?

    James 
    Wednesday, January 20, 2010 12:20 AM