
Question: MPI job aborts for unexplained reasons

  • Tuesday, 12 May, 2009 22:12, Chris Quirk

    Hi there--

    We're upgrading from CCS Server 2003 to HPC Pack 2008 and moving our existing C++ and C# MPI programs over.  Most of the migration is going fine, but one C++ MPI program is failing.

    We fire off a job, it runs for a while, and then it crashes.

    In the task properties, the Error message is:  "Task failed during execution with exit code -4. Please check task's output for error details."

    The "Output:" box in the HPC Job Manager is empty.  However, if we look in the log file that captured stdout/stderr from the job, we see:

       Aborting: smpd on MT-CCS-01 failed to communicate with smpd on MT-CCS-11
       Other MPI error, error stack:
       ReadFailed(1317): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)
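
For reference, errno 10060 is Winsock's WSAETIMEDOUT: a TCP connect or read that never got an answer from the other side. One way to narrow this down is to test plain TCP reachability between the two nodes outside of MPI. The snippet below is only a minimal sketch of that idea, not part of HPC Pack; the helper name `probe_tcp` is hypothetical, and the host and port to probe are assumptions you would fill in for your own cluster.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a single TCP connection and report how it ended.

    Hypothetical helper for illustration; returns "open", "timeout",
    "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the host name and connects,
        # giving up after `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: the same kind of symptom as errno 10060.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Run from MT-CCS-01 against MT-CCS-11 on whatever port smpd listens on in your installation: a "timeout" result would point at the network or a firewall, while "open" would suggest the remote process itself stopped responding partway through the job.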
    

    There is no indication of failures in the logs on MT-CCS-11.  Under 2003, I would look in %windir%\pchealth\ErrorRep\UserDumps, but that directory doesn't exist here.  Should I enable error reporting?  Any suggestions on how to diagnose this?

    Thanks for your help!