MPI job aborts for unexplained reasons
<p>Hi there--<br/><br/>We're upgrading from CCS Server 2003 to HPC Pack 2008 and moving our existing C++ and C# MPI programs over. Generally things are going well, but one C++ MPI program is failing.<br/><br/>We submit a job, it runs for a while, and then it crashes.<br/><br/>In the task properties, the error message is: &quot;Task failed during execution with exit code -4. Please check task's output for error details.&quot;<br/><br/>The &quot;Output:&quot; box in the HPC Job Manager is empty. However, the log file that captured stdout/stderr from the job shows:</p> <pre> Aborting: smpd on MT-CCS-01 failed to communicate with smpd on MT-CCS-11 Other MPI error, error stack: ReadFailed(1317): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060) </pre> <p>There is no indication of a failure in the logs on MT-CCS-11. Under 2003, I would look in %windir%\pchealth\ErrorRep\UserDumps, but that directory doesn't exist here. Should I enable error reporting? Any suggestions on how to diagnose this?<br/><br/>Thanks for your help!</p>
Chris Quirk (http://social.microsoft.com/Profile/en-US/?user=Chris%20Quirk)
Tue, 12 May 2009 22:12:14 Z
http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/822cb08f-699f-41e2-ac73-016dc645175d#822cb08f-699f-41e2-ac73-016dc645175d
© 2009 Microsoft Corporation. All rights reserved.