Hi there--
We're upgrading from CCS Server 2003 to HPC Pack 2008 and moving our existing C++ and C# MPI programs over. Most things are going well, but one C++ MPI program is failing.
We submit a job, it runs for a while, and then it crashes.
In the task properties, the Error message is: "Task failed during execution with exit code -4. Please check task's output for error details."
The "Output:" box in the HPC Job Manager is empty. However, if we look in the log file that captured stdout/stderr from the job, we see:
Aborting: smpd on MT-CCS-01 failed to communicate with smpd on MT-CCS-11
Other MPI error, error stack:
ReadFailed(1317): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
There is no indication of failures in the logs on MT-CCS-11. Under 2003, I would look in %windir%\pchealth\ErrorRep\UserDumps, but that directory doesn't exist here. Should I enable error reporting? Any suggestions on how to diagnose this?
Thanks for your help!