We're upgrading from CCS Server 2003 to HPC Pack 2008, moving our existing C++ and C# MPI programs over. Generally things seem to be going well, but one C++ MPI program is failing.
We fire off a job, and things work for a while, then we get a crash.
In the task properties, the Error message is: "Task failed during execution with exit code -4. Please check task's output for error details."
The "Output:" box in the HPC Job manager is empty. However if we look in the log file that captured stdout/stderr from the job, we see:
Aborting: smpd on MT-CCS-01 failed to communicate with smpd on MT-CCS-11 Other MPI error, error stack: ReadFailed(1317): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
There is no indication of failures in the logs on mt-ccs-11. Under 2003, I would look in %windir%\pchealth\ErrorRep\UserDumps, but that directory isn't here. Should I enable error reporting? Any suggestions on how to diagnose this?
Thanks for your help!
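[Editorial note: on Server 2008 the pchealth\ErrorRep\UserDumps location from 2003 no longer exists; per-process crash dumps can instead be enabled through Windows Error Reporting's LocalDumps registry key (available from Vista SP1 / Server 2008 onward). A minimal sketch, run elevated on each compute node; the dump folder C:\Dumps is an arbitrary choice and must exist:]

```shell
:: Enable WER per-process crash dumps (Server 2008 / Vista SP1+); run elevated.
:: C:\Dumps is a placeholder folder -- create it first or pick your own.
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d "C:\Dumps" /f
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpCount /t REG_DWORD /d 10 /f
:: DumpType 2 = full dump (1 = minidump)
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 2 /f
```

After a crash, look in the configured folder for a .dmp file named after the failing process and open it in WinDbg or Visual Studio.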
December 9, 2009 6:32 (Moderator)
Chris, it doesn't look like anyone ever responded to your post. Sorry about that. Were you able to get your application working on our 2008 version? You shouldn't have had to do much for it to work...
January 20, 2010 0:20
Chris, I have seen this kind of error before. The problem is that the cluster network configuration got messed up; reconfiguring the network fixed it. Could you try reconfiguring the network and see whether that solves your problem?
January 20, 2010 3:15
Hi Chris,
From the error message, this appears to be related to a firewall blocking MPI communication. Can you try the following?
1) Turn off the firewalls on all compute nodes.
If this fixes the issue, the problem is firewall-related.
2) If 1) works, you can restore the firewalls, then add your MPI program to the firewall exceptions for all active profiles on all the compute nodes.
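[Editorial note: step 2 can be sketched with clusrun (the HPC Pack head-node command that runs a command on the compute nodes) and netsh advfirewall, which replaced the old netsh firewall context in Server 2008. The paths "C:\apps\myapp.exe" and the rule names are placeholders, and %CCP_HOME% is assumed to point at the HPC Pack install directory:]

```shell
:: Run from the head node; clusrun executes the command on the compute nodes.
:: "C:\apps\myapp.exe" is a placeholder for the MPI program's path on each node.
clusrun netsh advfirewall firewall add rule name="MyMPIApp" dir=in action=allow program="C:\apps\myapp.exe" profile=any
:: smpd may also need an exception if one is not already present:
clusrun netsh advfirewall firewall add rule name="smpd" dir=in action=allow program="%CCP_HOME%bin\smpd.exe" profile=any
```

Using profile=any avoids the rule silently not applying when a node's connection is classified under a different firewall profile than expected.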
Hope this helps.
- Marked as answer by Don Pattee (Moderator), January 12, 2011 2:50