MPI Error
-
November 8, 2011 13:20
Hi.
I'm setting up a Windows HPC Server 2008 R2 SP1 cluster (HPC Pack 2008 R2 SP2). It consists of one head node and five workstations, and its purpose is to run Abaqus jobs. Smaller jobs execute successfully, but when I run larger jobs (abaqus -job "jobname" -cpus 8 -input CCshort_layup_FO_load.inp) I get the following error messages:
[0] process exited without calling finalize
[1] fatal error
Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(150)..........: MPI_Send(buf=0x0000000031C10040, count=833568, MPI_BYTE, dest=0, tag=0, comm=0x84000000) failed
MPIDI_CH3I_Progress(235):
RecvFailed(105).........:
ReadFailed(1274)........: An existing connection was forcibly closed by the remote host. (errno 10054)
[2-3] terminated
[0] on client1
C:\Users\ADMINI~1\AppData\Local\Temp\Administrator_ava_2692\jobname.bat ended prematurely and may have crashed. exit code 0xc0000374
All nodes have Intel Core 2 CPUs @ 2.13GHz with 2GB of RAM. The workstations run Windows 7 x64 SP1. Do you have any idea what is causing this behavior? Please help me out.
- Edited by Zorky, November 11, 2011 13:44
All replies
-
November 30, 2011 19:09
There are several different problems that can lead to this error:
1) Firewall issues: Is a firewall enabled on the network this MPI job runs on? If so, has Abaqus been added to the list of allowed applications? You can use hpcfwutil to register Abaqus (assuming the executable is abq692.exe):
clusrun hpcfwutil register abaqus C:\abaqus\692\abq692.exe
2) Network issues: If the firewall isn't the problem, have you tried running the MPI diagnostics on the cluster to see whether the network is functioning properly? Select the nodes you want to test, right-click, choose "Run Diagnostics", then select all three MPI diagnostic tests. Check that the results all look good.
3) If neither of the suggestions above helps, can you give more details on the cluster setup? In particular, the network configuration of the cluster (does MPI run on a private network or the enterprise network, is NetworkDirect enabled, etc.).
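Alongside these checks, it may help to decode the two numeric codes in the log (errno 10054 and exit code 0xc0000374). The sketch below is only a hand-written lookup; the table and the `describe` helper are illustrative assumptions, not something that ships with HPC Pack or Abaqus. The meanings listed are the standard Windows definitions of those two values.

```python
# Hand-written lookup for the two numeric codes seen in the log.
# KNOWN_CODES and describe() are hypothetical names used for illustration.
KNOWN_CODES = {
    10054: "WSAECONNRESET: connection reset by peer",
    0xC0000374: "STATUS_HEAP_CORRUPTION: a heap has been corrupted",
}


def describe(code):
    """Return a short description for a known code, or a fallback string."""
    return KNOWN_CODES.get(code, "unknown code")


# Codes copied from the error output in the original post.
print(describe(10054))
print(describe(int("0xc0000374", 16)))
```

Read this way, the connection reset reported by rank 1 could simply be a consequence of rank 0 crashing, so a problem inside the rank 0 process is worth ruling out as well as the network.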
-
December 10, 2011 16:25
Well, I'm having a similar problem after installing HPC Pack SP3; before that I was using HPC Pack R2. By the way, the firewall service is disabled on all the compute nodes. I tried enabling it again and applying the first option you suggested, but nothing changed. The MPI diagnostics aren't working either; they report errors on all nodes. Here is the error from the Radioss run. I hope you can help:
ROOT: shells_regular_00 RESTART: 0001
NUMBER OF HMPP PROCESSES 48
10/12/2011
job aborted:
[ranks] message
[0] fatal error
Fatal error in PMPI_Gather: Invalid buffer pointer, error stack:
PMPI_Gather(583): MPI_Gather(sbuf=0x000000000B70EC00, scount=10, MPI_DOUBLE_PRECISION, rbuf=0x000000000B70EC00, rcount=10, MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
PMPI_Gather(507): Buffers must not be aliased
[1-15] terminated
---- error analysis -----
[0] on HEXANALIZSRV01
mpi has detected a fatal error and aborted \\hexanalizsrv01\hw11\e_11.0_win64_msmpi.exe
---- error analysis -----
-
January 4, 2012 17:40
Can you post the error output from the failing MPI diagnostics?
The Radioss error indicates that the application is passing the same buffer as both the send buffer and the receive buffer, which the MPI standard prohibits (although some MPI implementations allow it). The application has to use different buffers for sending and receiving, or use MPI_IN_PLACE:
http://www.mpi-forum.org/docs/mpi-20-html/node145.htm
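As a rough aid for spotting this condition in a log, the sketch below pulls the `sbuf=` and `rbuf=` addresses out of an error stack line and compares them. The `aliased_buffers` helper is a hypothetical name, and it assumes the line has the same format as the one posted above. It only detects identical start addresses, not partially overlapping buffers.

```python
import re

# Matches the sbuf=/rbuf= fields as they appear in the posted error stack.
_BUF_RE = re.compile(r"\b(sbuf|rbuf)=0x([0-9A-Fa-f]+)")


def aliased_buffers(error_line):
    """Return the shared address if sbuf and rbuf are identical, else None."""
    fields = {name: int(addr, 16) for name, addr in _BUF_RE.findall(error_line)}
    if "sbuf" in fields and "rbuf" in fields and fields["sbuf"] == fields["rbuf"]:
        return fields["sbuf"]
    return None


# Line copied from the Radioss error output above.
line = (
    "PMPI_Gather(583): MPI_Gather(sbuf=0x000000000B70EC00, scount=10, "
    "MPI_DOUBLE_PRECISION, rbuf=0x000000000B70EC00, rcount=10, "
    "MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed"
)

addr = aliased_buffers(line)
if addr is not None:
    print(f"send and receive buffers alias at 0x{addr:016X}")
```

A match here points at the application's own MPI_Gather call rather than at the cluster network, so the fix would have to come from the application side.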
--Anh
-
January 28, 2012 1:03
What is the output of the MPI diagnostics for latency and throughput?
Thanks,
James
-
January 31, 2012 14:37
Hi everybody!
I'm sorry that I haven't replied earlier.
The problem that caused my error messages was a corrupt file: the job file "CCshort_layup_FO_load.inp". Once I got a new job file, the job ran, and I never received the error messages again.
Thank you all for trying to help me!
Best regards Zorky
Zorky HPC
- Marked as answer by Zorky, January 31, 2012 14:37