MPI communication problems using multiple nodes.
-
January 19, 2010 17:31
We are attempting to execute the following task on the HPC cluster:
mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe
We are receiving the following error from HPC:
job aborted:
[ranks] message
[0] terminated
[1] fatal error
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(171).......: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000002EE750) failed
MPIR_Comm_copy(625).....:
MPIR_Get_contextid(318).:
MPI_Allreduce(666)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x00000000002EE500, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(259).....:
MPIC_Sendrecv(123)......:
MPIC_Wait(277)..........:
MPIDI_CH3I_Progress(244): handle_sock_op failed
ConnectFailed(1061).....: [ch3:sock] failed to connnect to remote process 698535A4-19A3-4d13-BE18-7CB84774A059:0
ConnectFailed(986)......: unable to connect to 10.138.147.15 on port 63829, exhausted all endpoints
ConnectFailed(977)......: unable to connect to 10.138.147.15 on port 63829, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
---- error analysis -----
[1] on PETDEVHPC02
mpi has detected a fatal error and aborted tstconsoleapp.exe
---- error analysis -----
We are running on an enterprise network, and diagnostics on the MPI message-passing configuration show the following, which looks OK to us:
C:\Windows\system32>cluscfg listenvs
WCF_NETWORKPREFIX=Enterprise
CCP_MPI_NETMASK=10.138.144.0/255.255.248.0
CCP_CLUSTER_NAME=PETDEVHPC03
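The claim that the netmask "looks OK" can be checked mechanically. The following is a minimal sketch in Python (not part of the original post) that tests whether the address from the error stack falls inside the network given by CCP_MPI_NETMASK. The helper name `in_mpi_network` is hypothetical; the two values are copied from the output above.

```python
import ipaddress

# Values copied from the post: the CCP_MPI_NETMASK setting and the
# address that rank 1 could not connect to.
MPI_NETWORK = ipaddress.ip_network("10.138.144.0/255.255.248.0")
FAILING_ADDRESS = "10.138.147.15"

def in_mpi_network(address, network=MPI_NETWORK):
    """Return True if the address lies inside the network MPI is restricted to."""
    return ipaddress.ip_address(address) in network

if __name__ == "__main__":
    print(MPI_NETWORK.with_prefixlen)       # the same network in prefix notation
    print(in_mpi_network(FAILING_ADDRESS))  # True means the netmask covers the address
```

If the check passes, the netmask is selecting the intended network and the timeout is more likely caused by something blocking the connection than by MPI picking the wrong interface.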
Can anyone help with this issue please?
Many Thanks
Richard
All replies
-
January 19, 2010 21:51
Hi Richard,
Can you double check the way you run the MPI task?
To run your command 'mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe', the job must request these 2 nodes from the scheduler. From the command line, it will be something like
job submit /numnodes:2 /askednodes:PETDEVHPC01,PETDEVHPC02 [OtherOptions] mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe
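The relationship between the two commands can be sketched in Python (not part of the original reply): the node list requested from the scheduler and the host list passed to mpiexec are built from the same mapping, so they cannot drift apart. The function names are hypothetical, and only the options that appear in this thread (`/numnodes`, `/askednodes`, `-hosts`) are used.

```python
def mpiexec_command(procs_per_host, program):
    """Build 'mpiexec -hosts N host1 n1 host2 n2 ... program'."""
    parts = ["mpiexec", "-hosts", str(len(procs_per_host))]
    for host, count in procs_per_host.items():
        parts += [host, str(count)]
    parts.append(program)
    return " ".join(parts)

def job_submit_command(procs_per_host, program):
    """Wrap the mpiexec call in a job submit that asks for the same nodes."""
    nodes = ",".join(procs_per_host)
    return (f"job submit /numnodes:{len(procs_per_host)} "
            f"/askednodes:{nodes} {mpiexec_command(procs_per_host, program)}")

if __name__ == "__main__":
    hosts = {"PETDEVHPC01": 1, "PETDEVHPC02": 1}
    print(job_submit_command(hosts, "tstconsoleapp.exe"))
```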
Hope this resolves your issue.
Liwei
-
January 19, 2010 22:01
Hi Richard,
From the error message, it looks more like a firewall issue. When you run MPI programs on an enterprise-only network, the firewalls on all the nodes should either
1) open the ports for all MPI programs/services, including your MPI test.
or
2) turn off all the firewalls.
I recommend that you try 2) first. If it works, then you may consider 1) to make it more secure.
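Before changing firewall settings, it can help to confirm that plain TCP traffic between the two nodes is being blocked. Errno 10060 is the Winsock timeout code (WSAETIMEDOUT), which is what a silently dropped connection attempt typically produces. Below is a minimal sketch in Python, not an MPI tool: `open_listener` and `can_connect` are hypothetical helper names, and the idea is to run the listener on one node and the probe from the other. MPI chooses its ports at run time, so the port in the error stack (63829) will not be listening after the job has aborted; pick any free port for the test.

```python
import socket

def open_listener(bind_host="0.0.0.0", port=0):
    """Open a listening TCP socket; port=0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((bind_host, port))
    server.listen(1)
    return server, server.getsockname()[1]

def can_connect(host, port, timeout=5.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # covers timeouts and refused connections
        return False, str(exc)

if __name__ == "__main__":
    # Local self-check only. Between nodes, call open_listener() on one node
    # and can_connect("<other node>", <port>) from the other.
    server, port = open_listener("127.0.0.1")
    print(can_connect("127.0.0.1", port))
    server.close()
```

A probe that times out while the listener is running on the other node points to a firewall dropping the traffic; a probe that succeeds suggests looking elsewhere.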
Liwei
Marked as answer by Don Pattee (Moderator), January 12, 2011 2:50
-
May 20, 2011 0:41
You can use hpcfwutil to add an exception for your app.
For example:
clusrun hpcfwutil register lizard c:\apps\lizard\xhplmkl.exe