MPI Ping-Pong failures
-
Monday, May 16, 2011 08:37
Hi everybody!
I'm trying to set up a cluster with one head node and four compute nodes (Windows Server 2008 R2 HPC Edition, HPC Pack Enterprise SP1). It is a Topology 3 cluster.
All Network Status and Network Troubleshooting tests succeed; however, when I try to run one of the MPI Performance tests, it fails most of the time. For the moment I'm testing with the head node and one compute node only.
I observed the following error output from the tests (Windows Firewall is disabled for all networks):
Message: Aborting: smpd on MU13ES34002 is unable to connect to the msmpi service on MU13ES34002-CN2:8677
  Other MPI error, error stack:
  ConnectFailed(943): unable to connect to 192.168.1.11 on port 8677, exhausted all endpoints
  ConnectFailed(934): unable to connect to 192.168.1.11 on port 8677, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
Message: Aborting: smpd on MU13ES34002 is unable to connect to the smpd manager on MU13ES34002-CN2:0
  Other MPI error, error stack:
  ConnectFailed(943): unable to connect to 192.168.1.11 on port 55366, exhausted all endpoints
  ConnectFailed(934): unable to connect to 192.168.1.11 on port 55366, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
Message:
  [ MU13ES34002-CN2#1 ] Fatal Error: MPI Failure
  [ MU13ES34002-CN2#1 ] Error Details:
  [ MU13ES34002-CN2#1 ] Other MPI error, error stack:
  [ MU13ES34002-CN2#1 ] PMPI_Allgather(648).....: MPI_Allgather(sbuf=0x00000000000BFBB0, scount=128, MPI_CHAR, rbuf=0x00000000006A9E00, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
  [ MU13ES34002-CN2#1 ] MPIR_Allgather(162).....:
  [ MU13ES34002-CN2#1 ] MPIC_Sendrecv(173)......:
  [ MU13ES34002-CN2#1 ] MPIDI_CH3I_Progress(235):
  [ MU13ES34002-CN2#1 ] ConnectFailed(951)......: [ch3:sock] failed to connnect to remote process 7F8B66E8-5BCC-481f-B3FE-B13BE92F5B48:0
  [ MU13ES34002-CN2#1 ] ConnectFailed(943)......: unable to connect to 192.168.1.1 on port 60310, exhausted all endpoints
  [ MU13ES34002-CN2#1 ] ConnectFailed(934)......: unable
  job aborted:
  [ranks] message
  [0] terminated
  [1] application aborted aborting MPI_COMM_WORLD, error 250, comm rank 1
  ---- error analysis -----
  [1] on MU13ES34002-CN2 mpipingpong.exe aborted the job. abort code 250
This happens on the private network as well as on the application network. I did manage to get the Latency test to run twice on the private network, with quite good results (75 - 100 usecs latency, GigE interface).
I have no idea how to solve this problem or how to investigate further what the cause could be.
So any advice would be most appreciated!!
All replies
-
Monday, May 16, 2011 23:07
This looks like a netmask problem. MPI processes use the netmask to limit which hosts they are allowed to exchange packets with. To solve the specific problem you experienced, you have two options:
1. Set the MPICH_NETMASK environment variable via the cluscfg command:
cluscfg setenvs MPICH_NETMASK=192.168.0.0/255.255.0.0
2. Set the environment variable on the mpiexec command line:
mpiexec -env MPICH_NETMASK 192.168.0.0/255.255.0.0
Note that option 2 takes precedence over option 1: if both are set, the value passed to mpiexec overrides the default set with cluscfg.
Also keep in mind that the setting in option 1 is global and takes effect for all MPI jobs, while the setting in option 2 only takes effect for the current job.
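To see why the value 192.168.0.0/255.255.0.0 covers both addresses that appear in the error output (192.168.1.1 and 192.168.1.11), here is a minimal Python sketch of ordinary subnet matching. It only illustrates the address arithmetic; how MS-MPI applies MPICH_NETMASK internally is not shown in this thread. The helper name matches_netmask and the address 10.0.0.5 are made up for illustration.

```python
import ipaddress


def matches_netmask(address: str, netmask_spec: str) -> bool:
    """Return True if `address` falls inside the network given as 'base/mask'.

    Hypothetical helper for illustration; it is not part of MS-MPI or HPC Pack.
    """
    # ip_network accepts the dotted 'address/netmask' form used in the thread.
    network = ipaddress.ip_network(netmask_spec)
    return ipaddress.ip_address(address) in network


NETMASK = "192.168.0.0/255.255.0.0"

# The first two addresses come from the error output; 10.0.0.5 is an
# invented address included only to show a non-matching case.
for addr in ("192.168.1.1", "192.168.1.11", "10.0.0.5"):
    print(addr, matches_netmask(addr, NETMASK))
```

Both cluster addresses from the log fall inside the 192.168.0.0/16 range, while the invented 10.0.0.5 address does not.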
Please let me know whether it works for you.
Thanks,
James
- Marked as answer by Renderflash, Tuesday, May 17, 2011 06:59
-
Tuesday, May 17, 2011 07:40
Thank you so much, you just saved my day!
The strange thing is that the CCP_MPI_NETMASK variable was set correctly (I assume this is set by the Network wizard). Shouldn't that variable do exactly what you described?
Anyway, thanks again, I owe you some beers ;-)