MPI with compute nodes on separate subnet from head node
-
Monday, February 22, 2010 17:41
We're trying to run a MATLAB MPI computation on our cluster and getting "no endpoint matches the netmask" errors. Our compute nodes are on a private subnet (192.168.6.*) and our head node is on a public subnet (134.74.77.*). We have a router set up to route traffic between the compute nodes and the head node. However, when we submitted a MATLAB MPI job, it failed.
The command line task was:
mpiexec -l -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,CCP_NODES,CCP_JOBID -hosts %CCP_NODES% "d:\matlab\r2009a\bin\worker.bat" -parallel
and the environment variables were:
MDCE_DECODE_FUNCTION=decodeCcsSingleParallelTask,MDCE_STORAGE_LOCATION=PC{\\masternode\hpctemp\matlabtemp}:UNIX{}:,MDCE_STORAGE_CONSTRUCTOR=makeFileStorageObject,MDCE_JOB_LOCATION=Job11
We got the following error output. Any ideas?
Thanks,
Eli

job aborted:
[ranks] message

[0] terminated

[1] fatal error
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353EAC30) failed
MPIR_Comm_copy(625)..........:
MPIR_Get_contextid(318)......:
MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(260)..........:
MPIC_Sendrecv(120)...........:
MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(239)........:
MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 1 unable to connect to rank 9 using business card <port=54718 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=16112:3556 >
MPIDU_Sock_post_connect(1161):
save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10 on port 54718, no endpoint matches the netmask 134.74.77.0/255.255.255.0

[2] terminated

[3] fatal error
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353B3290) failed
MPIR_Comm_copy(625)..........:
MPIR_Get_contextid(318)......:
MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(260)..........:
MPIC_Sendrecv(120)...........:
MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(239)........:
MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 3 unable to connect to rank 11 using business card <port=54716 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=4168:3564 >
MPIDU_Sock_post_connect(1161):
save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10 on port 54716, no endpoint matches the netmask 134.74.77.0/255.255.255.0

[4] terminated

[5] fatal error
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353A8BC0) failed
MPIR_Comm_copy(625)..........:
MPIR_Get_contextid(318)......:
MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(260)..........:
MPIC_Sendrecv(120)...........:
MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(239)........:
MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 5 unable to connect to rank 13 using business card <port=54706 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=11120:3560 >
MPIDU_Sock_post_connect(1161):
save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10 on port 54706, no endpoint matches the netmask 134.74.77.0/255.255.255.0

[6] terminated

[7] fatal error
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353BA530) failed
MPIR_Comm_copy(625)..........:
MPIR_Get_contextid(318)......:
MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(260)..........:
MPIC_Sendrecv(120)...........:
MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
MPIDI_CH3_iSendv(239)........:
MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 7 unable to connect to rank 15 using business card <port=54710 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=7068:3560 >
MPIDU_Sock_post_connect(1161):
save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10 on port 54710, no endpoint matches the netmask 134.74.77.0/255.255.255.0

[8-39] terminated

---- error analysis -----

[1,3,5,7] on NODE1
mpi has detected a fatal error and aborted d:\matlab\r2009a\bin\worker.bat

---- error analysis -----
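
The last line of each error stack names both the address MPI tried to reach (192.168.5.10) and the netmask it was restricted to (134.74.77.0/255.255.255.0). A minimal Python sketch, using only the standard `ipaddress` module, shows why those two cannot match. The helper name `endpoint_matches` is hypothetical, and the assumption that MPI performs a plain subnet-membership test is inferred from the wording of the message, not from the MPI source.

```python
import ipaddress


def endpoint_matches(address: str, netmask_spec: str) -> bool:
    """Return True if `address` lies inside the network written as 'IP/mask'.

    Hypothetical helper: it only mimics the check that the error message
    describes ("no endpoint matches the netmask ...").
    """
    network = ipaddress.ip_network(netmask_spec, strict=False)
    return ipaddress.ip_address(address) in network


# Address advertised in the business card of Node10 (from the error output).
node10 = "192.168.5.10"

# Netmask named in the error message: the head node's public subnet.
print(endpoint_matches(node10, "134.74.77.0/255.255.255.0"))  # False

# The subnet the error output actually shows for Node10.
print(endpoint_matches(node10, "192.168.5.0/255.255.255.0"))  # True

# The subnet quoted in the post (192.168.6.*) would not match either.
print(endpoint_matches(node10, "192.168.6.0/255.255.255.0"))  # False
```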
All replies
-
Tuesday, February 23, 2010 03:42
Hi,
Is your head node also a compute node? If so, can you try the following:
1) Take the head node offline.
2) Set the cluster environment variable CCP_MPINETMASK to your private netmask: cluscfg setenvs CCP_MPINETMASK <YourPrivateNetMask>
3) Run the above job again.
Hope it helps,
Liwei -
Tuesday, February 23, 2010 03:43
The head node is not a compute node. How do I set those cluster variables?
Thanks.
Eli.
Edit: OK, I shouldn't respond to these things at night. So, you mean I should run, on the head node:
cluscfg setenvs CCP_MPINETMASK 192.168.6.0
Or am I missing something?
-
Wednesday, February 24, 2010 14:58
Hi, you want to run cluscfg like this:
cluscfg setenvs CCP_MPINETMASK <IP>/<Netmask>
<IP> is the network address of your private subnet, something like 192.168.5.0 or 192.168.6.0. NOTE: the MPI error message shows 192.168.5.x, but your post says 192.168.6.*. Please make sure you use the right one. You can run "ipconfig" to find out.
<Netmask> is something like 255.255.255.0. Run "ipconfig" and it will tell you your subnet mask.
e.g.
>ipconfig
IPv4 Address. . . . . . . . . . . : 192.168.2.101
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 192.168.2.1
- Marked as answer by elansey, Wednesday, February 24, 2010 16:40
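
For readers unsure how to turn the two ipconfig values into the `<IP>/<Netmask>` form, here is a small Python sketch using the standard `ipaddress` module. The function name `netmask_value` is hypothetical; the sample address and mask are the ones from the ipconfig example in this reply, not from the poster's cluster.

```python
import ipaddress


def netmask_value(ipv4_address: str, subnet_mask: str) -> str:
    """Build the '<IP>/<Netmask>' string from one adapter's ipconfig values.

    Hypothetical helper: it zeroes the host bits of the address to get the
    network address, then appends the dotted subnet mask.
    """
    interface = ipaddress.ip_interface(f"{ipv4_address}/{subnet_mask}")
    return interface.network.with_netmask


# Values taken from the ipconfig example in this reply.
print(netmask_value("192.168.2.101", "255.255.255.0"))
# prints 192.168.2.0/255.255.255.0
```

The same call with a compute node's address, such as the 192.168.5.10 shown in the error output, would give 192.168.5.0/255.255.255.0.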
-
Wednesday, February 24, 2010 16:30
cluscfg setenvs CCP_MPINETMASK=192.168.5.0/255.255.255.0
or
cluscfg setenvs CCP_MPINETMASK=192.168.5.0/24
Like that?
-
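
As general networking background, the two spellings asked about describe the same subnet: a /24 prefix and a 255.255.255.0 mask both fix the first 24 bits. The Python sketch below checks this with the standard `ipaddress` module. It says nothing about which spelling cluscfg or MPI accepts; the thread only reports the dotted form as working.

```python
import ipaddress

# Same subnet written with a dotted mask and with a prefix length.
dotted = ipaddress.ip_network("192.168.5.0/255.255.255.0")
prefix = ipaddress.ip_network("192.168.5.0/24")

print(dotted == prefix)    # True
print(dotted.prefixlen)    # 24
print(prefix.netmask)      # 255.255.255.0
```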
Wednesday, February 24, 2010 16:38
cluscfg setenvs CCP_MPI_NETMASK=192.168.5.0/255.255.255.0
Works. Woo hoo!
Thanks.
- Marked as answer by elansey, Wednesday, February 24, 2010 16:40