MPI with compute nodes on separate subnet from head node


  • Monday, February 22, 2010 17:41
     
     
    We're trying to run a MATLAB MPI computation on our cluster and are getting "no endpoint matches the netmask" errors.  Our compute nodes are on a private subnet (192.168.6.*) and our head node is on a public subnet (134.74.77.*).  We have a router set up to route traffic between the compute nodes and the head node.  Even so, the MATLAB MPI job we submitted fails.  The command line task was:
    mpiexec -l -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,CCP_NODES,CCP_JOBID -hosts %CCP_NODES% "d:\matlab\r2009a\bin\worker.bat"  -parallel
    and the environment variables were:
    MDCE_DECODE_FUNCTION=decodeCcsSingleParallelTask,MDCE_STORAGE_LOCATION=PC{\\masternode\hpctemp\matlabtemp}:UNIX{}:,MDCE_STORAGE_CONSTRUCTOR=makeFileStorageObject,MDCE_JOB_LOCATION=Job11
    We got the following error output.  Any ideas?
    Thanks,
    Eli


    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353EAC30) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 1 unable to connect to rank 9 using business card <port=54718 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=16112:3556 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54718, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [2] terminated

    [3] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353B3290) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 3 unable to connect to rank 11 using business card <port=54716 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=4168:3564 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54716, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [4] terminated

    [5] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353A8BC0) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 5 unable to connect to rank 13 using business card <port=54706 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=11120:3560 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54706, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [6] terminated

    [7] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353BA530) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 7 unable to connect to rank 15 using business card <port=54710 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=7068:3560 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54710, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [8-39] terminated

    ---- error analysis -----

    [1,3,5,7] on NODE1
    mpi has detected a fatal error and aborted d:\matlab\r2009a\bin\worker.bat

    ---- error analysis -----
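
    The last line of each error stack says that the address advertised by Node10 (192.168.5.10) does not fall inside the netmask 134.74.77.0/255.255.255.0. A minimal Python sketch of that kind of subnet test follows. It only illustrates the arithmetic and is not MS-MPI's actual code; the helper name endpoint_matches is hypothetical.

```python
import ipaddress


def endpoint_matches(endpoint: str, netmask: str) -> bool:
    """Return True if `endpoint` lies inside the `<IP>/<Netmask>` filter.

    Hypothetical helper for illustration; not part of MS-MPI.
    """
    network = ipaddress.ip_network(netmask, strict=False)
    return ipaddress.ip_address(endpoint) in network


# Values copied from the error output above.
mpi_filter = "134.74.77.0/255.255.255.0"
node10 = "192.168.5.10"

print(endpoint_matches(node10, mpi_filter))                   # False
print(endpoint_matches(node10, "192.168.5.0/255.255.255.0"))  # True
```

    The first check fails for the same reason the job does: the filter describes the head node's public subnet, while the compute node advertises a private address.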

All replies

  • Tuesday, February 23, 2010 03:42
     
     
    Hi,

    Is your head node also a compute node? If so, can you try the following:
    1) take the head node offline
    2) set the cluster environment variable CCP_MPI_NETMASK to your private netmask: cluscfg setenvs CCP_MPI_NETMASK=<YourPrivateNetMask>
    3) run the above job again

    Hope it helps,

    Liwei
  • Tuesday, February 23, 2010 03:43
     
     
    The head node is not a compute node.  How do I set those cluster variables?
    Thanks.
    Eli.

    Edit: OK, I shouldn't respond to these things at night.  So, you mean I should run this on the head node:  cluscfg setenvs CCP_MPINETMASK 192.168.6.0 ?
    Or am I missing something?
  • Wednesday, February 24, 2010 14:58
     
     Answered

    Hi, you want to run cluscfg like this:

     cluscfg setenvs CCP_MPI_NETMASK=<IP>/<Netmask>

    <IP> is the network address, something like 192.168.5.0 or 192.168.6.0. NOTE: the MPI error message shows 192.168.5.x, but you said the subnet is 192.168.6.*. Please make sure you use the right one. You can run "ipconfig" to find out.

    <Netmask> is something like 255.255.255.0. Run "ipconfig" and it will tell you your subnet mask.

    e.g. >ipconfig
       IPv4 Address. . . . . . . . . . . : 192.168.2.101
       Subnet Mask . . . . . . . . . . . : 255.255.255.0
       Default Gateway . . . . . . . . . : 192.168.2.1
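
    The <IP>/<Netmask> value can be derived from the two ipconfig fields by masking the adapter address with the subnet mask. Here is a minimal Python sketch using the standard ipaddress module. The helper name netmask_value is hypothetical, and the 255.255.255.0 mask used for Node10 is an assumption; check it against the real ipconfig output on a compute node.

```python
import ipaddress


def netmask_value(ipv4_address: str, subnet_mask: str) -> str:
    """Build an `<IP>/<Netmask>` string from one adapter's ipconfig values.

    Hypothetical helper for illustration only.
    """
    interface = ipaddress.ip_interface(f"{ipv4_address}/{subnet_mask}")
    # .network zeroes the host bits; .with_netmask prints the dotted mask form.
    return interface.network.with_netmask


# Node10's address from the MPI error; the mask here is an ASSUMPTION.
print(netmask_value("192.168.5.10", "255.255.255.0"))   # 192.168.5.0/255.255.255.0

# The sample adapter from the ipconfig output above.
print(netmask_value("192.168.2.101", "255.255.255.0"))  # 192.168.2.0/255.255.255.0
```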

    • Marked as answer by elansey Wednesday, February 24, 2010 16:40
  • Wednesday, February 24, 2010 16:30
     
     
     cluscfg setenvs CCP_MPINETMASK=192.168.5.0/255.255.255.0
    or
     cluscfg setenvs CCP_MPINETMASK=192.168.5.0/24

    Like that?
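
    For reference, the two notations in this question describe the same network. The short Python check below shows only that equivalence. Whether cluscfg accepts the /24 prefix form is not confirmed anywhere in this thread; only the dotted-mask form is reported to work.

```python
import ipaddress

# The same network written with a dotted mask and with a prefix length.
mask_form = ipaddress.ip_network("192.168.5.0/255.255.255.0")
prefix_form = ipaddress.ip_network("192.168.5.0/24")

print(mask_form == prefix_form)   # True
print(prefix_form.with_netmask)   # 192.168.5.0/255.255.255.0
```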
  • Wednesday, February 24, 2010 16:38
     
     Answered
     cluscfg setenvs CCP_MPI_NETMASK=192.168.5.0/255.255.255.0
    Works.  Woo hoo!
    Thanks.
    • Marked as answer by elansey Wednesday, February 24, 2010 16:40