MPI with compute nodes on separate subnet from head node

  • Question

  • We're trying to run a MATLAB MPI computation on our cluster and are getting "no endpoint matches the netmask" errors.  Our compute nodes are on a private subnet (192.168.6.*) and our head node is on a public subnet (134.74.77.*), with a router routing traffic between them.  When we submitted a MATLAB MPI job, it failed.  The command line task was: mpiexec -l -genvlist MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,CCP_NODES,CCP_JOBID -hosts %CCP_NODES% "d:\matlab\r2009a\bin\worker.bat"  -parallel
    and the Environment variables were: MDCE_DECODE_FUNCTION=decodeCcsSingleParallelTask,MDCE_STORAGE_LOCATION=PC{\\masternode\hpctemp\matlabtemp}:UNIX{}:,MDCE_STORAGE_CONSTRUCTOR=makeFileStorageObject,MDCE_JOB_LOCATION=Job11
    We got the following error output.  Any ideas?
    Thanks,
    Eli


    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353EAC30) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 1 unable to connect to rank 9 using business card <port=54718 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=16112:3556 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54718, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [2] terminated

    [3] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353B3290) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 3 unable to connect to rank 11 using business card <port=54716 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=4168:3564 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54716, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [4] terminated

    [5] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353A8BC0) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 5 unable to connect to rank 13 using business card <port=54706 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=11120:3560 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54706, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [6] terminated

    [7] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(172)............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000353BA530) failed
    MPIR_Comm_copy(625)..........: 
    MPIR_Get_contextid(318)......: 
    MPI_Allreduce(667)...........: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x000000000102CE70, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(260)..........: 
    MPIC_Sendrecv(120)...........: 
    MPIDI_EagerContigIsend(519)..: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv(239)........: 
    MPIDI_CH3I_Sock_connect(358).: [ch3:sock] rank 7 unable to connect to rank 15 using business card <port=54710 description="192.168.5.10 Node10 " shm_host=Node10 shm_queue=7068:3560 >
    MPIDU_Sock_post_connect(1161): 
    save_valid_endpoints(1090)...: unable to connect to 192.168.5.10 Node10  on port 54710, no endpoint matches the netmask 134.74.77.0/255.255.255.0

    [8-39] terminated

    ---- error analysis -----

    [1,3,5,7] on NODE1
    mpi has detected a fatal error and aborted d:\matlab\r2009a\bin\worker.bat

    ---- error analysis -----
    Monday, February 22, 2010 5:41 PM
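
    A rough reading of the last line in each failing rank's stack, consistent with the fix accepted below: the MPI sock channel only connects to endpoints whose network address matches the configured netmask, and here that filter is 134.74.77.0/255.255.255.0, i.e. the head node's public subnet. The check amounts to a bitwise AND of address and mask:

       192.168.5.10 AND 255.255.255.0 = 192.168.5.0   (compute node endpoint)
       134.74.77.0  AND 255.255.255.0 = 134.74.77.0   (configured filter)

    192.168.5.0 is not 134.74.77.0, so every compute-node endpoint is discarded before a connection is even attempted, no matter how routing between the two subnets is set up.
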


All replies

  • Hi,

    Is your head node also a compute node? If so, can you try the following? (A rough command sketch follows this reply.)
    1) Take the head node offline.
    2) Set the cluster environment variable CCP_MPINETMASK to your private netmask: cluscfg setenvs CCP_MPINETMASK=<YourPrivateNetMask>
    3) Run the above job again.

    Hope it helps,

    Liwei
    Tuesday, February 23, 2010 3:42 AM
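
    A rough sketch of those three steps, run from an HPC Pack command prompt on the head node. The node name HEADNODE is a placeholder, the netmask value is pinned down later in this thread, and the "node offline" step is only relevant if the head node also runs compute jobs (it can equally be done from the management console):

       REM 1) Take the head node's compute role offline
       node offline HEADNODE
       REM 2) Point MPI endpoint selection at the compute nodes' private subnet
       cluscfg setenvs CCP_MPINETMASK=<YourPrivateNetMask>
       REM 3) Resubmit the MATLAB job through the scheduler as before
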
  • The head node is not a compute node.  How do I set those cluster variables?
    Thanks.
    Eli.

    Edit: OK, I shouldn't respond to these things at night. So you mean I should run, on the head node: cluscfg setenvs CCP_MPINETMASK 192.168.6.0 ?
    Or am I missing something?
    Tuesday, February 23, 2010 3:43 AM
  • Hi, you want to run cluscfg like this:

     cluscfg setenvs CCP_MPINETMASK=<IP>/<Netmask>

    <IP> is the network address of the compute nodes' private subnet, e.g. 192.168.5.0 or 192.168.6.0. NOTE: the MPI error message points to 192.168.5.x, while you said the nodes are on 192.168.6.*. Please make sure you use the right one; running "ipconfig" on a compute node will tell you.

    <Netmask> is something like 255.255.255.0; "ipconfig" also reports the subnet mask.

    e.g. >ipconfig
       IPv4 Address. . . . . . . . . . . : 192.168.2.101
       Subnet Mask . . . . . . . . . . . : 255.255.255.0
       Default Gateway . . . . . . . . . : 192.168.2.1

    • Marked as answer by elansey Wednesday, February 24, 2010 4:40 PM
    Wednesday, February 24, 2010 2:58 PM
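
    For concreteness, a minimal sketch of the whole sequence, assuming ipconfig on a compute node reports the 192.168.5.0 / 255.255.255.0 subnet that appears in the error log (the cluscfg commands run on the head node; "cluscfg listenvs" is only there to double-check the stored value and can be skipped):

       REM On any compute node: note the IPv4 address and subnet mask
       ipconfig

       REM On the head node: restrict MPI endpoint matching to the private subnet
       cluscfg setenvs CCP_MPINETMASK=192.168.5.0/255.255.255.0

       REM Confirm the cluster-wide environment variable is stored
       cluscfg listenvs

    The new value should only affect jobs submitted after it is set, so the failing MATLAB job has to be resubmitted.
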
  •  cluscfg setenvs CCP_MPINETMASK=192.168.5.0/255.255.255.0
    or
     cluscfg setenvs CCP_MPINETMASK=192.168.5.0/24

    Like that?
    Wednesday, February 24, 2010 4:30 PM
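
    (For what it's worth, the two spellings describe the same mask: /24 means 24 leading one-bits, i.e. 11111111.11111111.11111111.00000000 = 255.255.255.0. Only the dotted form is confirmed to work later in this thread, so treat the CIDR form as untested here.)
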
  •  cluscfg setenvs CCP_MPI_NETMASK=192.168.5.0/255.255.255.0
    Works.  Woo hoo!
    Thanks.
    • Marked as answer by elansey Wednesday, February 24, 2010 4:40 PM
    Wednesday, February 24, 2010 4:38 PM
  • Hi, I know this thread is very old, but please help me resolve my issue. I was getting an error like "Rank 0 unable to connect to Rank 1", and with the help of your reply I resolved it. However, now I am getting "Rank 1 unable to connect to Rank 0".

    Your help on this is highly appreciated.

    Thursday, February 12, 2015 5:45 PM