MPI Ping-Pong failures

    Question

  • Hi everybody!

     

    I'm trying to set up a cluster with one head node and four compute nodes (Windows Server 2008 R2 HPC Edition, HPC Pack Enterprise SP1). It is a Topology 3 cluster.

    All Network Status and Network Troubleshooting tests succeed; however, when I try to run one of the MPI Performance tests, it fails most of the time. For the moment I'm testing with the head node and one compute node only.

    I observed error output from the tests as follows (Windows Firewall is disabled on all networks):

     

    Message: 
    Aborting: smpd on MU13ES34002 is unable to connect to the msmpi service on MU13ES34002-CN2:8677
    Other MPI error, error stack:
    ConnectFailed(943): unable to connect to 192.168.1.11 on port 8677, exhausted all endpoints
    ConnectFailed(934): unable to connect to 192.168.1.11 on port 8677, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
    
    
    Message: 
    Aborting: smpd on MU13ES34002 is unable to connect to the smpd manager on MU13ES34002-CN2:0
    Other MPI error, error stack:
    ConnectFailed(943): unable to connect to 192.168.1.11 on port 55366, exhausted all endpoints
    ConnectFailed(934): unable to connect to 192.168.1.11 on port 55366, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
    
    
    
    

     

    Message: 
    [ MU13ES34002-CN2#1 ] Fatal Error: MPI Failure
    [ MU13ES34002-CN2#1 ] Error Details:
    [ MU13ES34002-CN2#1 ]  Other MPI error, error stack:
    [ MU13ES34002-CN2#1 ]  PMPI_Allgather(648).....: MPI_Allgather(sbuf=0x00000000000BFBB0, scount=128, MPI_CHAR, rbuf=0x00000000006A9E00, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
    [ MU13ES34002-CN2#1 ]  MPIR_Allgather(162).....: 
    [ MU13ES34002-CN2#1 ]  MPIC_Sendrecv(173)......: 
    [ MU13ES34002-CN2#1 ]  MPIDI_CH3I_Progress(235): 
    [ MU13ES34002-CN2#1 ]  ConnectFailed(951)......: [ch3:sock] failed to connnect to remote process 7F8B66E8-5BCC-481f-B3FE-B13BE92F5B48:0
    [ MU13ES34002-CN2#1 ]  ConnectFailed(943)......: unable to connect to 192.168.1.1 on port 60310, exhausted all endpoints
    [ MU13ES34002-CN2#1 ]  ConnectFailed(934)......: unable
    
    
    job aborted:
    [ranks] message
    
    [0] terminated
    
    [1] application aborted
    aborting MPI_COMM_WORLD, error 250, comm rank 1
    
    ---- error analysis -----
    
    [1] on MU13ES34002-CN2
    mpipingpong.exe aborted the job. abort code 250
    
    

    This happens on the private network as well as on the application network. I did get the latency test to run twice on the private network with quite good results (75-100 microseconds latency on the GigE interface).

    Apart from the rough connectivity checks sketched below, I have no idea how to solve this problem or how to investigate it further.
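
    These are the kinds of basic checks I mean (telnet requires the Telnet Client feature; netstat and sc are standard Windows tools; the msmpi service name and port 8677 are simply taken from the error messages above, so treat this only as a rough sketch):

    rem From the head node: can the msmpi/smpd port on the compute node be reached at all?
    telnet 192.168.1.11 8677

    rem On the compute node: is anything listening on that port, and is the service running?
    netstat -an | findstr 8677
    sc query msmpi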

    So any advice would be most appreciated!!

    Monday, May 16, 2011 8:37 AM

Answers

  • This looks like a netmask problem. MPI processes use the netmask to limit which hosts they are allowed to exchange packets with. To solve the specific problem you experienced, you have two options:

    1. Set the MPICH_NETMASK environment variable cluster-wide via the cluscfg command:

    cluscfg setenvs MPICH_NETMASK=192.168.0.0/255.255.0.0

     

    2. Set the environment variable on the mpiexec command line:

    mpiexec -env MPICH_NETMASK 192.168.0.0/255.255.0.0

    Note that option 2 takes precedence over option 1: if both are set, the value passed to mpiexec overrides the cluster-wide default.

    Another thing to pay attention to is that the setting in option 1 is global and takes effect for all MPI jobs, while option 2 only takes effect for the current job. A combined sketch follows below.
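
    A minimal sketch of both options, assuming the netmask value above (which also covers the 192.168.1.x private network from your logs); MyApp.exe is just a placeholder for whatever MPI binary you run, and cluscfg listenvs is only there to double-check what was stored:

    rem Option 1: set the cluster-wide default, then verify it
    cluscfg setenvs MPICH_NETMASK=192.168.0.0/255.255.0.0
    cluscfg listenvs

    rem Option 2: per-job override on the mpiexec command line
    mpiexec -n 2 -env MPICH_NETMASK 192.168.0.0/255.255.0.0 MyApp.exe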

    Please let me know whether it works for you.

    Thanks,

    James

    • Marked as answer by Renderflash Tuesday, May 17, 2011 6:59 AM
    Monday, May 16, 2011 11:07 PM

All replies

  • Thank you so much, you just saved my day!

    The strange thing is that the CCP_MPI_NETMASK variable was set correctly (I assume this is set by the Network wizard). Shouldn't that variable do exactly what you described?
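
    In case it helps anyone else who wants to double-check, something along these lines should show what a compute node actually has (clusrun is the standard HPC Pack remote-execution command; the node name is just the one from my logs, and "set CCP" simply lists every variable whose name starts with CCP):

    rem List the CCP_* variables (including CCP_MPI_NETMASK) as the compute node sees them
    clusrun /nodes:MU13ES34002-CN2 set CCP
    rem And MPICH_NETMASK, if it is defined there
    clusrun /nodes:MU13ES34002-CN2 set MPICH_NETMASK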

    Anyways, thanks again, I owe you some beers ;-)

    Tuesday, May 17, 2011 7:40 AM