MPI communication problems using multiple nodes

  • Question

  • In attempting to execute the following task on the HPC cluster:

    mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe 

    We are receiving the following error from HPC:

    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(171).......: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000002EE750) failed
    MPIR_Comm_copy(625).....:
    MPIR_Get_contextid(318).:
    MPI_Allreduce(666)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x00000000002EE500, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(259).....:
    MPIC_Sendrecv(123)......:
    MPIC_Wait(277)..........:
    MPIDI_CH3I_Progress(244): handle_sock_op failed
    ConnectFailed(1061).....: [ch3:sock] failed to connnect to remote process 698535A4-19A3-4d13-BE18-7CB84774A059:0
    ConnectFailed(986)......: unable to connect to 10.138.147.15 on port 63829, exhausted all endpoints
    ConnectFailed(977)......: unable to connect to 10.138.147.15 on port 63829, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)

    ---- error analysis -----

    [1] on PETDEVHPC02
    mpi has detected a fatal error and aborted tstconsoleapp.exe

    ---- error analysis -----

    We are running on an enterprise network, and diagnostics on the MPI message-passing settings reveal the following, which looks OK to us:

     C:\Windows\system32>cluscfg listenvs

    WCF_NETWORKPREFIX=Enterprise

    CCP_MPI_NETMASK=10.138.144.0/255.255.248.0

    CCP_CLUSTER_NAME=PETDEVHPC03
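
    (For reference, 255.255.248.0 is a /21 mask, so CCP_MPI_NETMASK spans 10.138.144.0 through 10.138.151.255 and does include the 10.138.147.15 address from the error stack, which is why the netmask looks correct to us.)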

    Can anyone help with this issue please?

    Many Thanks

    Richard

    Tuesday, January 19, 2010 5:31 PM

Answers

  • Hi Richard,

    From the error message, it looks more like a firewall issue. When you run MPI programs over the enterprise-only network, the firewalls on all the nodes should either
      1) allow the ports used by all MPI programs/services, including your MPI test,
    or
      2) be turned off.

    I recommend that you try 2) first. If it works, then you may consider 1) to make it more secure.
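
    As a rough sketch of both options, run from the head node (the netsh syntax below is the standard Windows Firewall tooling; the install path for the test program is just a placeholder):

      REM Option 2: temporarily turn the firewall off on every node
      clusrun netsh advfirewall set allprofiles state off

      REM Option 1: add an inbound exception for the test program on every node
      clusrun netsh advfirewall firewall add rule name="tstconsoleapp" dir=in action=allow program="C:\apps\tstconsoleapp.exe" enable=yes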

    Liwei
    Tuesday, January 19, 2010 10:01 PM

All replies

  • Hi Richard,

    Can you double check the way you run the MPI task?

    To run your command 'mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe', the scheduler must have those two nodes allocated to the job. From the command line, it would be something like

    job submit /numnodes:2 /askednodes:PETDEVHPC01,PETDEVHPC02  [OtherOptions]  mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe
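
    If you do not need to pin those two specific hosts, a simpler variant (assuming the HPC Pack mpiexec picks up the node allocation from the scheduler, which it normally does when launched inside a job) is to let the scheduler choose the nodes:

    job submit /numnodes:2 [OtherOptions] mpiexec tstconsoleapp.exe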

    Hope this resolves your issue.

    Liwei
    Tuesday, January 19, 2010 9:51 PM
  • You can use hpcfwutil to add an exception for your app. For example:

    clusrun hpcfwutil register lizard c:\apps\lizard\xhplmkl.exe
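
    Applied to the program from the original question, that would look something like the following (the path is only a placeholder; use the location where tstconsoleapp.exe is actually deployed on the nodes):

    clusrun hpcfwutil register tstconsoleapp c:\apps\tstconsoleapp.exe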

    Friday, May 20, 2011 12:41 AM