MPI communication problems using multiple nodes.


  • January 19, 2010 17:31
     
     
    In attempting to execute the following task on the HPC cluster:

    mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe 

    We are receiving the following error from HPC:

    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(171).......: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x00000000002EE750) failed
    MPIR_Comm_copy(625).....:
    MPIR_Get_contextid(318).:
    MPI_Allreduce(666)......: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x00000000002EE500, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(259).....:
    MPIC_Sendrecv(123)......:
    MPIC_Wait(277)..........:
    MPIDI_CH3I_Progress(244): handle_sock_op failed
    ConnectFailed(1061).....: [ch3:sock] failed to connnect to remote process 698535A4-19A3-4d13-BE18-7CB84774A059:0
    ConnectFailed(986)......: unable to connect to 10.138.147.15 on port 63829, exhausted all endpoints
    ConnectFailed(977)......: unable to connect to 10.138.147.15 on port 63829, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)

    ---- error analysis -----

    [1] on PETDEVHPC02
    mpi has detected a fatal error and aborted tstconsoleapp.exe

    ---- error analysis -----

    We are running on an enterprise network, and the diagnostics for the MPI network configuration show the following, which looks OK to us:

     C:\Windows\system32>cluscfg listenvs

    WCF_NETWORKPREFIX=Enterprise

    CCP_MPI_NETMASK=10.138.144.0/255.255.248.0

    CCP_CLUSTER_NAME=PETDEVHPC03
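    One way to sanity-check the netmask above is to confirm that the address from the error stack (10.138.147.15) actually falls inside CCP_MPI_NETMASK. The following is a minimal Python sketch using only the standard library; the helper name in_mpi_netmask is hypothetical and is not part of any HPC tool.

```python
import ipaddress


def in_mpi_netmask(address: str, netmask: str) -> bool:
    """Return True if `address` lies inside the network written as 'base/mask'."""
    # ip_network accepts the dotted netmask form, e.g. 10.138.144.0/255.255.248.0
    network = ipaddress.ip_network(netmask, strict=True)
    return ipaddress.ip_address(address) in network


# Values copied from the error stack and from the cluscfg listenvs output.
print(in_mpi_netmask("10.138.147.15", "10.138.144.0/255.255.248.0"))
```

    If this prints True, the failing endpoint is on the subnet MPI was told to use, which points away from a netmask problem and toward something blocking the connection.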

    Can anyone help with this issue please?

    Many Thanks

    Richard

     

     

All Replies

  • January 19, 2010 21:51
     
     
    Hi Richard,

    Can you double-check the way you run the MPI task?

    To run your command 'mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe', the job must request these two nodes from the scheduler. From the command line, it would be something like:

    job submit /numnodes:2 /askednodes:PETDEVHPC01,PETDEVHPC02  [OtherOptions]  mpiexec -hosts 2 PETDEVHPC01 1 PETDEVHPC02 1 tstconsoleapp.exe

    Hope this resolves your issue.

    Liwei
  • January 19, 2010 22:01
     
     Answered
    Hi Richard,

    From the error message, this looks more like a firewall issue. When you run MPI programs over an enterprise-only network, the firewall on every node must either
      1) allow traffic for all MPI programs and services, including your MPI test,
    or
      2) be turned off.

    I recommend that you try 2) first. If that works, you can then switch to 1) to make the setup more secure.

    Liwei
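
    To tell a firewall block apart from other failures, it can help to test plain TCP reachability between the two nodes outside of MPI. Below is a minimal Python sketch using only the standard library. The helper name can_connect is hypothetical, and the sketch assumes something is listening on the target port: the port in the error stack (63829) appears to be chosen per run, so in practice you would start a test listener on one node and probe it from the other.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (as in errno 10060), refusals, and unreachable hosts.
        return False
```

    For example, calling can_connect("10.138.147.15", 63829) from PETDEVHPC01 while a listener is running on that port would show whether the connection is being dropped. A timeout with the firewall on and a success with it off is consistent with the firewall explanation.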
  • May 20, 2011 0:41
     
     

    You can use hpcfwutil to add an exception for your app.

     

    For example:

     

    clusrun hpcfwutil register lizard c:\apps\lizard\xhplmkl.exe