Beginner question: run error in a cluster

  • Question

  • Contents of _out.txt:

    Master process 0 is running on 'cluster1'.

    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Send: Other MPI error, error stack:
    MPI_Send(175)...........: MPI_Send(buf=0x000000000019F740, count=41, MPI_CHAR, dest=0, tag=0, MPI_COMM_WORLD) failed
    MPIDI_CH3I_Progress(244): handle_sock_op failed
    ConnectFailed(1061).....: [ch3:sock] failed to connnect to remote process 9294E292-D6B2-41a4-89B2-573449D1AEEA:0
    ConnectFailed(986)......: unable to connect to 192.168.10.10 on port 55979, exhausted all endpoints
    ConnectFailed(977)......: unable to connect to 192.168.10.10 on port 55979, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)

    ---- error analysis -----

    [1] on CLUSTER2
    mpi has detected a fatal error and aborted MPIProject.exe

    ---- error analysis -----

    How did this happen, and what can I do to solve this problem?

    Wednesday, October 3, 2012 7:21 AM

All replies

  • Looks like a connection failure. Can you check the following? (Example commands are sketched below the list.)

    1. Firewall settings on cluster1 and cluster2

    2. Connectivity between the two; try to ping 192.168.10.10 from cluster2
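
    For example, a quick sanity check run from cluster2 might look like the following (the address comes from the log above, and the findstr filter is just a guess at how the MSMPI firewall rules are named):

        rem Check basic reachability of cluster1 from cluster2
        ping 192.168.10.10

        rem List firewall rules that mention MPI (rule names may differ per deployment)
        netsh advfirewall firewall show rule name=all | findstr /i "mpi"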


    Friday, October 5, 2012 12:09 AM
  • This may be related to the MPI framework. In my case, error 10060 happens randomly when I run a single job on 250+ cores. This looks like some sort of MPI defect.

    Daniel Drypczewski

    Thursday, November 8, 2012 7:53 AM
  • Hi Daniel,

    We usually see this error in two different scenarios when using sockets with MSMPI.  The first is, as Michael pointed out in his original reply, that the firewall is blocking the app.  By default, the MPI service, the MPI process manager, and mpiexec have firewall rules created during compute node deployment in an HPC cluster environment.  Individual MPI applications, however, do not, and users must ensure that their application can get through the firewall.  This condition would manifest itself 100% of the time, so it is unlikely to be your issue.

    I suspect you are hitting the second issue, which is connections timing out at the TCP/IP level.  If you have many processes connecting to one concurrently, the OS will start rejecting connections if the backlog gets too high, or if the application (MSMPI in this case) does not accept the connection in a timely manner.  By default, MSMPI will retry any connection 5 times.  If the process being connected to is busy doing some computation, it is likely you are using up all retries before it gets around to accepting the connection (MSMPI does not currently do this in the background, so you must make some call into MPI to make progress on connections).

    There is an environment variable, MPICH_CONNECT_RETRIES, that lets you increase the number of connection retries, and I would suggest experimenting with different values to see if your problem goes away.  Since you say the problem is random, you may not need to increase the retry count much.  You could start by adding "-env MPICH_CONNECT_RETRIES 10" to your mpiexec command line, and then adjust the number of retries based on experimentation in your actual environment.
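
    For example, the retry count could be raised roughly like this (a sketch only; the host list and executable name below are placeholders, and 10 retries is just a starting value to experiment with):

        mpiexec -hosts 2 cluster1 1 cluster2 1 -env MPICH_CONNECT_RETRIES 10 MPIProject.exe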

    Hope that helps,
    -Fab

    Thursday, June 20, 2013 6:29 PM
  • Hello,

    I am using HPC 2016. I have 1 head node (8 cores) and 1 compute node (8 cores), so in total I have 16 cores available to run a job.
    When I run a simple job, i.e. the HelloWorld example, using both nodes and all 16 available cores, the job runs absolutely fine.
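
    For reference, a standalone two-node, 16-core launch of that HelloWorld example would look roughly like this (a sketch; HEADNODE and COMPUTENODE are placeholder machine names, and when submitted through the HPC job scheduler the node allocation is handled for you instead):

        mpiexec -hosts 2 HEADNODE 8 COMPUTENODE 8 HelloWorld.exe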

    Now, the problem is:
    My application contains mpi_app.exe. When I try to run my own application program as a job, it shows the following error:

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------
    2019-11-27 15:13:32 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app
    2019-11-27 15:13:32 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app
    2019-11-27 15:13:32 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app
    2019-11-27 15:13:32 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app
    2019-11-27 15:13:33 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app
    2019-11-27 15:13:33 DEBUG main() C:\Quellen\Integral_7_18\main\src\mpi\mpi_app\mpi_app.cpp(314) : start mpi_app

    job aborted:
    [ranks] message

    [0-3] terminated

    [4] fatal error
    Fatal error in MPI_Allgather: Other MPI error, error stack:
    MPI_Allgather(sbuf=0x00000093A6B8F220, scount=128, MPI_CHAR, rbuf=0x00000280D127CC60, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
    [ch3:sock] failed to connnect to remote process c2128a76-54e2-46ab-aa84-f0e25a3e06d5:2
    unable to connect to 134.130.175.7 on port 62453, exhausted all endpoints
    unable to connect to 134.130.175.7 on port 62453, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)

    [5] terminated

    ---- error analysis -----

    [4] on SRV-HPC2016-01
    mpi has detected a fatal error and aborted mpi_app.exe

    ---- error analysis -----
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------

    As shown, the error is for mpi_app.exe.
    134.130.175.7 is the head node.
    First of all, I checked the connection between the two nodes, i.e. the compute node and the head node, and it is fine (verified by pinging and by running the HelloWorld example).

    When I add mpi_app.exe as an inbound rule in the firewall settings on the compute node and on the head node and enable this rule, then it works. But for this, I have to give the absolute path to wherever mpi_app.exe exists every time.
    And when I disable this rule for mpi_app.exe, it shows the above error again.
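
    For reference, the inbound rule can also be created from the command line, roughly like this (the rule name and the path are placeholders for the actual values):

        rem Placeholder rule name and path; substitute the real location of mpi_app.exe
        netsh advfirewall firewall add rule name="mpi_app" dir=in action=allow program="C:\path\to\mpi_app.exe" enable=yes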

    Further, when I run my own application program as a job on Microsoft HPC 2012, the job runs absolutely fine. I did not add any extra inbound rule in the firewall settings on HPC 2012.

    The difference between the Microsoft HPC 2012 default firewall settings and the Microsoft HPC 2016 default firewall settings is the following:

    In Microsoft HPC 2012, it has the following MPI rules by default:
    MSMPI-ETL2CLOG
    MSMPI-ETL2OTF
    MSMPI-MPISYNC
    MSMPI-MPIEXEC
    MSMPI-SMPD

    In Microsoft HPC 2016, it has the following MPI rules by default:
    MSMPI-LaunchSvc
    MSMPI-MPIEXEC
    MSMPI-SMPD

    So I don’t understand why the job runs absolutely fine with the Microsoft HPC 2012 default firewall settings but not with the Microsoft HPC 2016 default firewall settings (where I have to add a rule for mpi_app.exe and give its absolute path).

    Can you please have a look at it and tell me where the actual problem is?

    Thank you very much.

    Regards,
    Choudhry Sharjeel Ahmad
    Thursday, November 28, 2019 10:07 AM
  • Hi Sharjeel,

    I suppose we have already solved this problem. It is a firewall issue. To run an MPI application on an HPC Pack cluster, you need to explicitly allow the MPI application through the firewall. Just clusrun the following command on all compute nodes:

     Hpcfwutil register mpi_app <file path for mpi_app.exe>
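
    For example, from the head node the registration could be pushed to all compute nodes roughly like this (a sketch; the node group name and the path to mpi_app.exe are placeholders):

        rem Register mpi_app with the firewall on every node in the ComputeNodes group
        clusrun /nodegroup:ComputeNodes hpcfwutil register mpi_app "\\headnode\apps\mpi_app.exe"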

    Alternatively, you can create custom firewall rules to allow the connection.

    Regards,

    Yutong Sun

    Friday, December 6, 2019 8:17 AM