Intermittent MPI failure

  • Question

  •  

    So we’ve got an MPI job that’s running along fine for about 8 minutes of a 1-hour job (19-node cluster, 8 processes per node, public gigabit network only).

     

    Then, all of a sudden, we get a failure – looks like a TCP/IP problem:

    job aborted:

    rank: node: exit code: message

    0: MT-DIST-30: terminated

    1: MT-DIST-30: terminated

    …snip…

    112: MT-DIST-26: terminated

    113: MT-DIST-26: terminated

    114: MT-DIST-26: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:

    MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed

    MPIDI_CH3I_Progress(165): handle_sock_op failed

    handle_sock_read(530)...:

    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)

    115: MT-DIST-26: terminated

    116: MT-DIST-26: terminated

    …snip…

    135: MT-DIST-25: terminated

    136: MT-DIST-28: terminated

    137: MT-DIST-28: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:

    MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed

    MPIDI_CH3I_Progress(165): handle_sock_op failed

    handle_sock_read(530)...:

    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)

    138: MT-DIST-28: terminated

    139: MT-DIST-28: terminated

    …snip…

    150: MT-DIST-38: terminated

    151: MT-DIST-38: terminated

     

    ---- error analysis -----

     

    114: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-26

    137: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-28

     

    ---- error analysis -----

     

    Nothing untoward in the event logs on the three nodes involved; nothing odd in the logs for ranks 114 or 137.  Furthermore, I can confirm that the rank 114 and 137 processes were up and running for over a minute before the forcible shutdown.
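
    For context, the error stack points at a blocking probe for messages from rank 0 (src=0, tag=2 on MPI_COMM_WORLD).  The sketch below is purely illustrative of that pattern and is not the actual nlpmpi.exe source; a worker sitting in MPI_Probe like this is exactly where a connection reset from a dying peer shows up as errno 10054:

        /* Illustrative sketch only -- not the nlpmpi.exe source.
           Mirrors the call in the error stack: a worker blocking in
           MPI_Probe(src=0, tag=2, MPI_COMM_WORLD) for work from rank 0. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {
                /* Rank 0 hands one item to every worker with tag 2. */
                int payload = 42;
                for (int dest = 1; dest < size; dest++)
                    MPI_Send(&payload, 1, MPI_INT, dest, 2, MPI_COMM_WORLD);
            } else {
                /* Workers block here; if rank 0 dies, the progress engine's
                   socket read fails and MPI_Probe reports the fatal error. */
                MPI_Status status;
                MPI_Probe(0, 2, MPI_COMM_WORLD, &status);

                int payload;
                MPI_Recv(&payload, 1, MPI_INT, 0, 2, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank %d received %d\n", rank, payload);
            }

            MPI_Finalize();
            return 0;
        }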

     

    Note also that we’ve shut off shared-memory communication so that the processes don’t burn cycles doing a busy-wait (“mpiexec -env MPICH_DISABLE_SHM 1 …”).
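
    For reference, the launch line has roughly this shape (152 ranks = 19 nodes x 8 processes per node; the real arguments to nlpmpi.exe are omitted here):

        mpiexec -n 152 -env MPICH_DISABLE_SHM 1 nlpmpi.exe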

     

    Any idea how to diagnose this?

     

    Thanks!

    --Chris

    Thursday, February 21, 2008 6:11 PM

Answers

  •  

    I would check the rank 0 process. Did it crash? Can you look for a crash dump? (Usually under %windir%\pchealth\ErrorRep\UserDumps.)
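
    (For example, something along these lines on the node that hosted rank 0 will list anything sitting in that default location:)

        dir /s %windir%\pchealth\ErrorRep\UserDumps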

     

    (Sometimes the v1 mpiexec does not detect the crash itself but sees the side effects first.)

     

    thanks,

    .Erez

    • Marked as answer by Chris Quirk Tuesday, May 12, 2009 10:13 PM
    Friday, February 22, 2008 5:34 PM

All replies

  •  

    Do we have any solution for this?

    I am getting this error. Reply appreciated.

     

    Thanks,

    Kathir

    Tuesday, April 1, 2008 10:14 PM
  • Did you find any crash dump?

    Thursday, April 3, 2008 4:19 AM