Answered: Intermittent MPI failure

  • 21 February 2008 18:11
     
     

     

    So we’ve got an MPI job that runs fine for about 8 minutes of a 1-hour run (19-node cluster, 8 processes per node, public gigabit network only).

     

    Then, all of a sudden, we get a failure – looks like a TCP/IP problem:

    job aborted:

    rank: node: exit code: message

    0: MT-DIST-30: terminated

    1: MT-DIST-30: terminated

    …snip…

    112: MT-DIST-26: terminated

    113: MT-DIST-26: terminated

    114: MT-DIST-26: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:

    MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed

    MPIDI_CH3I_Progress(165): handle_sock_op failed

    handle_sock_read(530)...:

    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)

    115: MT-DIST-26: terminated

    116: MT-DIST-26: terminated

    …snip…

    135: MT-DIST-25: terminated

    136: MT-DIST-28: terminated

    137: MT-DIST-28: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:

    MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed

    MPIDI_CH3I_Progress(165): handle_sock_op failed

    handle_sock_read(530)...:

    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)

    138: MT-DIST-28: terminated

    139: MT-DIST-28: terminated

    …snip…

    150: MT-DIST-38: terminated

    151: MT-DIST-38: terminated

     

    ---- error analysis -----

     

    114: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-26

    137: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-28

     

    ---- error analysis -----

     

    Nothing untoward in the event logs on the three nodes involved; nothing odd in the logs for ranks 114 or 137.  Furthermore, I can confirm that the processes of rank 114 and 137 were up and running for over 1 minute before the forcible shutdown.
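
For anyone hitting a similar abort, here is a minimal sketch (not part of the original post) of pulling the useful bits out of the mpiexec output above: which ranks reported a fatal error, which MPI call they were in, which source rank they were reading from, and the socket errno. The function name `summarize_failures` and the regular expressions are illustrative assumptions based only on the log format shown in this post; other MPI versions may print something different. On this log it shows that both failing ranks were in MPI_Probe on src=0 with errno 10054 (Winsock WSAECONNRESET), which suggests looking at the peer, rank 0, rather than at ranks 114 and 137 themselves.

```python
import re

# Assumed line shapes, taken from the output quoted in this post:
#   "114: MT-DIST-26: fatal error: Fatal error in MPI_Probe: ..."
FATAL_RE = re.compile(r"^\s*(\d+):\s*(\S+):\s*fatal error: Fatal error in (\w+)")
#   "115: MT-DIST-26: terminated"
TERMINATED_RE = re.compile(r"^\s*\d+:\s*\S+:\s*terminated")
#   "MPI_Probe(src=0, tag=2, ...)" and "(errno 10054)"
SRC_RE = re.compile(r"\bsrc=(\d+)")
ERRNO_RE = re.compile(r"\(errno (\d+)\)")


def summarize_failures(log_text):
    """Return one dict per rank that reported a fatal error."""
    failures = []
    current = None  # the failure whose error stack we are currently reading
    for line in log_text.splitlines():
        fatal = FATAL_RE.match(line)
        if fatal:
            current = {
                "rank": int(fatal.group(1)),
                "node": fatal.group(2),
                "call": fatal.group(3),
                "src": None,
                "errno": None,
            }
            failures.append(current)
            continue
        if TERMINATED_RE.match(line):
            current = None  # a plain "terminated" line ends the error stack
            continue
        if current is None:
            continue
        src = SRC_RE.search(line)
        if src and current["src"] is None:
            current["src"] = int(src.group(1))
        err = ERRNO_RE.search(line)
        if err and current["errno"] is None:
            current["errno"] = int(err.group(1))
    return failures


# Excerpt of the output from this post, used as sample input.
SAMPLE_LOG = """\
113: MT-DIST-26: terminated
114: MT-DIST-26: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
115: MT-DIST-26: terminated
136: MT-DIST-28: terminated
137: MT-DIST-28: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
138: MT-DIST-28: terminated
"""

if __name__ == "__main__":
    for f in summarize_failures(SAMPLE_LOG):
        print(f["rank"], f["node"], f["call"], "src:", f["src"], "errno:", f["errno"])
```

If every failing rank names the same source rank, the node hosting that rank is the first place to look.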

     

    Note also that we’ve shut off shared-memory communication so that the processes don’t burn cycles busy-waiting (“mpiexec -env MPICH_DISABLE_SHM 1 …”).

     

    Any idea how to diagnose this?

     

    Thanks!

    --Chris

All Replies

  • 22 February 2008 17:34
     
     Answer

     

    I would check the rank 0 process. Did it crash? Can you look for a crash dump (usually under %windir%\pchealth\ErrorRep\UserDumps)?

     

    (Sometimes the v1 mpiexec does not detect the crash itself but sees its side effects first.)
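
A minimal sketch of that check, to be run on the node that hosted rank 0 (MT-DIST-30 in the output above). It only lists files in the dump directory mentioned above, newest first. The helper name `find_crash_dumps` and the idea of filtering by the substring "nlpmpi" are assumptions for illustration; the actual dump file names depend on the Windows error-reporting configuration, so pass an empty hint to see everything.

```python
import os
from pathlib import Path


def find_crash_dumps(dump_dir, name_hint=""):
    """List files in dump_dir whose names contain name_hint, newest first.

    Returns an empty list if the directory does not exist.
    """
    root = Path(dump_dir)
    if not root.is_dir():
        return []
    hint = name_hint.lower()
    hits = [p for p in root.iterdir() if p.is_file() and hint in p.name.lower()]
    return sorted(hits, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # %windir%\pchealth\ErrorRep\UserDumps, as given in the reply above.
    # The fallback "C:\Windows" is an assumption for when %windir% is unset.
    windir = os.environ.get("windir", r"C:\Windows")
    dump_dir = Path(windir) / "pchealth" / "ErrorRep" / "UserDumps"
    # "nlpmpi" is a hypothetical filter based on the executable name in the log.
    for dump in find_crash_dumps(dump_dir, "nlpmpi"):
        print(dump, dump.stat().st_size, "bytes")
```

A dump whose timestamp is close to the time of the abort would support the theory that rank 0 crashed first and the other ranks only saw the closed connection.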

     

    Thanks,

    .Erez

    • Marked as Answer by Chris Quirk 12 May 2009 22:13
  • 01 April 2008 22:14
     
     

     

    Do we have any solution for this?

    I am getting this error. A reply would be appreciated.

     

    Thanks,

    Kathir

  • 03 April 2008 4:19
     
     

    Did you find any crash dump?