Intermittent MPI failure
-
21 February 2008 18:11
We have an MPI job that runs fine for about the first 8 minutes of a 1 hr run (19-node cluster, 8 procs per node, public gigabit network only).
Then, all of a sudden, it fails with what looks like a TCP/IP problem:
job aborted:
rank: node: exit code: message
0: MT-DIST-30: terminated
1: MT-DIST-30: terminated
…snip…
112: MT-DIST-26: terminated
113: MT-DIST-26: terminated
114: MT-DIST-26: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
115: MT-DIST-26: terminated
116: MT-DIST-26: terminated
…snip…
135: MT-DIST-25: terminated
136: MT-DIST-28: terminated
137: MT-DIST-28: fatal error: Fatal error in MPI_Probe: Other MPI error, error stack:
MPI_Probe(112)..........: MPI_Probe(src=0, tag=2, MPI_COMM_WORLD, status=0x002DFDF8) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
138: MT-DIST-28: terminated
139: MT-DIST-28: terminated
…snip…
150: MT-DIST-38: terminated
151: MT-DIST-38: terminated
---- error analysis -----
114: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-26
137: mpi has detected a fatal error and aborted nlpmpi.exe run on MT-DIST-28
---- error analysis -----
Nothing untoward in the event logs on the three nodes involved, and nothing odd in the logs for ranks 114 or 137. I can also confirm that the processes of rank 114 and 137 were up and running for over 1 min before the forcible shutdown.
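For anyone triaging a long job output like the one above, here is a minimal sketch that pulls out only the ranks reporting a fatal error, together with the node and the socket errno. It assumes the output has exactly the "rank: node: message" layout shown in this post; the function name find_fatal_ranks is made up for illustration and is not part of any MPI tooling.

```python
import re

# "114: MT-DIST-26: fatal error: ..." (layout taken from the output in this post)
FATAL_LINE = re.compile(r"^(\d+): (\S+): fatal error: (.*)$")
# Any other line that starts with "<rank>: " ends the current error stack
RANK_LINE = re.compile(r"^\d+: ")
# "(errno 10054)" at the end of an error-stack line
ERRNO = re.compile(r"\(errno (\d+)\)")


def find_fatal_ranks(log_text):
    """Return one dict per rank that reported a fatal error.

    Each dict has the keys rank, node, message and errno
    (errno stays None if the error stack does not mention one).
    """
    failures = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        fatal = FATAL_LINE.match(line)
        if fatal:
            current = {
                "rank": int(fatal.group(1)),
                "node": fatal.group(2),
                "message": fatal.group(3),
                "errno": None,
            }
            failures.append(current)
            continue
        if RANK_LINE.match(line):
            # A different rank's line: the previous error stack is over.
            current = None
            continue
        if current is not None:
            found = ERRNO.search(line)
            if found:
                current["errno"] = int(found.group(1))
    return failures
```

Feeding it the saved job output (for example, the contents of a text file the output was redirected to) gives a short list to compare against the node names, which makes it easier to see whether the failing ranks share a node or a peer.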
Note also that we’ve shut off the shared memory communication so that the processes don’t burn cycles doing a busywait (“mpiexec -env MPICH_DISABLE_SHM 1 …”).
Any idea how to diagnose this?
Thanks!
--Chris
All Replies
-
22 February 2008 17:34
I would check the rank 0 process. Did it crash? Can you look for a crash dump? (Usually under %windir%\pchealth\ErrorRep\UserDumps.)
(Sometimes the v1 mpiexec does not detect the crash but sees the side effects first.)
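A minimal sketch for checking that folder on the node that hosted rank 0, newest files first. The directory is the one named above; the function names and the C:\Windows fallback used when %windir% is not set are assumptions made for illustration only.

```python
import os
from pathlib import Path


def default_dump_dir():
    """Build %windir%\\pchealth\\ErrorRep\\UserDumps from the environment.

    Falls back to C:\\Windows if %windir% is not set (an assumption).
    """
    windir = os.environ.get("windir", r"C:\Windows")
    return Path(windir) / "pchealth" / "ErrorRep" / "UserDumps"


def list_dumps(dump_dir=None):
    """Return the files in the dump folder, most recently modified first.

    Returns an empty list if the folder does not exist.
    """
    folder = Path(dump_dir) if dump_dir is not None else default_dump_dir()
    if not folder.is_dir():
        return []
    files = [p for p in folder.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

Calling list_dumps() on the rank 0 node and comparing the modification times with the time the job aborted should show whether a dump was written around the failure.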
thanks,
.Erez
- Marked as Answer by Chris Quirk 12 May 2009 22:13
-
01 April 2008 22:14
Is there a solution for this?
I am getting this error. A reply would be appreciated.
Thanks,
Kathir
-
03 April 2008 4:19
Did you find any crash dump?