A sudden interruption of the MPI job
11 June 2009 10:46
I am running an MPI job on an MSHPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but the job suddenly stops partway through. The error message is shown below. I checked the line of code that "tag=273" points to, and it looks correct. Does anybody know the reason? Cheers,
Cindy
****************************************
job aborted:
rank: node: exit code: message
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
6: SB-NODE013: terminated
7: SB-NODE013: terminated
---- error analysis -----
4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
**********************************************************
All Replies
17 June 2009 21:30
Hi Cindy,
It seems that rank 4 exited without calling finalize, with exit code 157. That process could have crashed or called exit().
Which version of MS-MPI / Windows HPC are you running? (It looks like CCS version 1.)
Thanks,
.Erez
Marked as Answer by Don Pattee (Moderator), 09 December 2009 6:30
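For anyone reading a similar abort report: the socket errors (errno 10054) on ranks 0 to 3 are most likely secondary symptoms, and the rank to look at first is the one flagged "process exited without calling finalize". Below is a minimal Python sketch for pulling that out of a long report. It assumes the per-rank lines keep the `rank: node: message` layout shown in the post above. `triage`, `RANK_LINE` and `SAMPLE` are hypothetical names made up for this illustration and are not part of MS-MPI.

```python
import re

# Per-rank summary lines look like:
#   "4: SB-NODE013: 157: process exited without calling finalize"
# (assumption: rank, node name, then a free-form message, separated by ": ")
RANK_LINE = re.compile(r"^(\d+): ([\w-]+): (.*)$")


def triage(log_text):
    """Group ranks by how they ended, based on the per-rank summary lines."""
    report = {"no_finalize": [], "comm_error": [], "terminated": []}
    for line in log_text.splitlines():
        m = RANK_LINE.match(line.strip())
        if not m:
            continue  # error-stack detail lines and separators are skipped
        rank, node, message = int(m.group(1)), m.group(2), m.group(3)
        if "exited without calling finalize" in message:
            # The message starts with the process exit code, e.g. "157: ..."
            exit_code = int(message.split(":", 1)[0].strip())
            report["no_finalize"].append((rank, node, exit_code))
        elif message.startswith("fatal error"):
            report["comm_error"].append((rank, node))
        elif message.startswith("terminated"):
            report["terminated"].append((rank, node))
    return report


# Shortened excerpt of the report from the original post
SAMPLE = """\
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
"""

if __name__ == "__main__":
    result = triage(SAMPLE)
    for rank, node, code in result["no_finalize"]:
        print(f"rank {rank} on {node} exited early with code {code}")
```

The ranks listed under `no_finalize` are the ones worth debugging first (here, rank 4 on SB-NODE013). The entries under `comm_error` only tell you which ranks were still waiting on a connection when it was closed.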
17 August 2012 3:08
I have a question. I need help!
Thanks!
I am running an MPI job on a cluster. It runs fine at the beginning of the calculation when only one computer is used.