I am running an MPI job on an MSHPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but the job then stops suddenly partway through. The error message is shown below. I checked the line of code that "tag=273" points to, and it looks correct. Does anybody know the reason? Cheers
Cindy
****************************************
job aborted:
rank: node: exit code: message
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(192).........:
MPIC_Recv(98)...........:
MPIC_Wait(321)..........:
MPIDI_CH3I_Progress(165): handle_sock_op failed
handle_sock_read(530)...:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host. (errno 10054)
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
6: SB-NODE013: terminated
7: SB-NODE013: terminated
---- error analysis -----
4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
**********************************************************