A sudden interruption of the MPI job
<p>I am running an MPI job on an MSHPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but the job suddenly stopped partway through. The error message is shown below. I checked the line of code that the &quot;tag=273&quot; points to, and it is correct. Does anybody know the reason? Cheers<br/><br/>Cindy<br/><br/>****************************************<br/>job aborted:<br/>rank: node: exit code: message<br/>0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:<br/>MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed<br/>MPIDI_CH3I_Progress(165): handle_sock_op failed<br/>handle_sock_read(530)...: <br/>ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)<br/>1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:<br/>MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed<br/>MPIR_Bcast(192).........: <br/>MPIC_Recv(98)...........: <br/>MPIC_Wait(321)..........: <br/>MPIDI_CH3I_Progress(165): handle_sock_op failed<br/>handle_sock_read(530)...: <br/>ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)<br/>2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:<br/>MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed<br/>MPIR_Bcast(192).........: <br/>MPIC_Recv(98)...........: <br/>MPIC_Wait(321)..........: <br/>MPIDI_CH3I_Progress(165): handle_sock_op failed<br/>handle_sock_read(530)...: <br/>ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  
(errno 10054)<br/>3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:<br/>MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed<br/>MPIR_Bcast(192).........: <br/>MPIC_Recv(98)...........: <br/>MPIC_Wait(321)..........: <br/>MPIDI_CH3I_Progress(165): handle_sock_op failed<br/>handle_sock_read(530)...: <br/>ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)<br/>4: SB-NODE013: 157: process exited without calling finalize<br/>5: SB-NODE013: terminated<br/>6: SB-NODE013: terminated<br/>7: SB-NODE013: terminated</p> <p>---- error analysis -----</p> <p>4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013<br/>**********************************************************</p>
<p>Cindy.W (http://social.microsoft.com/Profile/en-US/?user=Cindy.W), Thu, 11 Jun 2009 10:46:39 Z<br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/0791b8a1-85e6-4271-9eef-5f3ec58e86b1#0791b8a1-85e6-4271-9eef-5f3ec58e86b1</p>
<p>Lio (http://social.microsoft.com/Profile/en-US/?user=Lio), Wed, 17 Jun 2009 21:30:20 Z<br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/0791b8a1-85e6-4271-9eef-5f3ec58e86b1#28df842a-3dfa-4afc-b838-3f2774c0b468</p>
<p>Hi Cindy,<br/><br/>It seems that rank 4 exited without calling finalize, with exit code 157. That process could have crashed or called exit().<br/><br/>Which version of MSMPI/Windows HPC are you running? (It looks like CCS 1.)<br/><br/>Thanks,<br/>.Erez</p>
<p>© 2009 Microsoft Corporation. All rights reserved.</p>
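<p>To make the reasoning in the reply concrete: ranks 0 to 3 only report that their connection was closed (errno 10054), while rank 4 is the one that exited without calling finalize, so rank 4 is the place to look. Below is a minimal Python sketch of that triage. The function name triage, the regular expression, and the shortened SAMPLE text are illustrative assumptions based on the log format quoted in this thread; this is not an MSMPI tool, and it expects the log as plain text with one entry per line.</p>

```python
import re

# Per-rank summary lines look like "<rank>: <node>: <message>" in the log above.
# Stack-trace lines (e.g. "MPI_Recv(179)....") do not start with a number and are skipped.
RANK_LINE = re.compile(r"^(\d+): ([\w-]+): (.*)$")


def triage(log_text):
    """Separate the rank(s) that exited early from the ranks that merely noticed it."""
    origins = []  # (rank, node, exit_code) for ranks that exited without finalize
    victims = []  # (rank, node) for ranks that failed or were terminated afterwards
    for line in log_text.splitlines():
        match = RANK_LINE.match(line.strip())
        if not match:
            continue
        rank, node, message = int(match.group(1)), match.group(2), match.group(3)
        if "exited without calling finalize" in message:
            # The exit code, when present, precedes the message: "157: process exited ..."
            code = message.split(":", 1)[0].strip()
            origins.append((rank, node, int(code) if code.isdigit() else None))
        elif message.startswith("fatal error") or message == "terminated":
            victims.append((rank, node))
    return origins, victims


# Shortened excerpt of the log from this thread (ranks 1-3 and 6-7 omitted).
SAMPLE = """\
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
"""

if __name__ == "__main__":
    origins, victims = triage(SAMPLE)
    print("look here first:", origins)
    print("secondary failures:", victims)
```

<p>The ranks listed as secondary failures are usually not worth debugging on their own: their MPI_Recv and MPI_Bcast calls failed because the peer process on SB-NODE013 went away.</p>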