Question: A sudden interruption of the MPI job

  • Thursday, 11 June, 2009 10:46 Cindy.W
     

    I am running an MPI job on an MSHPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but the job suddenly stopped in the middle. The error message is below. I checked the line of code that "tag=273" points to, and it is correct. Does anybody know the reason? Cheers

    Cindy

    ****************************************
    job aborted:
    rank: node: exit code: message
    0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
    MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    4: SB-NODE013: 157: process exited without calling finalize
    5: SB-NODE013: terminated
    6: SB-NODE013: terminated
    7: SB-NODE013: terminated

    ---- error analysis -----

    4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
    **********************************************************

All replies

  • Wednesday, 17 June, 2009 21:30 LioMSFT
     
    Hi Cindy,

    It seems that rank 4 exited without calling MPI_Finalize, with exit code 157. That process could have crashed or called exit().

    Which version of MS-MPI / Windows HPC are you running? (It looks like CCS version 1.)

    thanks,
    .Erez
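
    For anyone reading a similar abort summary: the per-rank lines can be scanned for the rank that exited without calling finalize, since the socket errors on the other ranks (errno 10054) are consistent with them losing their connection to that process. Below is a minimal sketch in Python, assuming the "rank: node: exit code: message" layout shown in the question. The names LOG, summarize and ranks_without_finalize are made up for illustration and are not part of any MS-MPI tool.

```python
import re

# Excerpt of the job-abort summary quoted in the question (error stacks omitted).
LOG = """\
job aborted:
rank: node: exit code: message
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
6: SB-NODE013: terminated
7: SB-NODE013: terminated
"""

# Assumed line layout: "<rank>: <node>: <rest>", where <rest> is either
# "<exit code>: <message>" or a bare message such as "terminated".
RANK_LINE = re.compile(r"^\s*(\d+):\s*([^:\s]+):\s*(.*)$")


def summarize(log):
    """Return one (rank, node, message) tuple per rank line in the log."""
    rows = []
    for line in log.splitlines():
        m = RANK_LINE.match(line)
        if m:
            rows.append((int(m.group(1)), m.group(2), m.group(3).strip()))
    return rows


def ranks_without_finalize(log):
    """Return the rank lines reporting an exit without calling finalize."""
    return [row for row in summarize(log)
            if "without calling finalize" in row[2]]


if __name__ == "__main__":
    for rank, node, msg in ranks_without_finalize(LOG):
        print(f"rank {rank} on {node}: {msg}")
```

    On the log above this singles out rank 4 on SB-NODE013, so the place to look is whatever that process was doing when it exited, rather than the MPI_Recv call with tag=273 on rank 0.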