Question: A sudden interruption of the MPI job

  • June 11, 2009 10:46 Cindy.W
     

    I am running an MPI job on an MSHPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but the job suddenly stops partway through. The error message is shown below. I checked the line of code that "tag=273" points to, and it looks correct. Does anybody know the reason? Cheers

    Cindy

    ****************************************
    job aborted:
    rank: node: exit code: message
    0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
    MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    4: SB-NODE013: 157: process exited without calling finalize
    5: SB-NODE013: terminated
    6: SB-NODE013: terminated
    7: SB-NODE013: terminated

    ---- error analysis -----

    4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
    **********************************************************

All replies

  • June 17, 2009 21:30 LioMSFT
     
    Hi Cindy,

    It seems that rank 4 exited without calling MPI_Finalize, with exit code 157. That process could have crashed or called exit().

    Which version of MS-MPI / Windows HPC are you running? (It looks like CCS version 1.)

    thanks,
    .Erez
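
For longer abort reports, the reading in the reply above can be automated. The following is only a rough sketch: it assumes the per-rank summary lines follow the "rank: node: exit code: message" layout printed in the report header, and it treats any rank that "exited without calling finalize" as the likely origin, with the "fatal error" and "terminated" ranks as likely bystanders whose connections were dropped as a result. The function name `classify_ranks` and the abridged `SAMPLE` text are made up for illustration and are not part of MS-MPI.

```python
import re

# Per-rank summary lines look like "<rank>: <node>: <rest>" (assumed layout,
# taken from the "rank: node: exit code: message" header of the report).
RANK_LINE = re.compile(r"^\s*(\d+):\s*([^:\s]+):\s*(.*)$")

# Abridged excerpt of the report from the question, for illustration only.
SAMPLE = """\
0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
4: SB-NODE013: 157: process exited without calling finalize
5: SB-NODE013: terminated
4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
"""


def classify_ranks(report):
    """Separate ranks that exited early from ranks that failed afterwards."""
    origin, bystanders = [], []
    for line in report.splitlines():
        match = RANK_LINE.match(line)
        if not match:
            continue  # error-stack lines and the analysis section are skipped
        rank, node, rest = int(match.group(1)), match.group(2), match.group(3).strip()
        if "exited without calling finalize" in rest:
            # The exit code, when present, precedes the message.
            code = re.match(r"(\d+):", rest)
            origin.append((rank, node, int(code.group(1)) if code else None))
        elif rest.startswith("fatal error") or rest == "terminated":
            bystanders.append((rank, node))
    return {"origin": origin, "bystanders": bystanders}


if __name__ == "__main__":
    result = classify_ranks(SAMPLE)
    print("likely origin:", result["origin"])
    print("likely bystanders:", result["bystanders"])
```

Applied to the full report in the question, this should point at rank 4 on SB-NODE013, which is where debugging of sotoncaa.exe would start. The script only narrows down which rank to look at; it does not say why that process exited.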