A sudden interruption of the MPI job

  • Question

  • I am running an MPI job on an MS HPC cluster. It runs fine at the beginning of the calculation and the output is written correctly, but then the job suddenly stops in the middle. The error message is shown below. I checked the code line that "tag=273" points to, and it is correct. Does anybody know the reason? Cheers

    Cindy

    ****************************************
    job aborted:
    rank: node: exit code: message
    0: SB-NODE011: fatal error: Fatal error in MPI_Recv: Other MPI error, error stack:
    MPI_Recv(179)...........: MPI_Recv(buf=0x0000000001A4F408, count=1, MPI_INTEGER, src=4, tag=273, MPI_COMM_WORLD, status=0x0000000000712400) failed
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    1: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    2: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    3: SB-NODE011: fatal error: Fatal error in MPI_Bcast: Other MPI error, error stack:
    MPI_Bcast(791)..........: MPI_Bcast(buf=0x0000000001A4F408, count=1, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
    MPIR_Bcast(192).........:
    MPIC_Recv(98)...........:
    MPIC_Wait(321)..........:
    MPIDI_CH3I_Progress(165): handle_sock_op failed
    handle_sock_read(530)...:
    ReadFailed(1518)........: An existing connection was forcibly closed by the remote host.  (errno 10054)
    4: SB-NODE013: 157: process exited without calling finalize
    5: SB-NODE013: terminated
    6: SB-NODE013: terminated
    7: SB-NODE013: terminated

    ---- error analysis -----

    4: sotoncaa.exe ended prematurely and may have crashed on SB-NODE013
    **********************************************************

    Thursday, June 11, 2009 10:46 AM
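
    For reference, the parameters in the error stack describe an exchange like the hypothetical C sketch below. This is not the actual sotoncaa.exe source (which, given MPI_INTEGER in the log, is probably Fortran): only count=1, src=4, tag=273 and the broadcast from root 0 come from the log; everything else, including the C binding and MPI_INT, is assumed for illustration. Run with at least 5 ranks, e.g. mpiexec -n 8; if rank 4 dies before sending, rank 0's receive fails with a socket read error like the one above and the remaining ranks fail inside MPI_Bcast.

        /* Hypothetical reconstruction of the exchange in the error stack.
         * Not the actual application code: only count=1, src=4, tag=273 and
         * the broadcast from root 0 come from the log; the rest is assumed. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, value = 0;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                /* Matches the failing call: count=1, source=4, tag=273 */
                MPI_Recv(&value, 1, MPI_INT, 4, 273, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank 4\n", value);
            } else if (rank == 4) {
                value = 42;  /* placeholder payload */
                MPI_Send(&value, 1, MPI_INT, 0, 273, MPI_COMM_WORLD);
            }

            /* The broadcast the other ranks were blocked in when rank 4 died */
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }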

Answers

  • Hi Cindy,

    It seems that rank 4 exited without calling MPI_Finalize, with exit code 157. That process could have crashed or called exit().

    What version of MS-MPI / Windows HPC are you running? (It looks like CCS v1.)

    Thanks,
    .Erez
    Wednesday, June 17, 2009 9:30 PM
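
    Following up on this diagnosis, the minimal sketch below (not from the thread; do_work() and its failure handling are hypothetical stand-ins for the per-rank computation) shows one way to make such a failure easier to spot: if a rank detects an error, call MPI_Abort with a distinctive code instead of calling exit() or crashing silently. That way the job report shows the failing rank's error code rather than "process exited without calling finalize" on one rank and "connection forcibly closed" (errno 10054) on the ranks still waiting in MPI_Recv/MPI_Bcast.

        /* Minimal sketch, not the application's actual code: do_work() is a
         * hypothetical stand-in for the per-rank computation. */
        #include <mpi.h>
        #include <stdio.h>

        /* Returns nonzero on failure; name and behaviour are assumed. */
        static int do_work(int rank) { (void)rank; return 0; }

        int main(int argc, char **argv)
        {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (do_work(rank) != 0) {
                /* Abort the whole job with a recognisable error code instead
                 * of exiting: leaving without MPI_Finalize is what produces
                 * "process exited without calling finalize" on this rank and
                 * errno 10054 on the ranks still blocked in communication. */
                fprintf(stderr, "rank %d: computation failed, aborting job\n", rank);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }

            MPI_Finalize();   /* every surviving rank must reach this */
            return 0;
        }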

All replies

  • I have a question and need help!

    Thanks!

    I am running an MPI job on a cluster. It runs fine at the beginning of the calculation when I use only one computer.

    Friday, August 17, 2012 3:08 AM