MPI_FINALIZE fails with error: rank 1 unable to connect to rank 0

  • Question

  • Can someone please help me with the error below?

    job aborted:
    [ranks] message

    [0] terminated

    [1] fatal error
    Fatal error in MPI_Comm_dup: Other MPI error, error stack:
    MPI_Comm_dup(136)..............: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x012B5C34) failed
    MPIR_Comm_copy(500)............:
    MPIR_Get_contextid(248)........:
    PMPI_Allreduce(617)............: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x002AEFD0, count=32, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
    MPIR_Allreduce(219)............:
    MPIC_Sendrecv(170).............:
    MPID_Send(68)..................:
    MPIDI_CH3_SendEager(85)........:
    MPIDI_CH3I_VC_post_connect(376):
    MPIDI_CH3I_Sock_connect(304)...: [ch3:sock] rank 1 unable to connect to rank 0 using business card <port=53741 description="10.101.107.15 10.203.65.28 WPINVHPCP01 " shm_host=WPINVHPCP01 shm_queue=1812:920 >
    MPIDU_Sock_post_connect(1143)..:
    save_valid_endpoints(1072).....: unable to connect to 10.101.107.15 10.203.65.28 WPINVHPCP01  on port 53741, no endpoint matches the netmask 10.101.72.0/255.255.255.0

    ---- error analysis -----

    Thursday, February 12, 2015 5:53 PM

All replies

  • You seem to have specified a netmask of 10.101.72.0/255.255.255.0 (e.g. via the MPICH_NETMASK environment variable; note that HPC Pack automatically sets the subnet to the Application network via the CCP_MPI_NETMASK environment variable). The target machine (WPINVHPCP01) does not have a network interface on that subnet, so the connection is failing. Check the network configuration of WPINVHPCP01 to verify the netmask is configured correctly, or change the netmask so that MPI communication goes over a different subnet (see the command sketch after this reply).

    Let us know how that goes,
    -Fab

    Tuesday, February 17, 2015 5:36 PM
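
    As a minimal sketch of the second option, the netmask can be overridden for a run by passing MPICH_NETMASK to the ranks. The subnet value (10.203.65.0/255.255.255.0, derived from the second address in the business card), the second host name, and the executable name are assumptions for illustration only; adjust them to the subnets that both nodes actually share:

    rem Check which subnets the node actually has interfaces on
    ipconfig

    rem Override the netmask for this run so MPI traffic uses a shared subnet
    rem (subnet, second host, and app name below are placeholders)
    mpiexec -hosts 2 WPINVHPCP01 1 WPINVHPCP02 1 -env MPICH_NETMASK 10.203.65.0/255.255.255.0 myapp.exe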