MPI fatal error

  • Question

  • Getting the following MPI errors:

    job aborted:
    [ranks] message

    [0-63] terminated

    [64] fatal error
    Fatal error in MPI_Dist_graph_create_adjacent: Other MPI error, error stack:
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=1, sources=0x00000088F94FECD0, sourceweights=0x0000000000000001, outdegree=1, destinations=0x00000088F94FECD0, destweights=0x0000000000000001, info=469762048, reorder=1) failed
    [ch3:sock] failed to connnect to remote process 008f86ee-2a16-496f-94d1-d1c44f786252:96
    unable to connect to 192.168.1.104 on port 50614, exhausted all endpoints
    unable to connect to 192.168.1.104 on port 50614, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  (errno 10060)

    [65-127] terminated

    ---- error analysis -----

    [64] on MTC-03
    mpi has detected a fatal error and aborted alix

    ---- error analysis -----

    The particular MPI function call varies. Above it's in MPI_Dist_graph_create_adjacent, but I've seen it in MPI_Win_allocate as well. The "failed to connnect to remote process" error is the same in each case.

    The error only seems to appear once a certain number of cores is allocated to the job. For example, each node has 16 cores (32 hardware threads); when I allocate 64 cores, the job runs fine across two nodes, but when I allocate 128 cores, I get the errors above.

    Wednesday, April 17, 2019 7:04 PM

All replies

  • BTW, I can ping the nodes in question. I've even logged into them via remote desktop, and everything about them seems operational.
    Wednesday, April 17, 2019 7:06 PM
  • Looks like a node connectivity/firewall issue.
    Are you able to launch a simple program on all of the nodes used for the 128-core job? (This would help isolate the issue.)

    Also, try running in verbose mode (-d 2) to get additional details.

    Friday, April 19, 2019 8:03 PM
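Since the log shows a plain TCP connect timeout to a peer node (errno 10060, which on Windows typically means a firewall is silently dropping packets rather than refusing the connection), one MPI-independent way to isolate a firewall problem is to probe the reported host/port from another node. A minimal sketch, with a hypothetical `can_connect` helper; the host and port below come from the error log, but note the port is ephemeral, so you would probe whatever port a currently failing job reports:

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Attempt a TCP connection to host:port within `timeout` seconds.

    Returns True if the handshake completes, False on refusal or
    timeout. A timeout (rather than an immediate refusal) usually
    points at a firewall dropping traffic, matching errno 10060.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Host/port pair taken from the error log above (illustrative only).
    print(can_connect("192.168.1.104", 50614))
```

If this returns False from one node while a local probe on the target node succeeds, the traffic is being blocked in transit, which narrows the problem to firewall rules rather than MPI itself.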
  • Hi Jithin,

    It appears the problem was a Windows licensing issue. Windows Server Essentials 2016 cannot be joined to a domain unless it is the domain controller, and the event log was full of errors with this message. We reinstalled the cluster with Windows Server Standard 2016, and since then we cannot reproduce the problem. I've been able to allocate many more cores than mentioned above, and the jobs are running and completing successfully.

    Nate

    Tuesday, April 23, 2019 8:41 PM