I am getting the following MPI errors:
job aborted:
[ranks] message
[0-63] terminated
[64] fatal error
Fatal error in MPI_Dist_graph_create_adjacent: Other MPI error, error stack:
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD, indegree=1, sources=0x00000088F94FECD0, sourceweights=0x0000000000000001, outdegree=1, destinations=0x00000088F94FECD0, destweights=0x0000000000000001, info=469762048, reorder=1) failed
[ch3:sock] failed to connnect to remote process 008f86ee-2a16-496f-94d1-d1c44f786252:96
unable to connect to 192.168.1.104 on port 50614, exhausted all endpoints
unable to connect to 192.168.1.104 on port 50614, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
[65-127] terminated
---- error analysis -----
[64] on MTC-03
mpi has detected a fatal error and aborted alix
---- error analysis -----
The particular MPI function call varies. Above it is MPI_Dist_graph_create_adjacent, but I have seen the same failure in MPI_Win_allocate as well. The "failed to connnect to remote process" error is the same in both cases.
The error only seems to appear once more than a certain number of cores are allocated to the job. For example, each node has 16 cores (32 hardware threads). When I allocate 64 cores, the job runs fine on two nodes. When I allocate 128 cores, I get the errors shown above.
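
Since errno 10060 is a TCP connection timeout, one thing that may help narrow this down is checking plain TCP reachability between the nodes, independent of MPI. The following is only a minimal sketch of such a check: the function name can_connect is hypothetical, and it assumes Python is available on the nodes. It shows whether a connection to a given host and port can be opened within a timeout.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; it is not part of MPI.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling can_connect("192.168.1.104", 50614) from the node that reported the failure would test the address and port from the log above. The MPI processes appear to choose their ports dynamically, so that exact port is probably only meaningful while the job is running. A port that is known to be listening on the target node is a more reliable test.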