How can MPI keep running as usual when some of the processes are terminated?

  • Question

  • I know MPI_Barrier() can guarantee synchronization, but when one of the hosts hits an error, none of the other hosts can continue to run. What I want is for the remaining processes to keep working even if some of the processes fail. Does anyone know whether MPI provides such a feature?

    Very grateful.

    Tuesday, April 3, 2018 7:30 AM

All replies

  • MPI processes are generally created at startup and run for the entire job. If one of the processes exits early, the entire MPI job also fails and exits.

    MPI supports a dynamic process model (DPM), which allows processes to be created and terminated after an MPI program has started. Can you make use of DPM for your use case? See Section 10.2 of the MPI 3.1 report: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf

    Wednesday, April 4, 2018 4:51 PM
  • Dear JithinJos,

    Sincere thanks, but this seems a bit different from what I want to do. What I hope to do is this: suppose part of the cluster stops working, for example when a host suddenly loses its network connection. How can I make the other hosts keep working as usual, or at least be notified which host has been disconnected from the network?

    Friday, April 6, 2018 6:04 AM