How to implement fault tolerance on MPI app?

  • Question

  • When we write C programs with MPI to run on HPC, we know that if any process in the task fails, the whole task fails.

    So my question is: is it possible to implement fault tolerance in an MPI program? For example, if a process in the task fails because of a temporary network failure or a program bug, could the task detect the failure and survive, or even start a new process to replace the failed one?

    As far as I know, an MPI program has no way to achieve this. It would be very useful if this could be done in MPI.

    Tuesday, January 11, 2011 2:10 PM

All replies

  • You're right. The MPI standard doesn't provide fault tolerance, so you need to implement it yourself. Some form of checkpointing is probably what you are looking for: periodically save each process's state to disk, and on restart check for a saved state and continue from it.
    Friday, February 11, 2011 6:23 PM
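
    A minimal sketch of the checkpointing idea described in the reply above, written in plain C so it compiles without an MPI installation. Everything named here (`State`, `checkpoint_path`, `save_checkpoint`, `load_checkpoint`, `run_steps`, and the `checkpoint_rank<N>.bin` file name) is hypothetical and not part of MPI. It assumes each process keeps its own checkpoint file, that the state is a small fixed-size struct, and that `rename` replaces an existing file as it does on POSIX systems; on Windows a call such as `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING` would likely be needed instead.

    ```c
    #include <stdio.h>

    /* Hypothetical application state: the loop counter plus whatever
       partial result must survive a restart. */
    typedef struct {
        int step;
        double partial_sum;
    } State;

    /* Build a per-rank file name so processes never share a checkpoint file. */
    static void checkpoint_path(int rank, char *buf, size_t len, const char *suffix) {
        snprintf(buf, len, "checkpoint_rank%d.bin%s", rank, suffix);
    }

    /* Write to a temporary file, then rename it over the previous checkpoint,
       so a crash in the middle of a write leaves the old checkpoint intact.
       Returns 0 on success, nonzero on failure. */
    static int save_checkpoint(int rank, const State *s) {
        char tmp_path[64], final_path[64];
        checkpoint_path(rank, tmp_path, sizeof tmp_path, ".tmp");
        checkpoint_path(rank, final_path, sizeof final_path, "");

        FILE *f = fopen(tmp_path, "wb");
        if (!f) return -1;
        size_t written = fwrite(s, sizeof *s, 1, f);
        if (fclose(f) != 0 || written != 1) {
            remove(tmp_path);
            return -1;
        }
        return rename(tmp_path, final_path);
    }

    /* Load the last checkpoint for this rank. Returns 0 if one was loaded,
       -1 if none exists or it is unreadable (the state is then left untouched). */
    static int load_checkpoint(int rank, State *s) {
        char path[64];
        checkpoint_path(rank, path, sizeof path, "");

        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        State loaded;
        size_t got = fread(&loaded, sizeof loaded, 1, f);
        fclose(f);
        if (got != 1) return -1;
        *s = loaded;
        return 0;
    }

    /* Resume from the last saved step (or step 0 if there is no checkpoint),
       saving a new checkpoint every `interval` steps. */
    static double run_steps(int rank, int total_steps, int interval) {
        State s = {0, 0.0};
        load_checkpoint(rank, &s);
        while (s.step < total_steps) {
            s.partial_sum += (double)s.step;   /* stand-in for the real work */
            s.step++;
            if (s.step % interval == 0) save_checkpoint(rank, &s);
        }
        return s.partial_sum;
    }
    ```

    In an actual MPI program, `rank` would come from `MPI_Comm_rank`, and `run_steps` would be called between `MPI_Init` and `MPI_Finalize`. When the job fails and is resubmitted, each process picks up from its last saved step. If the processes exchange data between steps, they would also need to checkpoint at the same step (for example, right after an `MPI_Barrier`) so that the restored states are consistent with each other.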