MS-MPI MPI_Barrier: sometimes hangs indefinitely, sometimes doesn't

  • Question

  • I'm using the MPI.NET library, which is a .NET wrapper around the msmpi.dll.

    I've recently moved my application to a bigger cluster (more COMPUTE-NODES). I've started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job will complete, the rest of the time it'll hang. I've seen it happen with Scatter, Broadcast, and Barrier.

    I've put an MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application, and created trace logs (using the MPIEXEC.exe /trace switch).

    C# code snippet:

    static void Main(string[] args)
    {
        var hostName = System.Environment.MachineName;
        Logger.Trace($"Program.Main entered on {hostName}");
        string[] mpiArgs = null;
        MPI.Environment myEnvironment = null;
        try
        {
            Logger.Trace($"Trying to instantiated on MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
            myEnvironment = new MPI.Environment(ref mpiArgs);
            Logger.Trace($"Is currently initialized?{MPI.Environment.Initialized}. {hostName} is waiting at Barrier... ");
            Communicator.world.Barrier(); // CODE HANGS HERE!
            Logger.Trace($"{hostName} is past Barrier");
        }
        catch (Exception envEx)
        {
            Logger.Error(envEx, "Could not instantiate MPI.Environment object");
        }
    
        // rest of implementation here...
    
    }

    I can see msmpi.dll's MPI_Barrier function being called in the log, and I can see messages being sent and received thereafter, for both a passing and a failing example. For the passing example, the messages are sent/received and then the MPI_Barrier Leave event is logged.

    For the failing example it looks like one (or more) of the sent messages is lost - it is never received by the target. Am I correct in thinking that a message lost within the MPI_Barrier call means the processes never synchronize, and therefore all of them get stuck at the Communicator.world.Barrier() call?
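
    (For what it's worth, my mental model here is something like the naive point-to-point barrier sketched below. I realise MS-MPI's actual MPI_Barrier uses a more sophisticated algorithm - this is only to illustrate why a single lost message would leave every rank blocked.)

    // Conceptual sketch only - NOT MS-MPI's real algorithm. A "gather then
    // release" barrier built from blocking point-to-point messages: if any
    // single message here is never delivered, every rank blocks forever.
    #include <mpi.h>
    
    void naive_barrier(MPI_Comm comm)
    {
        int rank, size;
        char dummy = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
    
        if (rank == 0)
        {
            // Rank 0 waits for an "arrived" message from every other rank...
            for (int src = 1; src < size; ++src)
                MPI_Recv(&dummy, 0, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
            // ...then sends a "release" message back to each of them.
            for (int dst = 1; dst < size; ++dst)
                MPI_Send(&dummy, 0, MPI_BYTE, dst, 1, comm);
        }
        else
        {
            // All other ranks announce arrival, then block until released.
            // A lost "arrived" or "release" message stalls the whole job.
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, comm);
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 1, comm, MPI_STATUS_IGNORE);
        }
    }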

    What could be causing this to happen intermittently? Could poor network performance between the COMPUTE-NODES be a cause? This only happens when I run a task across more than one node: on 8 cores on a single node everything is fine, but on 9 cores across two nodes the task hangs at the barrier ~50% of the time.

    I'm running MS HPC Pack 2008 R2, so the version of MS-MPI is pretty old, v2.0.


    Monday, April 3, 2017 4:30 AM

All replies

  • Hi Matt,

    Can you try with MS-MPI v8 and let us know if this is still happening? Also, what happens if you compile a simple C++ MPI program (something like the test below) and run it instead of MPI.NET? We do not support MPI.NET, and I think the original author of MPI.NET has moved on to other things. In principle it should still work, but due to our lack of familiarity with its mechanics, our ability to troubleshoot issues will be limited.
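
    Something along these lines should be enough to take the MPI.NET layer out of the picture (just a rough sketch, adjust as needed):

    // Minimal native barrier test: if this also hangs when run across two
    // nodes, the problem is in MS-MPI / the cluster rather than in MPI.NET.
    #include <mpi.h>
    #include <cstdio>
    
    int main(int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);
    
        int rank = 0, size = 0, hostLen = 0;
        char host[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &hostLen);
    
        printf("Rank %d of %d on %s waiting at barrier...\n", rank, size, host);
        fflush(stdout);
    
        MPI_Barrier(MPI_COMM_WORLD);
    
        printf("Rank %d on %s is past the barrier\n", rank, host);
        fflush(stdout);
    
        MPI_Finalize();
        return 0;
    }

    Build it against the MS-MPI headers and msmpi.lib, and run it with the same node/core layout that reproduces the hang (e.g. 9 ranks across two nodes) through mpiexec or the HPC job scheduler.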

    Anh

    Wednesday, April 5, 2017 3:00 PM