Intermittent hang in MPI_COMM_SPLIT (HPC 2008 SDK)

  • Question

  • Occasionally (maybe 10% of the time), my mpiexec job hangs during the first call to MPI_COMM_SPLIT on MPI_COMM_WORLD.

    If the job gets past this first split, it always runs to completion.

    Has anyone seen something like this?

    thanks

    David

     

    SW Environment:  HPC 2008 SDK (mpiexec, smpd), MPI.NET, Visual Studio 2010/CLR 4

    Compute OSes:  Windows Server 2008 R2, Windows XP SP3 (32-bit)

    Network: 1 Gbps Ethernet

    Application:  One or more message initiators send messages over a communicator to a single repeater, which forwards them over a second communicator to a single final receiving process (sketched below).  All initiators run on the same host, but the repeater and the final receiver can be on any host.

    Failure characteristics:  Intermittent hangs are observed when the repeater or the final receiver runs on a different host from the initiators.
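
    To make the topology concrete, here is a rough sketch of how the ranks map onto the two communicators (illustrative only, not the production code: numProducers, the role tests, and the queue colors 1 and 2 are placeholders; the Split(color, key) call is the MPI.NET one used in the snippet below):

    // Hypothetical role layout matching the command line further down: ranks 0..numProducers-1
    // are initiators (Firehose), rank numProducers is the repeater (Piper), and rank
    // numProducers+1 is the final receiver (Guzzler).
    let numProducers = 3
    let world = MPI.Communicator.world
    let isInitiator = world.Rank < numProducers
    let isRepeater  = world.Rank = numProducers
    let isReceiver  = world.Rank = numProducers + 1

    // One communicator links the initiators to the repeater; a second links the repeater to
    // the final receiver. MPI_Comm_split is collective over MPI_COMM_WORLD, so every rank
    // takes part in both Split calls; ranks that do not use a given path pass a reserved
    // "don't care" color.
    let dontCare = 32768
    let pathOneComm = world.Split((if isInitiator || isRepeater then 1 else dontCare), 0)
    let pathTwoComm = world.Split((if isRepeater || isReceiver then 2 else dontCare), 0)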

    Code snippet:

    type MPIConst =
        | DontCareColor = 32768
        | DontCareTag = 0

    // Split using the queue's color (a small integer) for queues we communicate over, or DontCareColor for queues we don't use.
            for candidate in KnownQueues.All do
                match Seq.tryFind (fun actual -> actual = candidate) qs with
                | Some (Queue(name, MPI color, _)) -> 
                    Logger.Log ("Initialize: Splitting queue {0}", name)
                    let queueComm = MPI.Communicator.world.Split(color, int MPIConst.DontCareTag)
                    Logger.Log ("Initialize: Split queue {0}, size {1}", name, queueComm.Size)
                    Comms.[name] <- queueComm 
                | _ -> 
    // MPIBUG: MPI.Net complains if < 0, but per MPI 2.2 spec, it should allow NoProcess ( == MPI_UNDEFINED )
    //              MPI.Communicator.world.Split(MPI.Group.NoProcess, DontCareTag) |> ignore            
                    Logger.Log "Initialize: Splitting don't care queue"
                    MPI.Communicator.world.Split(int MPIConst.DontCareColor, int MPIConst.DontCareTag).Dispose()
                    Logger.Log "Initialize: Disposing don't care queue"


    Log shows:
    [2.Firehose]       Initialize: Splitting don't care queue
    [1.Firehose]       Initialize: Splitting don't care queue
    [4.Guzzler]        Initialize: Splitting queue Q1
    [3.Piper]          Initialize: Splitting queue Q1
    [0.Firehose]       Initialize: Splitting don't care queue


    Typical command line:
    c:\BuildBin\HighSpeedBus\Release>mpiexec  -n 3  -host ca1tesla1 c:\BuildBin\HighSpeedBus\Release\Firehose.exe : -host ca1tesla1 -n 1 c:\BuildBin\HighSpeedBus\Release\Piper.exe /numProducers=3 : -host calt0677 -n 1 c:\BuildBin\HighSpeedBus\Release\Guzzler.exe

    Other diagnostic info: 
    Vampir indicated that all but one of the processes on the machine that initiated the mpiexec job returned from the MPI_Allgather called from within MPI_Comm_split; a single process appeared to be hung in that MPI_Allgather.  I have not been able to use any of the trace tools effectively in a multi-host environment, since the trace files on the non-originating hosts do not seem to contain rank information (I will post a separate question about that).

    Any help or suggestions would be greatly appreciated.

    Tuesday, May 4, 2010 6:44 PM

Answers