Answered: WHPCS2008 MPI diagnostics failing randomly

  • June 19, 2009 6:10 kellydavid
     
    Hi,

    I'm trying to help a customer who is running WHPCS 2008 on an SGI cluster (of Altix XE310 servers). WHPCS 2008 passes all the built-in diagnostic tests except for the MPI ping-pong and lightweight throughput tests. These fail on random nodes. The cluster is set up with a private GigE network for management and an InfiniBand (IB) network for MPI traffic. The IB network is connected to a Cisco SFS7000d switch, which is running the subnet manager.
    When running the MPI ping-pong and lightweight throughput tests, multiple nodes fail randomly. For example, running the tests on a subset of the cluster works OK, but once the subset gets up to around 14 nodes, the test fails on a random node. (The cluster has about 20 nodes in total.) However, a "clusrun" of simple commands works OK across all nodes. The log output from the MPI tests is as follows:

    Time      Message

    18/06/2009 12:32:06 PM                Reverted

    18/06/2009 12:32:05 PM                The operation failed due to errors during execution.

    18/06/2009 12:32:05 PM                The operation failed and will not be retried.

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                mpi has detected a fatal error and aborted mpipingpong.exe

    18/06/2009 12:32:05 PM                [5] on MICRO15

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [6-19] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::ConnReqFailed(407): [ch3:nd] INDConnector::Connect to 192.168.2.111:1 failed with 0x80070043

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::Connect(236)......:

    18/06/2009 12:32:05 PM                CH3_ND::CEnvironment::Connect(400)...:

    18/06/2009 12:32:05 PM                MPIDI_CH3I_VC_post_connect(426)......: MPIDI_CH3I_Nd_connect failed in VC_post_connect

    18/06/2009 12:32:05 PM                MPIDI_CH3_iSendv(239)................:

    18/06/2009 12:32:05 PM                MPIDI_EagerContigIsend(519)..........: failure occurred while attempting to send an eager message

    18/06/2009 12:32:05 PM                MPIC_Sendrecv(120)...................:

    18/06/2009 12:32:05 PM                MPIR_Allgather(487)..................:

    18/06/2009 12:32:05 PM                MPI_Allgather(864)...................: MPI_Allgather(sbuf=0x000000000022F750, scount=128, MPI_CHAR, rbuf=0x0000000000B71100, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed

    18/06/2009 12:32:05 PM                Fatal error in MPI_Allgather: Other MPI error, error stack:

    18/06/2009 12:32:05 PM                [5] fatal error

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [0-4] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [ranks] message

    18/06/2009 12:32:05 PM                job aborted:

    18/06/2009 12:32:05 PM               

    18/06/2009 12:31:50 PM                Connecting to scheduler service on node micro.

     

    Has anyone seen this type of problem? Any suggestions for resolving it?

    Thanks very much.

    Regards,

    David
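When the diagnostic fails on a different node each run, it may help to pull the same few fields out of every log and compare them. The following is a minimal sketch, assuming the log lines look like the ones quoted above with the timestamps stripped; the function name `summarize` and the returned field names are illustrative and not part of any HPC tool.

```python
import re

# Excerpt of the diagnostic log quoted above (timestamps stripped).
LOG = """\
mpi has detected a fatal error and aborted mpipingpong.exe
[5] on MICRO15
[6-19] terminated
CH3_ND::CEndpoint::ConnReqFailed(407): [ch3:nd] INDConnector::Connect to 192.168.2.111:1 failed with 0x80070043
[5] fatal error
[0-4] terminated
"""

def summarize(log):
    """Return the failing rank, its node, and the connection target that failed."""
    # "[5] on MICRO15" names the rank that hit the fatal error and its host.
    rank_node = re.search(r"^\[(\d+)\] on (\S+)$", log, re.MULTILINE)
    # The NetworkDirect connect line carries the peer address and the error code.
    connect = re.search(
        r"Connect to (\d+\.\d+\.\d+\.\d+):\d+ failed with (0x[0-9A-Fa-f]{8})", log
    )
    return {
        "rank": int(rank_node.group(1)) if rank_node else None,
        "node": rank_node.group(2) if rank_node else None,
        "target_ip": connect.group(1) if connect else None,
        "error": connect.group(2) if connect else None,
    }

summary = summarize(LOG)
print(summary)
```

Collecting the `node` and `target_ip` values over several failed runs might show whether one address keeps appearing on either side of the failed connection, though that is only a starting point for investigation.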

    • Moved by parmita mehta (Moderator), June 23, 2009 17:34: related to mpi failure (From: Windows HPC Server Deployment, Management, and Administration)

Answer

  • July 1, 2009 0:07 kellydavid
     Answered
    Hi Fab,
    Our guys appear to have identified the problem: a bad InfiniBand card in one of the servers that somehow seems to be affecting the network as a whole. Disconnecting that IB card from the IB switch seems to fix the problem. They're going to replace the whole server. Thanks very much for your help.

    Regards,

    David

All replies

  • June 24, 2009 0:47 Fab Tillier [MS] (MSFT)
     
    Hi David,

    The error you're getting looks like it is related to the IP to IB address translation in the IB drivers.  A few questions:

    1. What version of the IB drivers are you running?
    2. Is the firmware up to date on both the switch and the HCAs?

    Thanks,
    -Fab
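For reference, the code `0x80070043` in the log is a standard Windows HRESULT, and it can be split into its fields by hand. The sketch below assumes only the documented HRESULT bit layout; the helper name `decode_hresult` is made up for illustration.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    return {
        "failure": bool((value >> 31) & 0x1),   # top bit set means failure
        "facility": (value >> 16) & 0x7FF,      # 11-bit facility field
        "code": value & 0xFFFF,                 # low 16 bits: the error code
    }

fields = decode_hresult(0x80070043)
print(fields)  # {'failure': True, 'facility': 7, 'code': 67}
```

Facility 7 is FACILITY_WIN32, and Win32 error 67 is ERROR_BAD_NET_NAME ("The network name cannot be found"). That appears consistent with a failure to resolve the peer address, but it does not by itself say which component is at fault.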
  • June 25, 2009 23:16 kellydavid
     
    Thanks Fab.

    The software installed is WinOF_2_0_5, and from device manager, the Driver Assembly Version is 4335.

    I've also got "vstat" and "ibdiagnet" output if you like - it's too big to include inline here.

    Thanks for your help.

    Regards,

    David

  • June 29, 2009 0:52 Fab Tillier [MS] (MSFT)
     
    > The software installed is WinOF_2_0_5, and from device manager, the Driver Assembly Version is 4335.

    That looks like the latest.

    > I've also got "vstat" and "ibdiagnet" output if you like - it's too big to include inline here.

    vstat output would be nice to see.  Can you ping between the machines?

    Does the error only occur under stress?

    We've seen problems with connectivity when firmware versions were out of date.  While vstat shows the FW version on the HCA, I don't know what tool will show the FW version of the switch - perhaps you can log in to the switch and find out?

    Thanks,
    -Fab
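One way to run the ping check systematically is a small sweep script executed on each compute node against the IB addresses of its peers. This is a rough sketch, assuming Windows `ping` (`-n` sets the echo count, `-w` the timeout in milliseconds); the function names are illustrative. `192.168.2.111` comes from the log above, and any other address you pass in is your own. Note that Windows `ping` can exit with 0 even when the reply is "Destination host unreachable", so the output may still need to be read by eye.

```python
import subprocess

def build_ping_command(ip, count=1, timeout_ms=1000):
    """Build a Windows ping command line for one address."""
    return ["ping", "-n", str(count), "-w", str(timeout_ms), ip]

def sweep(ips, runner=subprocess.run):
    """Ping each address and return the ones whose ping exited non-zero."""
    unreachable = []
    for ip in ips:
        result = runner(build_ping_command(ip), capture_output=True)
        if result.returncode != 0:
            unreachable.append(ip)
    return unreachable

# Example (run on a compute node; the address list is an assumption):
# print(sweep(["192.168.2.111"]))
```

A single ping exercises the IPoIB path only lightly, so a clean sweep would not rule out a problem that shows up under stress.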

