WHPCS2008 MPI diagnostics failing randomly

  • Question

  • Hi,

    I'm trying to help a customer who is running WHPCS 2008 on an SGI cluster (of Altix XE310 servers). WHPCS 2008 passes all the built-in diagnostic tests except for the MPI ping-pong and lightweight throughput tests, which fail on random nodes. The cluster is set up with a private GigE network for management and an InfiniBand (IB) network for MPI traffic. The IB network is connected to a Cisco SFS7000d switch which is running the subnet manager.
    When running the MPI ping-pong and lightweight throughput tests, multiple nodes fail randomly. For example, running the tests on a subset of the cluster works OK, but once they get up to around 14 nodes, the test fails on a random node. (The cluster has about 20 nodes in total.) However, a "clusrun" of simple commands works OK across all nodes. The log output from the MPI tests is as follows:

    Time      Message

    18/06/2009 12:32:06 PM                Reverted

    18/06/2009 12:32:05 PM                The operation failed due to errors during execution.

    18/06/2009 12:32:05 PM                The operation failed and will not be retried.

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                mpi has detected a fatal error and aborted mpipingpong.exe

    18/06/2009 12:32:05 PM                [5] on MICRO15

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [6-19] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::ConnReqFailed(407): [ch3:nd] INDConnector::Connect to 192.168.2.111:1 failed with 0x80070043

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::Connect(236)......:

    18/06/2009 12:32:05 PM                CH3_ND::CEnvironment::Connect(400)...:

    18/06/2009 12:32:05 PM                MPIDI_CH3I_VC_post_connect(426)......: MPIDI_CH3I_Nd_connect failed in VC_post_connect

    18/06/2009 12:32:05 PM                MPIDI_CH3_iSendv(239)................:

    18/06/2009 12:32:05 PM                MPIDI_EagerContigIsend(519)..........: failure occurred while attempting to send an eager message

    18/06/2009 12:32:05 PM                MPIC_Sendrecv(120)...................:

    18/06/2009 12:32:05 PM                MPIR_Allgather(487)..................:

    18/06/2009 12:32:05 PM                MPI_Allgather(864)...................: MPI_Allgather(sbuf=0x000000000022F750, scount=128, MPI_CHAR, rbuf=0x0000000000B71100, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed

    18/06/2009 12:32:05 PM                Fatal error in MPI_Allgather: Other MPI error, error stack:

    18/06/2009 12:32:05 PM                [5] fatal error

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [0-4] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [ranks] message

    18/06/2009 12:32:05 PM                job aborted:

    18/06/2009 12:32:05 PM               

    18/06/2009 12:31:50 PM                Connecting to scheduler service on node micro.
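
    For reference, a minimal MPI_Allgather test, mirroring the 128-byte MPI_CHAR call shown in the error stack above, can also be run outside the diagnostics framework with mpiexec to help narrow down which node is at fault. This is only a sketch: the node names in the run line are placeholders, and the build against the MS-MPI headers/libs from the HPC Pack SDK will depend on the install.

    /*
     * allgather_check.c - minimal reproduction of the failing collective.
     * Run with something like (NODE01/NODE02 stand in for real node names):
     *   mpiexec -hosts 2 NODE01 1 NODE02 1 allgather_check.exe
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        char sendbuf[128];
        char *recvbuf;
        int rank, size, namelen;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes a 128-byte record (its host name here),
           the same message size the diagnostic's error stack shows. */
        memset(sendbuf, 0, sizeof(sendbuf));
        MPI_Get_processor_name(sendbuf, &namelen);

        recvbuf = (char *)malloc((size_t)size * 128);

        /* The collective that aborts in the diagnostic log. */
        MPI_Allgather(sendbuf, 128, MPI_CHAR,
                      recvbuf, 128, MPI_CHAR, MPI_COMM_WORLD);

        if (rank == 0)
            printf("MPI_Allgather completed across %d ranks\n", size);

        free(recvbuf);
        MPI_Finalize();
        return 0;
    }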

     

    Has anyone seen this type of problem? Any suggestions for resolving it?

    Thanks very much.

    Regards,

    David

    • Moved by parmita mehta Tuesday, June 23, 2009 5:34 PM related to mpi failure (From: Windows HPC Server Deployment, Management, and Administration)
    Friday, June 19, 2009 6:10 AM

Answers

  • Hi Fab,
    Our guys appear to have identified the problem - a bad InfiniBand card in one of the servers that seems to be affecting the network as a whole (somehow). Disconnecting the IB card from the IB switch seems to fix the problem. They're going to replace the whole server. Thanks very much for your help.

    Regards,

    David
    • Marked as answer by Alex Sutton Friday, July 17, 2009 5:55 PM
    Wednesday, July 1, 2009 12:07 AM

All replies

  • Hi David,

    The error you're getting looks like it is related to the IP to IB address translation in the IB drivers.  A few questions:

    1. What version of the IB drivers are you running?
    2. Is the firmware up to date on both the switch as well as the HCAs?
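
    Also, for what it's worth, the 0x80070043 in your log is an HRESULT wrapping Win32 error 67 (ERROR_BAD_NET_NAME, "The network name cannot be found"), which fits an address-resolution problem on the ND/IB path. If you want to decode codes like that yourself, a quick sketch, assuming a Windows build environment (the value is hard-coded from your log):

    /*
     * Decode an HRESULT such as 0x80070043 from the ND connect failure.
     * 0x8007xxxx means FACILITY_WIN32, so the low 16 bits are a plain
     * Win32 error code (0x43 == 67 == ERROR_BAD_NET_NAME).
     */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HRESULT hr = (HRESULT)0x80070043;  /* value from the diagnostic log */
        DWORD win32 = HRESULT_CODE(hr);    /* low 16 bits: the Win32 code   */
        char msg[512] = "";

        FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                       NULL, win32, 0, msg, sizeof(msg), NULL);

        printf("HRESULT 0x%08lX -> Win32 error %lu: %s\n",
               (unsigned long)hr, (unsigned long)win32, msg);
        return 0;
    }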

    Thanks,
    -Fab
    Wednesday, June 24, 2009 12:47 AM
  • Thanks Fab.

    The software installed is WinOF_2_0_5, and from device manager, the Driver Assembly Version is 4335.

    I've also got "vstat" and "ibdiagnet" output if you like - it's too big to include inline here.

    Thanks for your help.

    Regards,

    David

    Thursday, June 25, 2009 11:16 PM
    > The software installed is WinOF_2_0_5, and from device manager, the Driver Assembly Version is 4335.
    That looks like the latest.

    > I've also got "vstat" and "ibdiagnet" output if you like - it's too big to include inline here.
    vstat output would be nice to see. Can you ping between the machines?

    Does the error only occur under stress?
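
    If you want to check the stress angle outside the diagnostics, a small MPI loop that keeps point-to-point traffic flowing between pairs of nodes tends to surface flaky links. Below is only a sketch - the message size, iteration count, and rank pairing are arbitrary choices, not anything taken from the diagnostics:

    /*
     * Point-to-point stress sketch: pairs of ranks exchange messages
     * repeatedly so that a flaky link or HCA is more likely to show up.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        enum { MSG_BYTES = 64 * 1024, ITERATIONS = 10000 };
        char *buf;
        int rank, size, peer, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        buf = (char *)calloc(MSG_BYTES, 1);

        /* Pair even ranks with the next odd rank: 0<->1, 2<->3, ... */
        peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

        if (peer < size) {
            for (i = 0; i < ITERATIONS; i++) {
                /* Send and receive in place to keep the loop simple. */
                MPI_Sendrecv_replace(buf, MSG_BYTES, MPI_CHAR,
                                     peer, 0, peer, 0,
                                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("stress loop finished on %d ranks\n", size);

        free(buf);
        MPI_Finalize();
        return 0;
    }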

    We've seen problems with connectivity when firmware versions were out of date.  While vstat shows the FW version on the HCA, I don't know what tool will show the FW version of the switch - perhaps you can log in to the switch and find out?

    Thanks,
    -Fab


    Monday, June 29, 2009 12:52 AM
  • Hi Fab,
    Our guys appear to have identified the problem - a bad InfiniBand card in one of the servers that seems to be affecting the network as a whole (somehow). Disconnecting the IB card from the IB switch seems to fix the problem. They're going to replace the whole server. Thanks very much for your help.

    Regards,

    David
    • Marked as answer by Alex Sutton Friday, July 17, 2009 5:55 PM
    Wednesday, July 1, 2009 12:07 AM