 

Answered: WHPCS2008 MPI diagnostics failing randomly

  • June 19, 2009, 6:10, kellydavid
     
    Hi,

    I'm trying to help a customer who is running WHPCS 2008 on an SGI cluster of Altix XE310 servers. WHPCS 2008 passes all the built-in diagnostic tests except the MPI ping-pong and lightweight throughput tests, which fail on random nodes. The cluster is set up with a private GigE network for management and an InfiniBand (IB) network for MPI traffic. The IB network is connected to a Cisco SFS7000d switch, which is running the subnet manager.
    When the MPI ping-pong and lightweight throughput tests are run on a subset of the cluster, they work OK, but once the run reaches around 14 nodes, the test fails on a random node. (The cluster has about 20 nodes in total.) However, a "clusrun" of simple commands works OK across all nodes. The log output from the MPI tests is as follows (newest entries first):

    Time      Message

    18/06/2009 12:32:06 PM                Reverted

    18/06/2009 12:32:05 PM                The operation failed due to errors during execution.

    18/06/2009 12:32:05 PM                The operation failed and will not be retried.

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                mpi has detected a fatal error and aborted mpipingpong.exe

    18/06/2009 12:32:05 PM                [5] on MICRO15

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                ---- error analysis -----

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [6-19] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::ConnReqFailed(407): [ch3:nd] INDConnector::Connect to 192.168.2.111:1 failed with 0x80070043

    18/06/2009 12:32:05 PM                CH3_ND::CEndpoint::Connect(236)......:

    18/06/2009 12:32:05 PM                CH3_ND::CEnvironment::Connect(400)...:

    18/06/2009 12:32:05 PM                MPIDI_CH3I_VC_post_connect(426)......: MPIDI_CH3I_Nd_connect failed in VC_post_connect

    18/06/2009 12:32:05 PM                MPIDI_CH3_iSendv(239)................:

    18/06/2009 12:32:05 PM                MPIDI_EagerContigIsend(519)..........: failure occurred while attempting to send an eager message

    18/06/2009 12:32:05 PM                MPIC_Sendrecv(120)...................:

    18/06/2009 12:32:05 PM                MPIR_Allgather(487)..................:

    18/06/2009 12:32:05 PM                MPI_Allgather(864)...................: MPI_Allgather(sbuf=0x000000000022F750, scount=128, MPI_CHAR, rbuf=0x0000000000B71100, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed

    18/06/2009 12:32:05 PM                Fatal error in MPI_Allgather: Other MPI error, error stack:

    18/06/2009 12:32:05 PM                [5] fatal error

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [0-4] terminated

    18/06/2009 12:32:05 PM               

    18/06/2009 12:32:05 PM                [ranks] message

    18/06/2009 12:32:05 PM                job aborted:

    18/06/2009 12:32:05 PM               

    18/06/2009 12:31:50 PM                Connecting to scheduler service on node micro.
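
    As a reading aid for the log above: the failing INDConnector::Connect call reports 0x80070043, which is a Windows HRESULT. The sketch below splits that value into its standard fields. The function name decode_hresult and the small lookup table are illustrative assumptions, not part of any HPC Pack tooling; the one table entry is the winerror.h name for Win32 error 67.

```python
# Minimal sketch: split a 32-bit HRESULT into its standard bit fields.
# decode_hresult and WIN32_NAMES are hypothetical helpers for illustration.

FACILITY_WIN32 = 7  # facility value used for wrapped Win32 error codes

# Only the single code seen in the log; names as defined in winerror.h.
WIN32_NAMES = {67: "ERROR_BAD_NET_NAME"}


def decode_hresult(hresult: int) -> dict:
    """Return the severity, facility, and code fields of an HRESULT."""
    hresult &= 0xFFFFFFFF                    # normalise to unsigned 32-bit
    failure = bool(hresult >> 31)            # bit 31: severity (1 = failure)
    facility = (hresult >> 16) & 0x7FF       # bits 16-26: facility
    code = hresult & 0xFFFF                  # bits 0-15: error code
    name = WIN32_NAMES.get(code) if facility == FACILITY_WIN32 else None
    return {"failure": failure, "facility": facility, "code": code, "name": name}


if __name__ == "__main__":
    print(decode_hresult(0x80070043))
```

    For 0x80070043 this gives a failure in the Win32 facility with code 67 (ERROR_BAD_NET_NAME, "The network name cannot be found"). That only narrows down what the connect call reported for 192.168.2.111; it does not by itself identify the cause.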

     

    Has anyone seen this type of problem? Any suggestions for resolving it?

    Thanks very much.

    Regards,

    David

    • Moved by parmita mehta (Moderator), June 23, 2009, 17:34: related to mpi failure (From: Windows HPC Server Deployment, Management, and Administration)

Answers

All replies