MPI error on Ansys sparse solver with large model

  • Question

  • I just received three HP DL380 G7 servers. IT has set them up, all running MS Windows HPC Server 2008 R2 with MSMPI.

    The head node has a NIC for the enterprise network and one InfiniBand card connected to the IB switch; the 2 compute nodes have InfiniBand cards connected to the IB switch only.

    All three MPI tests in Cluster Manager passed. IT also ran Lizard to test the system, and the results look fine.

    I ran some small models and they were fine. I then ran a large model with the sparse solver, and it aborted.

    Here is the error message I got.

     

    job aborted:
    [ranks] message

    [0-2] terminated

    [3] fatal error
    Fatal error in PMPI_Bcast: Other MPI error, error stack:
    PMPI_Bcast(695).........................: MPI_Bcast(buf=0x00000000DDE7FB04, count=89513974, MPI_INT, root=0, comm=0x84000001) failed
    MPIR_Bcast(171).........................:
    MPIC_Recv(96)...........................:
    CH3_ND::CCq::Poll(136)..................:
    CH3_ND::CEndpoint::RecvSucceeded(1479)..:
    CH3_ND::CEndpoint::ProcessReceives(1153):
    CH3_ND::CEndpoint::ReadToIov(1794)......:
    CH3_ND::CEndpoint::Read(1716)...........:
    CH3_ND::CEnvironment::CreateMr(489).....:
    CH3_ND::CMr::Create(91).................:
    CH3_ND::CMr::Init(66)...................:
    CH3_ND::CAdapter::RegisterMemory(292)...: [ch3:nd] INDAdapter::RegisterMemory failed with 0xc0000001

    [4-29] terminated

    ---- error analysis -----

    [3] on NKTCCH01
    mpi has detected a fatal error and aborted \\NKTCCH01\Ansys Inc\v140\ANSYS\bin\winx64\ANSYS.EXE

    ---- error analysis -----

     ********** End of ANSYS Execution **********
    Wed 01/25/2012

     

    Friday, February 3, 2012 11:21 PM

All replies

  • What version of MSMPI are you running with?  This should be fixed in SP3, though you'll need to update the whole HPC Pack to SP3 not just the MSMPI bits.  We added some logic to better support registering memory when the underlying drivers can't secure the memory range to allow us to cache the registrations.
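    For a sense of scale, the failing call in the error stack above is a broadcast of count=89513974 MPI_INT elements, and the memory registration that fails is attempted while receiving that buffer. A rough size estimate, assuming a 4-byte MPI_INT (typical on Windows x64, but not stated anywhere in this thread):

```python
# Rough size of the buffer in the failing MPI_Bcast call.
# COUNT is taken from the error stack; the element size is an assumption.
COUNT = 89513974       # count= value from the PMPI_Bcast line
MPI_INT_BYTES = 4      # assumed size of MPI_INT on this platform

size_bytes = COUNT * MPI_INT_BYTES
size_mib = size_bytes / 2**20

print(f"{size_bytes} bytes = {size_mib:.1f} MiB")
```

    Under that assumption the broadcast moves roughly a third of a gigabyte in one call, which is consistent with the small models running fine and only the large one hitting the registration failure.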

    You can work around the problem by running your job with '-env MPICH_ND_ZCOPY_THRESHOLD -1' added to your mpiexec command line.  It's worth testing with and without zcopy turned on to see which performs better.
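    A minimal sketch of how the two variants could be compared. It only builds the two mpiexec command lines and offers a helper to time them. The '-env' option and the variable name come from the paragraph above, and the rank count of 30 comes from ranks 0-29 in the posted error output. The application name `my_solver.exe` and the helper names are hypothetical placeholders, and how the option is passed through when the ANSYS launcher generates the mpiexec call is not covered here.

```python
import subprocess
import time

ZCOPY_VAR = "MPICH_ND_ZCOPY_THRESHOLD"  # variable name as given in the reply


def build_mpiexec_args(app_args, ranks, apply_workaround):
    """Return the mpiexec argument list, optionally with the workaround."""
    args = ["mpiexec", "-n", str(ranks)]
    if apply_workaround:
        # Equivalent to adding: -env MPICH_ND_ZCOPY_THRESHOLD -1
        args += ["-env", ZCOPY_VAR, "-1"]
    return args + list(app_args)


def timed_run(args):
    """Run one command and return (exit code, elapsed seconds)."""
    start = time.perf_counter()
    returncode = subprocess.call(args)
    return returncode, time.perf_counter() - start


# Hypothetical application command; replace with the real solver invocation.
app = ["my_solver.exe"]

for apply_workaround in (False, True):
    cmd = build_mpiexec_args(app, 30, apply_workaround)
    print(" ".join(cmd))
    # To actually compare run times on the cluster, uncomment:
    # print(timed_run(cmd))
```

    Running the same model both ways and comparing elapsed time is one way to follow the suggestion above; on a system that still has the registration problem, the run without the workaround may simply abort again.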

    Let us know if that works for you.

    Cheers,
    -Fab

    Monday, February 13, 2012 5:06 AM
  • Windows is version 6.1.7601, Service Pack 1, Build 7601.

    The MSMPI service version is 3.2.3716.0, date modified 6/23/2011.

    I have not tried '-env MPICH_ND_ZCOPY_THRESHOLD -1' yet; I will let you know if it works.

    Thank you!

    Friday, March 2, 2012 5:04 PM