I just got three HP DL380 G7 servers. IT has set them up, all running Windows HPC Server 2008 R2 with MS-MPI.
The head node has one NIC for the enterprise network and one InfiniBand card connected to the IB switch. The two compute nodes each have an InfiniBand card connected to the IB switch only.
The three MPI tests in Cluster Manager all passed. IT also ran Lizard to test the system, and the results look fine.
I ran some small models and they were fine. Then I ran a large model with the sparse solver and it aborted.
Here is the error message I got:
job aborted:
[ranks] message
[0-2] terminated
[3] fatal error
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(695).........................: MPI_Bcast(buf=0x00000000DDE7FB04, count=89513974, MPI_INT, root=0, comm=0x84000001) failed
MPIR_Bcast(171).........................:
MPIC_Recv(96)...........................:
CH3_ND::CCq::Poll(136)..................:
CH3_ND::CEndpoint::RecvSucceeded(1479)..:
CH3_ND::CEndpoint::ProcessReceives(1153):
CH3_ND::CEndpoint::ReadToIov(1794)......:
CH3_ND::CEndpoint::Read(1716)...........:
CH3_ND::CEnvironment::CreateMr(489).....:
CH3_ND::CMr::Create(91).................:
CH3_ND::CMr::Init(66)...................:
CH3_ND::CAdapter::RegisterMemory(292)...: [ch3:nd] INDAdapter::RegisterMemory failed with 0xc0000001
[4-29] terminated
---- error analysis -----
[3] on NKTCCH01
mpi has detected a fatal error and aborted
\\NKTCCH01\Ansys Inc\v140\ANSYS\bin\winx64\ANSYS.EXE
---- error analysis -----
********** End of ANSYS Execution **********
Wed 01/25/2012
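
For reference, here is a quick calculation of how large the buffer in the failing MPI_Bcast call is. The count is copied from the PMPI_Bcast line in the log above; the 4-byte element size is my assumption for MPI_INT (a C int on this platform), and the function name is just for illustration. This only shows the size of the broadcast, not the cause of the RegisterMemory failure.

```python
# Size of the buffer passed to the failing MPI_Bcast call in the log above.
COUNT = 89513974          # count= value copied from the PMPI_Bcast line
BYTES_PER_MPI_INT = 4     # assumption: MPI_INT maps to a 4-byte C int


def bcast_bytes(count: int, elem_size: int = BYTES_PER_MPI_INT) -> int:
    """Return the payload size in bytes for a broadcast of `count` elements."""
    return count * elem_size


size = bcast_bytes(COUNT)
# Print the size in bytes and in MiB (1 MiB = 2**20 bytes).
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

So the single broadcast that fails is moving a few hundred megabytes in one call, which is much larger than anything the small models send.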