MPI problem

Question
-
Hi. I have an MPI job that produces the following error. I have no problem if I use one node with multiple processes, but if I use multiple nodes I always get this error. My HPC cluster uses InfiniBand. Is there any option I should use for this type of network? I don't know where to look.

job aborted:
[ranks] message

[0] fatal error
Fatal error in MPI_Scatterv: Other MPI error, error stack:
MPI_Scatterv(358).......................: MPI_Scatterv(sbuf=0x06F47100, scnts=0x02138718, displs=0x01E21180, MPI_DOUBLE, rbuf=0x01C19660, rcount=1, MPI_DOUBLE, root=0, comm=0x84000001) failed
MPIR_Scatterv(119)......................:
MPIC_Send(39)...........................:
MPIC_Wait(277)..........................:
CH3_ND::CCq::Poll(136)..................:
CH3_ND::CEndpoint::RecvSucceeded(1476)..:
CH3_ND::CEndpoint::ProcessReceives(1120):
CH3_ND::CEndpoint::ProcessDataMsg(1281).:
MPIDI_CH3_RndvSend(271).................: failure occurred while attempting to send message data
CH3_ND::CEndpoint::ProcessSends(869)....:
CH3_ND::CEnvironment::CreateMr(490).....:
CH3_ND::CMr::Create(91).................:
CH3_ND::CMr::Init(66)...................:
CH3_ND::CAdapter::RegisterMemory(293)...: [ch3:nd] INDAdapter::RegisterMemory failed with 0xc0000001

[1-4] terminated

Any advice or help will be greatly appreciated.
Thanks,
Jong
Friday, November 6, 2009 6:10 AM
Answers
-
Hi Jong,
That error indicates that the memory registration failed (ND_UNSUCCESSFUL). Not the most useful error message, sadly.
How was the buffer you are sending allocated? How big is it?
As a temporary workaround, you can set MPICH_ND_ZCOPY_THRESHOLD to -1 by passing '-env MPICH_ND_ZCOPY_THRESHOLD -1' to mpiexec. This disables the zero-copy path in MS-MPI, which removes the registration of user buffers from the I/O path.
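For anyone launching jobs from a script, here is a minimal sketch of how that option might be assembled into an mpiexec command line. The helper name mpiexec_command, the program name myapp.exe, and the process count of 8 are hypothetical placeholders; only the '-env NAME VALUE' form comes from the reply above. The sketch only builds and prints the command; it does not run it.

```python
def mpiexec_command(app, nprocs, env=None):
    """Build an mpiexec argument list (hypothetical helper, for illustration).

    Each entry in `env` becomes one '-env NAME VALUE' triple, which is the
    form quoted in the reply above.
    """
    argv = ["mpiexec", "-n", str(nprocs)]
    for name, value in (env or {}).items():
        argv += ["-env", name, str(value)]
    argv.append(app)
    return argv


# myapp.exe and 8 processes are placeholders, not taken from the thread.
cmd = mpiexec_command("myapp.exe", 8, {"MPICH_ND_ZCOPY_THRESHOLD": -1})
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run on a machine where mpiexec is installed; host selection and any scheduler options are left out here because they depend on the cluster.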
Thanks,
-Fab
- Proposed as answer by Don Pattee Wednesday, December 9, 2009 6:20 AM
- Marked as answer by yyalli Sunday, January 17, 2010 4:20 PM
Thursday, December 3, 2009 6:17 PM
All replies
-
It works perfectly. Thank you so much for your comment.
One quick question: is there any performance penalty (or gain) from setting MPICH_ND_ZCOPY_THRESHOLD to -1?
Thanks again, Fab.
Jong
Sunday, January 17, 2010 4:24 PM
-
Hi Jong,
The MPICH_ND_ZCOPY_THRESHOLD parameter controls when the Network Direct communication channel in MS-MPI switches from copying user data into pre-registered buffers to registering the user buffers directly and avoiding copies both at the sender and receiver. Setting this environment variable to -1 effectively disables the zero-copy (zcopy) mechanism.
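To make the switch-over concrete, here is a toy model of the decision described above. It is not MS-MPI code. The function name, the assumption that the threshold is a message size in bytes, the use of >= rather than >, and the 128 KiB example value are all illustrative guesses; only the two paths and the meaning of -1 come from the description.

```python
def uses_zero_copy(message_bytes, threshold):
    """Toy model: would a message of this size take the zero-copy path?

    Assumptions (not confirmed by the thread): the threshold is a size in
    bytes, and messages at or above it register the user buffer directly.
    A negative threshold such as -1 disables zero-copy entirely.
    """
    if threshold < 0:
        return False  # zcopy disabled: always copy via pre-registered buffers
    return message_bytes >= threshold


HYPOTHETICAL_THRESHOLD = 128 * 1024  # example value only, not the real default

for size in (1024, 1 << 20):
    for threshold in (HYPOTHETICAL_THRESHOLD, -1):
        path = "zero-copy" if uses_zero_copy(size, threshold) else "copy"
        print(f"size={size:>8} threshold={threshold:>7} -> {path}")
```

With -1, every message in this model takes the copy path, so no user buffer is ever registered, which is why the workaround avoids the RegisterMemory failure.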
Whether it affects performance depends on the application - the size of MPI requests, whether the buffer is reused for multiple I/O requests, the frequency of I/O requests, the platform's memory copy performance, etc. If your application uses the same buffer for multiple I/O requests, generally you will see better performance with zcopy enabled.
-Fab
Sunday, January 17, 2010 4:30 PM