MPI problem

  • Question

  • Hi. I have an MPI job that fails with the following error. It runs without problems on one node with multiple processes, but whenever I use multiple nodes I get this error. My HPC cluster uses InfiniBand. Is there an option I should set for this type of network? I don't know where to look.

    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in MPI_Scatterv: Other MPI error, error stack:
    MPI_Scatterv(358).......................: MPI_Scatterv(sbuf=0x06F47100, scnts=0x02138718, displs=0x01E21180, MPI_DOUBLE, rbuf=0x01C19660, rcount=1, MPI_DOUBLE, root=0, comm=0x84000001) failed
    MPIR_Scatterv(119)......................:
    MPIC_Send(39)...........................:
    MPIC_Wait(277)..........................:
    CH3_ND::CCq::Poll(136)..................:
    CH3_ND::CEndpoint::RecvSucceeded(1476)..:
    CH3_ND::CEndpoint::ProcessReceives(1120):
    CH3_ND::CEndpoint::ProcessDataMsg(1281).:
    MPIDI_CH3_RndvSend(271).................: failure occurred while attempting to send message data
    CH3_ND::CEndpoint::ProcessSends(869)....:
    CH3_ND::CEnvironment::CreateMr(490).....:
    CH3_ND::CMr::Create(91).................:
    CH3_ND::CMr::Init(66)...................:
    CH3_ND::CAdapter::RegisterMemory(293)...: [ch3:nd] INDAdapter::RegisterMemory failed with 0xc0000001


    [1-4] terminated

    Any advice or help will be greatly appreciated. 

    Thanks,
    Jong
    Friday, November 6, 2009 6:10 AM

Answers

  • Hi Jong,

    That error indicates that memory registration failed (ND_UNSUCCESSFUL).  Not the most useful error message, sadly.

    How was the buffer you are sending allocated?  How big is it?

    As a temporary workaround, you can set MPICH_ND_ZCOPY_THRESHOLD to -1 by passing '-env MPICH_ND_ZCOPY_THRESHOLD -1' to mpiexec.  This disables the zero-copy path in MS-MPI, which removes the registration of user buffers from the I/O path.

    Thanks,
    -Fab

    • Proposed as answer by Don Pattee Wednesday, December 9, 2009 6:20 AM
    • Marked as answer by yyalli Sunday, January 17, 2010 4:20 PM
    Thursday, December 3, 2009 6:17 PM
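
For readers who want to see where the '-env' option sits on a full command line, here is a minimal sketch that only builds and prints the mpiexec invocation; it does not run it. The helper name `build_mpiexec_command`, the executable name `myapp.exe`, and the rank count of 5 (chosen to match ranks 0-4 in the error output) are assumptions for illustration and are not taken from the thread.

```python
def build_mpiexec_command(app, ranks, env=None):
    """Return an mpiexec argument list with one '-env NAME VALUE' triple per variable."""
    args = ["mpiexec", "-n", str(ranks)]
    for name, value in (env or {}).items():
        # Each variable is passed as three separate arguments: -env NAME VALUE.
        args += ["-env", name, str(value)]
    # The application comes last, after all mpiexec options.
    args.append(app)
    return args


# Hypothetical job: executable name and rank count are placeholders.
cmd = build_mpiexec_command("myapp.exe", 5, {"MPICH_ND_ZCOPY_THRESHOLD": -1})
print(" ".join(cmd))
```

The printed line can be pasted into a job script; any other options the job already uses (host lists, working directory, and so on) would stay as they are.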

All replies

  • It works perfectly. Thank you so much for your comment.

    One quick question: is there any performance penalty (or gain) from setting MPICH_ND_ZCOPY_THRESHOLD to -1?

    Thanks again, Fab.

    Jong
    Sunday, January 17, 2010 4:24 PM
  • Hi Jong,

    The MPICH_ND_ZCOPY_THRESHOLD parameter controls when the Network Direct communication channel in MS-MPI switches from copying user data into pre-registered buffers to registering the user buffers directly and avoiding copies both at the sender and receiver.  Setting this environment variable to -1 effectively disables the zero-copy (zcopy) mechanism.

    Whether it affects performance depends on the application - the size of MPI requests, whether the buffer is reused for multiple I/O requests, the frequency of I/O requests, the platform's memory copy performance, etc.  If your application uses the same buffer for multiple I/O requests, generally you will see better performance with zcopy enabled.

    -Fab

    Sunday, January 17, 2010 4:30 PM
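
The threshold behaviour described above can be pictured with a small toy model. This is not MS-MPI code: the function name `choose_send_path`, the 128 KiB example threshold, and the exact comparison (at or above the threshold means zero-copy) are assumptions made for illustration. The only points taken from the thread are that the threshold decides when the channel switches from copying to registering user buffers, and that -1 disables zero-copy.

```python
def choose_send_path(message_bytes, zcopy_threshold):
    """Pick the transfer path for one message in this toy model.

    Assumptions (not confirmed by the thread): messages at or above the
    threshold take the zero-copy path; a negative threshold disables it.
    """
    if zcopy_threshold < 0:
        # Zero-copy disabled: data is always copied into pre-registered buffers.
        return "copy"
    if message_bytes >= zcopy_threshold:
        # Large message: register the user buffer and transfer it directly.
        return "zcopy"
    # Small message: copying is used instead of registering the user buffer.
    return "copy"


HYPOTHETICAL_THRESHOLD = 128 * 1024  # placeholder value, not the MS-MPI default

for size in (1024, 1024 * 1024):
    print(size, choose_send_path(size, HYPOTHETICAL_THRESHOLD), choose_send_path(size, -1))
```

With the threshold set to -1, every message in the model takes the copy path, which is why the workaround avoids the failing registration of user buffers.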