"Invalid displacement argument in RMA call, error stack" given when a thread calls MPI_Iprobe.

  • Question

  • Hi.

    I am using MPI_Iprobe to poll for error messages in my HPC application.

    Under certain circumstances (typically compute- or communication-intensive situations) we see the following error message:

    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in MPI_Iprobe: Invalid displacement argument in RMA call, error stack:
    MPI_Iprobe(src=MPI_ANY_SOURCE, tag=2147483647, comm=0x84000000, flag=0x00000055C13FEC08, status=0x00000055C13FF920) failed
    (unknown)(): Invalid displacement argument in RMA call

    [1-5] terminated

    ---- error analysis -----

    [0] on NODE
    mpi has detected a fatal error and aborted

    Have you seen this before and is there a workaround for this?

    Kind Regards,

    Renier

    Thursday, June 28, 2018 12:56 PM

Answers

  • Hi Renier,

    This has been fixed in our latest release (MSMPI v10.0).
    The download link is at https://docs.microsoft.com/en-us/message-passing-interface/microsoft-mpi.

    MSMPI is now open source; you can find us on GitHub: https://github.com/Microsoft/Microsoft-MPI

    Best,
    Jithin

    • Marked as answer by RenierM Tuesday, November 20, 2018 8:33 AM
    Friday, November 9, 2018 7:10 PM

All replies

  • Hi Renier,

    Thanks for reporting. The error message comes from a standard error check on the window displacement. I suspect it is hitting a race condition somehow.
    Is there a stripped-down version of your application code that you can share with us (or even a simple reproducer)?

    Thanks,
    Jithin

    Friday, June 29, 2018 4:28 PM
  • Hi Jithin

    Thanks for the reply. I will try to create a reproducer. From what I have subsequently seen, it appears to occur only when MKL routines such as cgetrf are called.

    Thanks,

    Renier

    Tuesday, July 3, 2018 10:58 AM
  • Hi Jithin

    See the reproducer below that does not include MKL routines, clearly showing that the error originates from MS-MPI.

    This is built using the command:

    icl pingpong.cpp -o pingpong.exe msmpi.lib -I$MSMPIINCDIR /link /LIBPATH:$MSMPILIBDIR

    Thanks

    #include <windows.h> /* generally required for Windows, otherwise wincon.h gives errors */
    #include <stdint.h>
    #include <stdlib.h> /* malloc, exit */
    #include <stdio.h>
    #include "mpi.h"
    #include <omp.h>
    
    
    unsigned long error_thread_id=0;
    
    typedef int32_t integer;
    typedef int32_t logical;
    
    integer time_for_heartbeat_msec = 100;
    logical keep_running_thread = 1;
    
    
    // =========================================================================================
    void geterr()
    {
            int flag;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
    }
    
    // =========================================================================================
    DWORD WINAPI error_heartbeat(LPVOID lpParam)
    {
      integer *wait_msec = (integer *)lpParam;
      error_thread_id = GetCurrentThreadId();
      static int cnt = 0;
    
      Sleep(*wait_msec);
      while (keep_running_thread)
      {
        printf("tick %d\n", cnt++);
        geterr();
        fflush(stdout);
        Sleep(*wait_msec);
      }
      return 0;
    }
    
    // =========================================================================================
    void make_error_thread()
    {
      DWORD  dwThread;
    
      integer *wait_msec = (integer *)malloc(1*sizeof(integer));
      *wait_msec = time_for_heartbeat_msec;
    
      HANDLE hThread = CreateThread(
                NULL,                 // default security attributes
                0,                    // use default stack size
                error_heartbeat,      // thread function name
                wait_msec,            // pointer argument to thread function
                0,                    // use default creation flags
                &dwThread);           // returns the thread identifier
    
      if (hThread == NULL)
      {
        printf("Could not start error_thread!\n");
        exit (1);
      }
    }
    
    // =========================================================================================
    void stop_error_thread()
    {
      keep_running_thread = 0;
      Sleep(10);
    }
    // =========================================================================================
    
    
    const int ARRSIZE = 128*1024*1024;
    const int MAX_ROUND = 42;
    
    int main()
    {
        int id;
        int procCount;
        int provided;
    
        MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_size(MPI_COMM_WORLD, &procCount);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
    
        make_error_thread();
    
        int* arr = new int[ARRSIZE];
        int* buf = new int[ARRSIZE];
    
        int otherProc = (id+1)%2;
        MPI_Request request_dummy;
        
        for (int i = 0; i < MAX_ROUND; ++i)
        {
            MPI_Isend(arr, ARRSIZE, MPI_INT, otherProc, 0, MPI_COMM_WORLD,&request_dummy);
            MPI_Recv(buf, ARRSIZE, MPI_INT, otherProc, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    
        MPI_Finalize();
        return 0;
    }

    Wednesday, September 5, 2018 6:33 AM
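
A side note on the reproducer above: it signals the heartbeat thread through a plain integer flag, and `stop_error_thread` is never called, so the poller can still be running when `MPI_Finalize` is reached. This is separate from the MS-MPI error being reported. A minimal portable sketch of a poller that can be stopped and joined is shown below, assuming C++11; `Heartbeat` is a hypothetical class, and the `poll` callback stands in for the `MPI_Iprobe` call.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical helper (not part of MS-MPI or the original reproducer):
// calls `poll` every `interval` on a background thread until stop().
class Heartbeat {
public:
    Heartbeat(std::function<void()> poll, std::chrono::milliseconds interval)
        : poll_(std::move(poll)), interval_(interval) {
        assert(poll_ && "poll callback must be callable");
        // Start the worker only after all members are initialized.
        worker_ = std::thread([this] { run(); });
    }

    ~Heartbeat() { stop(); }

    Heartbeat(const Heartbeat&) = delete;
    Heartbeat& operator=(const Heartbeat&) = delete;

    // Signals the worker to exit and waits until it has finished,
    // so no poll can run after stop() returns.
    void stop() {
        running_.store(false);
        if (worker_.joinable()) worker_.join();
    }

private:
    void run() {
        while (running_.load()) {
            poll_();
            std::this_thread::sleep_for(interval_);
        }
    }

    std::function<void()> poll_;
    std::chrono::milliseconds interval_;
    std::atomic<bool> running_{true};
    std::thread worker_;
};
```

In the reproducer, this would mean constructing a `Heartbeat` after `MPI_Init_thread` with a callback that calls `geterr()`, and calling `stop()` before `MPI_Finalize`.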
  • Thank you very much, this seems to fix the problem for the reproducer.
    Tuesday, November 20, 2018 8:33 AM