MSMPI with InfiniBand card has poor performance on Win10

  • Question

  • Aloha,

    We are using two Mellanox ConnectX-3 adapters connected through an IB switch. Each node is an Intel Xeon(R) CPU E3-1230 v3 with 4 cores, running Win10 with MS-MPI only (no HPC Pack).

    We try to send 400 MB of data from one node to the other using a simple MPI_Send/MPI_Recv pair, which takes about 2.8 s. That works out to only about 150 MB/s, which is far too slow for this hardware. Does anyone know how to improve it?
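
    For reference, the test is essentially the sketch below (a reconstruction rather than our exact code; timing is done with MPI_Wtime):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_BYTES (400 * 1024 * 1024)   /* 400 MB payload, i.e. count=419430400 */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = NULL;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = (char *)malloc(MSG_BYTES);

        /* Synchronize so the timed window covers only the transfer. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();

        if (rank == 0)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 4, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 4, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        t1 = MPI_Wtime();
        if (rank == 1)
            printf("400 MB in %.3f s => %.1f MB/s\n", t1 - t0, 400.0 / (t1 - t0));

        free(buf);
        MPI_Finalize();
        return 0;
    }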

    Thanks!

    Saturday, May 19, 2018 12:13 PM

Answers

  • This hints that NetworkDirect is not installed/configured correctly. Please see this post - https://social.microsoft.com/Forums/en-US/20aa7f04-2c58-4bb3-8c6d-12296a965934/typical-mpi-pingpong-latency-over-roce-with-networkdirect?forum=windowshpcmpi.

    Can you check your ND installation and make sure the ND benchmarks run fine?
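
    For example, something along these lines (assuming the NetworkDirect sample benchmark ndpingpong.exe from the ND SDK samples/HPC Pack utilities is present on both nodes; check its usage output for the exact arguments, as I am writing these from memory):

    ndpingpong.exe s <server_ip> 5000    (run on the server node first)
    ndpingpong.exe c <server_ip> 5000    (then run on the client node)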


    • Edited by JithinJos Wednesday, May 23, 2018 4:58 PM
    • Marked as answer by Freeny Q Thursday, May 24, 2018 8:10 AM
    Wednesday, May 23, 2018 4:58 PM

All replies

  • Can you try specifying these options to disable sockets and enable NetworkDirect: /env MSMPI_DISABLE_ND 0 /env MSMPI_DISABLE_SOCK 1?
    Please share the results of mpipingpong with these options, as in the example below.
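
    For example (a sketch; substitute your own hosts and the path to mpipingpong, whose own options are left at defaults here):

    mpiexec -hosts 2 <host1> <host2> -env MSMPI_DISABLE_ND 0 -env MSMPI_DISABLE_SOCK 1 mpipingpong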

    Thanks,
    Jithin

    Monday, May 21, 2018 5:17 PM
  • Thanks!

    I found this method as well, but something goes wrong when I use it. This is the result of the command:

    job aborted:
    [ranks] message
    [0] fatal error
    Fatal error in MPI_Send: Other MPI error, error stack:
    MPI_Send(buf=0x000001C680009070, count=419430400, MPI_CHAR, dest=1, tag=4, MPI_COMM_WORLD) failed
    unable to connect to 192.168.18.4 11.4.12.11 DESKTOP-ME7BMV4  on port 0, the socket interconnect is disabled
    [1] terminated
    ---- error analysis -----
    [0] on 11.4.12.10
    mpi has detected a fatal error and aborted F:\MSMPI\x64\Debug\MSMPI.exe
    ---- error analysis -----

    The command is:

    mpiexec.exe -hosts 2 11.4.12.10 11.4.12.11 -env MPICH_NETMASK 11.4.12.10/255.255.255.0 -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 -affinity "$(TargetPath)"

    Have you met this problem before? Or do you know whether any special configuration is needed on both hosts to enable ND mode?



    • Edited by Freeny Q Wednesday, May 23, 2018 12:54 PM
    Wednesday, May 23, 2018 12:40 PM
  • Very grateful, that helped me.
    Thursday, May 24, 2018 8:11 AM