MPI_Reduce is blocking for minutes in MSMPI v7

  • Question

  • Hello, I'm trying to implement new MPI code that uses an asynchronous reduce operation. I installed version 7 of MS-MPI and I have a runtime issue: when two processes are running on two different computers, the code stops executing on reduce operations for a few minutes, and then it continues executing correctly.

    I was not able to get confirmation that the data had been sent by MPI_Ireduce using MPI_Test or MPI_Wait, so I just put the MPI_Wait right after the MPI_Ireduce, and it never seemed to return (a rough sketch of that nonblocking version is included after the sample code below). I finally made a sample program that uses plain MPI_Reduce, and it behaves the same way: it blocks on the first call. However, after something like 5 minutes the call completes and everything starts working fine. Am I doing something wrong, or is MPI_Reduce in v7 not working?

    I have tested with the root on Windows 8.1 and the other computer on 8.1 or 7, and it's the same. I haven't tested with MS-MPI v6 yet; I'm not sure I can downgrade. I have no firewall activated on any machine.

    You'll find the code below. I run the two processes with the following command: mpiexec -hosts 2 machine1 machine2 'C:\Users\me\path\to\MPITests.exe'

    #include "stdafx.h"
    #define COUNT 262144
    
    using namespace std;
    
    int main(int argc, char **argv) {
    	setvbuf(stdout, NULL, _IOLBF, 80);
    
    	MPI_Init(&argc, &argv);
    	int rank;
    	int nproc;
    	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    	MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    	cout << "Node " << rank << "/" << nproc << " online." << endl;
    
    	float *f = new float[COUNT];
    	for (int i = 0; i < COUNT; i++) {
    		if (rank == 0)
    			f[i] = i / 2;
    		else f[i] = i;
    	}
    
    	float *r = new float[COUNT];
    
    	cout << "calling reduce" << endl;
    	MPI_Reduce(f, r, COUNT, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    	cout << "reduce ended" << endl;
    
    	if (rank == 0) {
    		for (int i = 0; i < 20; i++)
    			cout << r[i] << " ";
    		cout << endl;
    	}
    
    	MPI_Finalize();
    }
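
    For reference, the nonblocking version I first tried looked roughly like this (a sketch from memory, not my exact code, reusing the buffers from the sample above). The MPI_Wait here is the call that never seemed to return:

    	// Sketch only: start a nonblocking sum-reduction, then wait for it to complete.
    	MPI_Request req;
    	MPI_Ireduce(f, r, COUNT, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD, &req);
    	// ... other work could overlap with the reduction here ...
    	MPI_Wait(&req, MPI_STATUS_IGNORE);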

    Thanks!


    • Edited by 6trouille Friday, December 4, 2015 6:23 PM code simplified
    Friday, December 4, 2015 2:10 PM

Answers

  • Hi there,

    1) Does the issue happen if you run it within a single node (mpiexec -n 2 C:\Users\me\path\to\MPITests.exe)?

    2) If not, can you try running mpiexec -env MSMPI_DISABLE_SHM 1 -env MSMPI_DISABLE_SOCK 0 -n 2 C:\Users\me\path\to\MPITests.exe and see if the issue happens?

    3) If not, can you start the smpd daemon with a debug level of 3 (smpd -d 3), run the mpiexec command with -d 3 as well (mpiexec -d 3 -hosts 2 machine1 1 machine2 1), and monitor the output before and after things appear to hang? You mentioned it hangs for several minutes, so there should be enough time to copy the output to a text file.

    Thanks

    Anh

    • Marked as answer by 6trouille Friday, December 11, 2015 5:15 PM
    Wednesday, December 9, 2015 6:07 AM

All replies

  • Thanks for the answer, Anh.

    1) No problem when the two processes run on the same node.

    2) The issue doesn't happen with shared memory disabled and sockets enabled.

    3) I captured logs on both computers, but in the end you won't need them; I found the issue here:

    [01:7740] Handling SMPD_BCGET command from smpd 2
            ctx_key=0
            rank=0
            value=port=34628 description="192.168.52.1 192.168.43.1 ANOTHER_IP MACHINE_HOSTNAME " shm_host=MACHINE_HOSTNAME shm_queue=7476:176
            result=success

    The problem is that the two interfaces with IPs 192.168.52.1 and 192.168.43.1 are VMware virtual Ethernet adapters, and since I was testing with no VMware running, they simply don't respond. Disabling them, or using the MPICH_NETMASK environment variable (via -env) to select the correct interface, fixes the issue.
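
    For example, something like this (the subnet here is just a placeholder for my LAN; it should match the physical adapter you actually want MPI to use):

        mpiexec -env MPICH_NETMASK 192.168.1.0/255.255.255.0 -hosts 2 machine1 machine2 C:\Users\me\path\to\MPITests.exe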

    Thanks a lot!

    Friday, December 11, 2015 5:15 PM
  • Excellent! I'm glad the issue has been resolved!
    Thursday, December 17, 2015 10:30 PM