MPI code runs with 4 processors but not with 8 processors

    Question

  • Hi,

    I implemented a parallel algorithm in a computational fluid dynamics Fortran 90 code using MPI. It used to run with as many processors as I wanted (I tried up to 16). Recently I made some changes to the code to extend its capabilities. There is too much detail to post here, but what I can say is that after these changes the code no longer runs with 8 processors, although it still runs with 4. I'm not good at MPI debugging, so I prefer using 'pause' statements to find out where the problem occurs. However, the location of the problem seems inconsistent: in one run the code reaches the point where I put the 'pause' statement, but in the next attempt it doesn't get there and gives errors instead. This doesn't make sense to me, and I can't see what the problem is. I've spent a lot of effort trying to fix it and I'm getting frustrated. If any of you have suggestions for the steps I should take to solve this problem, I will deeply appreciate it.
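    For reference, what I mean by the 'pause' approach: I could replace each pause with a small rank-tagged checkpoint so that every processor reports how far it got before the crash. Below is only a minimal sketch of that idea (the routine name, the label argument and the unit number are placeholders, not pieces of my actual code):

        subroutine checkpoint(label, myrank)
          implicit none
          include 'mpif.h'
          character(len=*), intent(in) :: label
          integer, intent(in)          :: myrank
          integer :: ierr
          ! Each rank prints where it is, so the last line written by a dying
          ! rank shows how far that rank got.
          write(*,'(A,I4,2A)') ' rank', myrank, ' reached: ', label
          call flush(6)   ! compiler extension on many systems; Fortran 2003 has a FLUSH statement
          ! Optional barrier: a rank that never arrives shows up as a hang here.
          call MPI_BARRIER(MPI_COMM_WORLD, ierr)
        end subroutine checkpoint

    Calling something like checkpoint('after the solver step', myrank) before and after each communication step would narrow the failure down per rank without stopping the run the way PAUSE does.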

    Below is the error I received from a run with 8 processors; I can provide more detail if you need it.

     MPI initialization success
               8 processors were assigned for the task!
    FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> FORTRAN PAUSE
    PAUSE prompt> p1_17433:  p4_error: net_recv read:  probable EOF on socket: 1
    [ekaraism@nifty DREAMf90_cavity_scalar_qsou_par]$ rm_l_2_17571: (0.671875) net_send: could not write to fd=6, errno = 9
    rm_l_2_17571:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_5_17784: (0.277344) net_send: could not write to fd=6, errno = 9
    rm_l_5_17784:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_3_17642: (0.539062) net_send: could not write to fd=6, errno = 9
    rm_l_3_17642:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_6_17855: (0.148438) net_send: could not write to fd=6, errno = 9
    rm_l_6_17855:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_4_17713: (0.410156) net_send: could not write to fd=6, errno = 9
    rm_l_4_17713:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_7_17926: (0.019531) net_send: could not write to fd=6, errno = 9
    rm_l_7_17926:  p4_error: net_send write: -1
        p4_error: latest msg from perror: Bad file descriptor
    rm_l_1_17500: (0.800781) net_send: could not write to fd=5, errno = 32
    p1_17433: (0.804688) net_send: could not write to fd=5, errno = 32

     

    Wednesday, March 24, 2010 3:25 AM

Answers

  • This kind of symptom, with inconsistent errors and breaks in your code (due to your pause statements), is an indication of a race condition.

    However, without having seen your code I cannot know for sure.

    Liwei mentions using a Fortran debugger, which is good advice. I would also check the types of messages you are using within your MPI code (blocking vs. non-blocking); see the sketch below for what I mean. Another thing I would try is to see where it fails. Presumably it works on 1 through 4 processors; instead of jumping straight to 8, iterating from 5 through 8 might give you an indication of where the trouble is.
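    To make the blocking vs. non-blocking point concrete, here is a minimal, self-contained sketch (none of it is taken from your code; the buffer names and the ring topology are made up). If every rank calls a blocking MPI_SEND before posting the matching MPI_RECV, the exchange only works while the messages happen to fit in the MPI implementation's internal buffers, which is exactly the kind of thing that can change when you go from 4 to 8 processors. MPI_SENDRECV (or non-blocking MPI_ISEND/MPI_IRECV completed with MPI_WAIT) removes that dependence:

        program exchange_demo
          implicit none
          include 'mpif.h'
          integer, parameter :: n = 100000           ! large enough to exceed typical eager limits
          double precision   :: sendbuf(n), recvbuf(n)
          integer :: ierr, myrank, nprocs, left, right
          integer :: status(MPI_STATUS_SIZE)

          call MPI_INIT(ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
          call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

          left  = mod(myrank - 1 + nprocs, nprocs)   ! ring neighbours, for illustration only
          right = mod(myrank + 1, nprocs)
          sendbuf = dble(myrank)

          ! Unsafe pattern: every rank sends first, then receives.  Whether this
          ! deadlocks depends on message size and internal buffering, so it can
          ! "work" on 4 ranks and fail on 8.
          !   call MPI_SEND(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, ierr)
          !   call MPI_RECV(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, status, ierr)

          ! Safe equivalent: a combined send/receive that cannot deadlock.
          call MPI_SENDRECV(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                            recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                            MPI_COMM_WORLD, status, ierr)

          if (myrank == 0) write(*,*) 'exchange completed on', nprocs, 'processors'
          call MPI_FINALIZE(ierr)
        end program exchange_demo

    If your exchanges already use non-blocking calls, also check that every MPI_ISEND/MPI_IRECV is completed with MPI_WAIT (or MPI_WAITALL) before the buffers are reused; reusing a buffer too early produces exactly the kind of run-to-run inconsistency you describe.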

     

    Just some additional ideas - hope it helps.

     

    Mark

     

    Friday, April 16, 2010 4:59 PM

All replies

  • Hi, I recommend that you use a Fortran debugger to debug the issue. For example,

    - PGI has a visual Fortran debugger. http://www.pgroup.com/products/pvf.htm

    - Intel has one too. http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_suite_win_brief.pdf

    Hope the above tools help

    Liwei

     

    Wednesday, March 24, 2010 6:28 PM