Job Submit failed

Answers

  • There is a deadlock in your code, so the job still hangs.
    Look at the code in the if...else... block when myid == 0:

    if (i%4 == 0){
        phi = i-180;
    }
    else {
        ......
        MPI_Send ((void *)&phi, 1, MPI_INT, j, itag, MPI_COMM_WORLD);
    }

    Note that MPI_Recv is a blocking call. There is no MPI_Send() when i%4 == 0. However, MPI_Recv() is always called when myid != 0. So a deadlock happens:
    1. The root (myid = 0) does not call MPI_Send() when i%4 == 0;
    2. Every rank other than the root calls MPI_Recv() and waits for a message that the root is supposed to send.

    To solve the problem, you need to redesign your code so that every MPI_Recv() call in the other processes has a matching MPI_Send() from the root process.

    Another suggestion: don't hard-code the number of ranks in your cluster; use numprocs instead. (In the send loop, you had better use for (j = 1; j < numprocs; j++).)
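
The mismatch described above can be made concrete with a little arithmetic. The following is a minimal sketch in plain C, with no MPI calls, that only models the pattern. It assumes that each worker rank expects one message per loop iteration and that the root loops over i from 0 to 360, as in the loop quoted later in this thread. The function names (original_root_sends, fixed_root_sends, unmatched_receives, report) are hypothetical and are not taken from SARMPI.

```c
#include <stdio.h>

/* Original root logic: MPI_Send() is skipped whenever i % 4 == 0. */
int original_root_sends(int i)
{
    return i % 4 != 0;
}

/* Assumed redesign: the root sends on every iteration. */
int fixed_root_sends(int i)
{
    (void)i;
    return 1;
}

/*
 * Count the receives that never get a matching send.
 * Workers are ranks 1 .. numprocs-1; each one is assumed to call
 * MPI_Recv() once per iteration i in [first, last].
 */
int unmatched_receives(int first, int last, int numprocs,
                       int (*root_sends)(int))
{
    int unmatched = 0;
    for (int i = first; i <= last; i++) {
        if (root_sends(i))
            continue;               /* every worker gets its message */
        for (int j = 1; j < numprocs; j++)
            unmatched++;            /* rank j is short one message */
    }
    return unmatched;
}

/* Print a one-line report for a given root policy over i = 0..360. */
void report(const char *label, int numprocs, int (*root_sends)(int))
{
    printf("%s: %d unmatched receive(s)\n", label,
           unmatched_receives(0, 360, numprocs, root_sends));
}
```

Calling report("original", 4, original_root_sends) and report("fixed", 4, fixed_root_sends) gives a non-zero count for the original pattern and zero for the assumed redesign. Any non-zero count means some worker ends up blocked in MPI_Recv(), which is consistent with the job staying in the running state.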

    Thanks,
    James


    Thank you!
    I will try your approach and then report back to you.

    The problem was solved by redesigning the code:
    for (i = 0; i <= 360; i++)
    • Marked as answer by YuJinSu Wednesday, March 3, 2010 12:32 PM
    Monday, March 1, 2010 7:12 AM

All replies

  • Hi Yujin,

    The error code comes from a missing VC CRT library on some or all of the compute nodes.

    Suggestions:
    1) Log on to each compute node and run your program directly from the command line: mpiexec -n 2 SARMPI.exe. If this passes, then go ahead and run it from the job scheduler.
    2) Or use clusrun to check which compute nodes are OK: clusrun mpiexec -n 2 SARMPI.exe

    Hope it helps,

    Liwei
    Thursday, February 25, 2010 3:50 AM
  • I remember that the first time a job is submitted, you are asked to enter the account password, but there was no prompt.
    Is that the reason?
    http://img444.imageshack.us/img444/5605/123yf.jpg

    Thursday, February 25, 2010 7:50 AM
  • Installing Microsoft Visual Studio 2008 on my compute node solved that problem.
    Logging on to each compute node and running the program directly from the command line (mpiexec -n 2 SARMPI.exe) is successful.

    But when I use HPC Cluster Manager, the job state is always Running.
    How can I solve the problem?

    http://img40.imageshack.us/img40/6138/picturebc.jpg 
    Thursday, February 25, 2010 11:47 AM
  • The MPI job hangs, so the job state stays Running. I wonder how it could succeed with "mpiexec -n 2 SARMPI.exe".
    I just took a quick look at your source code; at least one point will cause a hang:

    if (myid == 0)
    {
        doSomething();
    }  
    else
    {
        doSomethingElse();
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Barrier will not return until all the MPI processes have reached it. However, there is no way that rank 0 can reach this else block.
    Remove the call and try again.
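
The barrier rule above can be checked with a small model. This is a sketch in plain C with no MPI calls; it only records which ranks would reach the barrier. Placing the barrier after the if/else is an assumed alternative to removing it, not something suggested in this thread, and the function names are hypothetical.

```c
/* Original structure: the barrier sits inside the else branch,
 * so only ranks with myid != 0 ever call it. */
int reaches_barrier_original(int myid)
{
    return myid != 0;
}

/* Assumed alternative: the barrier is placed after the if/else,
 * so every rank calls it regardless of the branch taken. */
int reaches_barrier_moved(int myid)
{
    (void)myid;
    return 1;
}

/*
 * A barrier can complete only if every rank 0 .. numprocs-1 calls it.
 * Returns 1 if all ranks reach the barrier, 0 otherwise.
 */
int barrier_completes(int numprocs, int (*reaches)(int))
{
    for (int myid = 0; myid < numprocs; myid++) {
        if (!reaches(myid))
            return 0;   /* this rank never arrives; the others wait */
    }
    return 1;
}
```

With the original structure, barrier_completes() returns 0 for any number of ranks because rank 0 never arrives; with the barrier moved out of the branch it returns 1.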

    Thanks,
    James

    Thursday, February 25, 2010 6:11 PM
  • Thanks!
    But I tried removing MPI_Barrier(MPI_COMM_WORLD);
    and the job state still stays Running.
    Friday, February 26, 2010 2:43 AM
  • There is a deadlock in your code, so the job still hangs.
    Look at the code in the if...else... block when myid == 0:

    if (i%4 == 0){
        phi = i-180;
    }
    else {
        ......
        MPI_Send ((void *)&phi, 1, MPI_INT, j, itag, MPI_COMM_WORLD);
    }

    Note that MPI_Recv is a blocking call. There is no MPI_Send() when i%4 == 0. However, MPI_Recv() is always called when myid != 0. So a deadlock happens:
    1. The root (myid = 0) does not call MPI_Send() when i%4 == 0;
    2. Every rank other than the root calls MPI_Recv() and waits for a message that the root is supposed to send.

    To solve the problem, you need to redesign your code so that every MPI_Recv() call in the other processes has a matching MPI_Send() from the root process.

    Another suggestion: don't hard-code the number of ranks in your cluster; use numprocs instead. (In the send loop, you had better use for (j = 1; j < numprocs; j++).)

    Thanks,
    James

    Friday, February 26, 2010 7:01 AM