HPC Server 2008: problems occur when submitting MPI jobs

  • Question

  •  Hi, everyone,

    I have just installed HPC Pack 2008 Beta 2 and written a helloworld program with VS2005. It compiles and builds fine, but when I try to submit the job through Job Management in HPC Cluster Manager, it fails with this error:


    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in MPI_Send: Invalid rank, error stack:
    MPI_Send(175): MPI_Send(buf=0x000000000012FEE8, count=17, MPI_CHAR, dest=1, tag=99, MPI_COMM_WORLD) failed
    MPI_Send(101): Invalid rank has value 1 but must be nonnegative and less than 1

    I assigned this job two cores on just one compute node. When I changed the resource type to node, the same error happened again. I really don't know why.

    HPC clusters are completely new to me. Could anyone give me some suggestions?
    My job details are as follows:
    Job name hello
    Job template Default

    Task List
    task name Hello Task
    Command line helloworld.exe
    Working Directory //PCCluster4/c/mpidebug/helloworld/x64/debug

    best wishes

    Friday, July 18, 2008 6:24 AM

All replies

  • Another thing:
    When I used this command, something happened:

    'mpiexec -d 2 C:\mpidebug\helloworld\x64\debug\helloworld.exe'

    Does that mean it is OK?
    • Edited by Casey.Zhang Friday, July 25, 2008 3:44 PM delete code
    Friday, July 18, 2008 6:36 AM
  • Hi, Casey.

    This:    
        [0] fatal error
        Fatal error in MPI_Send: Invalid rank, error stack:
    means that you've attempted to send data to a non-existent rank. 

    The problem was caused by:
        Command line helloworld.exe
    which ran helloworld.exe without the MPI stack. As a result, only one rank was created on one of the nodes, and the send to rank 1 failed. The HPCS2008 scheduler is a general-purpose scheduler capable of running many types of jobs (not just MPI jobs), so you need to call out use of the MPI stack in your command line, like this:
        Command line    mpiexec helloworld.exe

    Your second question was about using "mpiexec -d ...". You see that much output because you asked MS-MPI to run in debug mode (-d). But because you used mpiexec in the command line, helloworld.exe probably ran just fine.


    Eric Lantz (Microsoft)
    Friday, July 18, 2008 6:51 PM
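
The "Invalid rank" message quoted above states the rule directly: the destination rank must be nonnegative and less than the number of ranks in the communicator. The following is a minimal sketch of that check in plain C. The helper names `rank_is_valid` and `check_send_target` are hypothetical (they are not part of MS-MPI), and the sketch deliberately avoids MPI calls so that it compiles without the MPI headers.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the condition in the error text: a destination rank must be
   nonnegative and less than the communicator size. */
int rank_is_valid(int dest, int comm_size)
{
    return dest >= 0 && dest < comm_size;
}

/* Hypothetical helper: returns 1 and clears buf if dest exists; otherwise
   returns 0 and writes a short diagnostic into buf. */
int check_send_target(int dest, int comm_size, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (rank_is_valid(dest, comm_size)) {
        buf[0] = '\0';
        return 1;
    }
    snprintf(buf, len,
             "rank %d does not exist: %d rank(s) started, valid ranks are 0..%d",
             dest, comm_size, comm_size - 1);
    return 0;
}
```

In an MPI program, `comm_size` would typically come from `MPI_Comm_size(MPI_COMM_WORLD, &size)`. With one process, as when the task runs `helloworld.exe` without `mpiexec`, the size is 1 and a send to rank 1 fails this check, which matches the error in the original post.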
  •  Hi, Eric
    Thanks for your advice. I created a new job as you described, but the job was canceled with this error message: Canceled by the scheduler because the job's run time expired.

    I really don't know why; it's just a simple 'hello' program. Is any specific configuration needed for MSMPI to work with VS2005?
    Thursday, July 24, 2008 3:13 PM
  • Hi Casey,

    No specific configuration is required for MSMPI to work with VS2005.

    The error message you saw usually means that a run time limit was set for the job and the job had not finished when that limit expired.

    Can you send me the output from "job view JOBID /detailed"?

    Thanks,

    Liwei Peng (Microsoft HPC)

    Thursday, July 24, 2008 10:03 PM
  • Hi, Liwei
    I didn't see anything useful in View Job. The task details show no pending reason. All I get is this:
    TaskID: 1          Unit type: core         Run time: Infinite
    Rerunnable: True   Max: 4
    Exclusive: False      Min:  1

    There is no pending reason.

    Command line: mpiexec hello.exe
    Working directory: \\PCCluster4\c\mpidebug\hello\x64\debug

    I tried to remove the run time limit, but then the job stayed in the Running state and never stopped.


    Then I right-clicked the head node in Node Management, chose 'Run a command', entered 'mpiexec -d 2 \\PCCluster4\c\mpidebug\hello\x64\debug\hello.exe', and clicked 'Run'. I got this message:

    PCCLUSTER4 -> Failed

    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in MPI_Send: Invalid rank, error stack:
    MPI_Send(175): MPI_Send(buf=0x000000000012FE28, count=17, MPI_CHAR, dest=1, tag=99, MPI_COMM_WORLD) failed
    MPI_Send(101): Invalid rank has value 1 but must be nonnegative and less than 1

    ---- error analysis -----

    [0] on PCCLUSTER4
    mpi has detected a fatal error and aborted \\PCCluster4\C\mpidebug\hello\x64\debug\hello.exe

    ---- error analysis -----
    [00:1908] last process exited, tearing down the job tree.
    [00:1908] posting a read for a command header on the left child context, sock 328
    [00:1908] wrote command
    [00:1908] command written to left child: "cmd=close src=0 dest=1 tag=6 "
    [00:1908] read command header
    [00:1908] command header read, posting read for data: 31 bytes
    [00:1908] read command
    [00:1908] read command: "cmd=closed src=1 dest=0 tag=5 "
    [00:1908] handling command:
    [00:1908]  src  = 1
    [00:1908]  dest = 0
    [00:1908]  cmd  = closed
    [00:1908]  tag  = 5
    [00:1908]  ctx  = left child
    [00:1908]  str  = cmd=closed src=1 dest=0 tag=5
    [00:1908] 0 -> 0 : returning NULL context
    [00:1908] closed command received from left child, closing sock.

    Task failed during execution. Please check task's output for error details.
    ------------------------------------------------------------------------------------------------------------------------

    • Edited by Casey.Zhang Friday, July 25, 2008 3:46 PM delete code
    Friday, July 25, 2008 7:27 AM
  • Hi, 
    I think the first two problems have been solved. Maybe I made some stupid mistakes in configuring VS2005 with MSMPI. In the end I followed these instructions, http://www.cs.utah.edu/~delisi/vsmpi/, and changed the platform to Active(x64).
    I would appreciate it if you could check them to make sure the instructions are right.

    The program worked just fine on the head node, but when I assigned this 'hello.exe' to run on the cluster nodes, it failed with this message:

    Aborting: failed to launch 'test.exe' on PCCluster2
    Error (14001) The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log for more detail.

    I don't know why...
    Friday, July 25, 2008 3:57 PM
  • Hi Casey,

    This error happens because the debug version of the CRT (C Runtime) is not deployed to the cluster, and you are running the debug version of your app. Do one of the following:
    1. run the release version,
    2. link with the static version of the CRT,
    3. remove the manifest from your binary (or the manifest file from the Debug folder).

    Note that if you are using VS2008, you need a similar solution, as the VS2008 binaries are not there by default with Windows HPC Server.

    thanks,
    .Erez
    Sunday, July 27, 2008 3:56 AM
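
To see which of the cases above a given build falls into, one option is to have the program report the CRT it was compiled against. The sketch below assumes the MSVC predefined macros `_DEBUG` (set by /MDd, /MTd and /LDd) and `_DLL` (set by /MD and /MDd). The function names `crt_flavor` and `print_crt_flavor` are hypothetical and only serve as illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Describes the C runtime this translation unit was compiled against,
   based on macros that MSVC predefines for the /MD, /MDd, /MT, /MTd options. */
const char *crt_flavor(void)
{
#if !defined(_MSC_VER)
    return "not built with MSVC";
#elif defined(_DEBUG) && defined(_DLL)
    return "debug DLL CRT (/MDd)";    /* needs the debug CRT DLLs on every node */
#elif defined(_DEBUG)
    return "debug static CRT (/MTd)"; /* CRT is linked into the executable */
#elif defined(_DLL)
    return "release DLL CRT (/MD)";   /* needs the release CRT DLLs on every node */
#else
    return "release static CRT (/MT)";
#endif
}

/* Prints the result so it shows up in the task output. */
void print_crt_flavor(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "CRT: %s\n", crt_flavor());
}
```

A build that reports the debug DLL CRT is the configuration described in the reply above: it starts on the development machine, where Visual Studio provides the debug runtime, but may fail to launch on compute nodes that lack it. Calling `print_crt_flavor(stdout)` at the start of `main` would be one way to confirm which build was actually submitted.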
  • Thanks, Erez.
    It works! Thank you for helping!
    Monday, July 28, 2008 12:14 PM