Jobs run on head node only

  • Question

  • Hello,

    I am a student trying to set up a 3-node HPC cluster.  The cluster is configured, but when a job is run, it only runs on the head node.  The Job Manager Activity Log for the job shows two attempts to contact the compute nodes, reporting that the job "started" and "ended" on them, but the "Hello World" output shows the job running successfully only on the head node.  (Also, if I request 3-3 nodes, the job fails; the job described above used either an auto resource request or a minimum of 1 node.)

    More Info:

    My network topology is "5: enterprise network".  In Cluster Manager, all the nodes are shown as "online", but they do have warnings from diagnostic tests (I don't currently have privileges to run diagnostics, so I have no further detail on these, but I could get in contact with the admin about it).

    Hardware Details:

    Head node is an Intel Pentium 4 (2 cores); compute nodes are Intel Xeon (4 cores each). 

    Software Details (same on all 3 nodes):

    HPC Pack 2008 R2 (jobs submitted using both Job Manager and HPC PowerShell)

    Programming in C++ with MS-MPI in Microsoft Visual Studio 2012; the OS is Windows Server 2008 R2 Enterprise 64-bit. 

    Can anyone offer any suggestions on why this is happening and what could be done to fix it?  Thanks for your time,

    Lucas

    Monday, October 7, 2013 2:26 AM


All replies

  • Hi Lucas,

    What is your job submit/mpiexec command?

    If you request 3 nodes, what is the error you are getting?

    Thanks,
    -Fab

    Monday, October 7, 2013 10:35 PM
  • Hi Fab,

    I've used the Job Manager GUI more than the command line, but I get the same results either way.  My commands were (sorry, I was getting an error message when I requested NumNodes, so I used an equivalent NumCores request):

    >$MyJob = New-HpcJob -Name "testjob" | Add-HpcTask -commandline "helloworld.exe" -workdir "C:\" -stdout "testjob_out.txt" -NumCores "1-10"

    >Submit-HpcJob -id 350

    When I run the job like this, it submits, runs, and finishes.  However, it only runs on the head node, and the output file says "rank 0 of a 1 processor job".  If I change the request to -NumCores "10-10" (the total for the 3 nodes), I get an error that the job failed because "Job failed to start on some nodes or some nodes became unreachable". 

    Thanks again,

    Lucas

    Thursday, October 10, 2013 5:32 PM
  • Hi Lucas,

    A couple of things:

    1. If you want to run a multi-node MPI job, you need to start your MPI application with mpiexec.  If you invoke your MPI executable directly, you get a singleton MPI job (or multiple singletons if you run on multiple cores/nodes).  Your command line should be "mpiexec helloworld.exe", which results in one process for each core you requested (in your example above you would have 10 instances all working together).  If you request nodes instead of cores, you will need to add the -n or -c parameter to your mpiexec command line to specify how many instances of the program to run, since by default you will get one per node.  (A sketch of what such a hello-world program typically looks like appears at the end of this reply.)

    2. You will need to make sure that the firewall is configured to allow your application to work as a multi-node MPI job, since I suspect you have the firewall enabled on your enterprise (and only) network.

    Have you validated that the multi-node MPI diagnostic tests work?  You can try "job submit /numnodes:3 mpiexec mpipingpong -pl -op nul", which should report the latency between the nodes.  You can view the task output with "task view <jobid>.1".
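
    For reference, here is a minimal sketch of the kind of MPI hello-world test program being discussed (the actual source isn't shown in this thread, so the file name and message text below are illustrative); it just prints its rank, the job size, and the node it landed on:

        // helloworld.cpp - illustrative MS-MPI sketch; build in Visual Studio and link against msmpi.lib
        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char* argv[])
        {
            MPI_Init(&argc, &argv);

            int rank = 0, size = 0, hostLen = 0;
            char host[MPI_MAX_PROCESSOR_NAME];

            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank within the job
            MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes in the job
            MPI_Get_processor_name(host, &hostLen); // name of the node this rank is running on

            std::printf("Hello from rank %d of a %d processor job on %s\n", rank, size, host);

            MPI_Finalize();
            return 0;
        }

    Launched as "mpiexec helloworld.exe" on a 10-core allocation, this prints ranks 0 through 9 across the nodes; launched without mpiexec, each copy reports rank 0 of a 1 processor job, which matches the output described above.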

    Hope that helps,
    -Fab

    • Marked as answer by LT 3 Tuesday, October 15, 2013 6:14 PM
    Thursday, October 10, 2013 6:39 PM
  • Hi Fab,

    Thanks! When I ran "mpiexec -n 3 helloworld.exe", I got the three computers to communicate (i.e., the output says "part of a 3 processor job").

    However, when I use the "job submit" command you suggested for the diagnostics, I get the same errors I had before.  I tried "job submit /numnodes:3 mpiexec mpipingpong -pl -op nul" as you suggested, and I also tried "job submit /numnodes:3 mpiexec helloworld.exe".  Run this way, both jobs fail with the same message I mentioned above: "Job failed to start on some nodes or some nodes became unreachable".  (But I was able to run it successfully on one node again: "job submit /stdout:hello-out.txt /requestednodes:<headnode> /workdir:"C:\" mpiexec -n 1 helloworld.exe".)

       Do you have any more suggestions for how I might be able to resolve this and get jobs to submit correctly?

    Thanks again, your advice has been very helpful so far,

    Lucas

    Monday, October 14, 2013 1:39 AM
  • Hi,

    Update: I was able to iron out the rest of the issues.  There were two details I had overlooked; fixing them solved the problem:

    1. The account submitting the job needed to have local log-on permissions for each of the nodes.

    2. The account needed write permission on the current working directory.

    When I fixed those two details, the jobs ran exactly as I wanted (using both the command lines in the comments above and the Job Manager GUI).

    Thanks again for the help, Fab. 

    • Marked as answer by LT 3 Tuesday, October 15, 2013 6:14 PM
    Tuesday, October 15, 2013 6:11 PM