Submitting a job on HPC (distributing across all nodes; currently it runs only on the first available node)

  • Question

  • I have an HPC setup with one head node and two workers (this is just a test for a bigger project that we are carrying out here at TTU).

    We have a program that receives some pictures, processes them, and produces new pictures from them. Right now, submitting a job works fine, and I can get all of the cores (16) to run the application on the head node or on one of the workers. But our ultimate goal is to get all of the nodes to run this application simultaneously. In other words, I would like to submit jobs the way one would on an SSI (Single System Image) system. I know that HPC doesn't have the same architecture as SSI, but it would be great if anyone could help me with this.

    P.S. When I assign other nodes for the job to run on, only the first available node receives the job, and the rest of the system, including the head node, stays idle.

    Thanks for any advice.

    Wednesday, July 31, 2013 5:17 AM

All replies

  • Is it an MPI program? Do you use "job submit" or clusrun to dispatch your jobs? Are the firewalls on your worker nodes off?

    Daniel Drypczewski

    Thursday, August 1, 2013 4:26 AM
  • The code is MPI, and I use job submit to submit the job. As for the firewall, I will check on that, but I'm sure I tried it with the firewall off.
    Thursday, August 1, 2013 5:27 PM
  • First, let's confirm connectivity between the nodes in your cluster configuration.

    Say you have two 16-core nodes (NODE1, NODE2) and the head node is NOT configured as a compute node (I suggest not using the head node for computation). When you run the command below, you expect your job to run on all available cores in the cluster (32 cores):

    job submit /numcores:32 mpiexec calc.exe

    To confirm how many instances of calc.exe are running on the nodes, type:

    tasklist /s NODE1 | find /c "calc"  --> should be 16

    tasklist /s NODE2 | find /c "calc"  --> should be 16

    If it is not 16 in both cases, there may be a connection problem.
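    The per-node check can also be scripted. The following is a minimal Python sketch, not part of the original reply. It assumes Python is available on the machine the check is run from, and the helper names `count_instances`, `query_node`, and `report` are made up for illustration. It asks `tasklist` for CSV output (`/fo csv /nh`) so the image name can be read from the first column. Unlike `find /c "calc"`, it matches only image names that start with the given prefix.

```python
import csv
import io
import subprocess


def count_instances(tasklist_csv, image_prefix):
    """Count rows of `tasklist /fo csv /nh` output whose image name starts with image_prefix."""
    reader = csv.reader(io.StringIO(tasklist_csv))
    prefix = image_prefix.lower()
    # The first CSV column is the image name, e.g. "calc.exe".
    return sum(1 for row in reader if row and row[0].lower().startswith(prefix))


def query_node(node):
    """Run tasklist against a remote node and return its CSV output (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/s", node, "/fo", "csv", "/nh"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report(nodes, image_prefix="calc"):
    """Print the number of matching processes found on each node."""
    for node in nodes:
        print(node, count_instances(query_node(node), image_prefix))
```

    Calling `report(["NODE1", "NODE2"])` while the job is running should print 16 next to each node name if the job is spread across both nodes as expected.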

    You can kill the running calc.exe instances on your nodes by typing:

    taskkill /s NODE1 /f /im calc.exe

    taskkill /s NODE2 /f /im calc.exe

    If you see 16 in both cases, the cluster configuration is fine, and you should check whether your application is able to run on all the cores. For example, if the tasks are short, a core may be freed before the next task arrives and then reused, giving the impression that some cores don't work as you would expect.
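    Because short tasks can finish between two checks, a single snapshot of the process count may understate how many cores are actually used. One way to reduce that effect is to sample several times and keep the highest count seen per node. The following is a minimal Python sketch under that assumption; `peak_counts` and `sample` are hypothetical helper names, and `count_fn` stands for any function that returns the current process count for a node (for example, one wrapping the tasklist command shown earlier in this reply).

```python
import time


def peak_counts(snapshots):
    """Return the highest count seen per node across a list of {node: count} snapshots."""
    peaks = {}
    for snapshot in snapshots:
        for node, count in snapshot.items():
            peaks[node] = max(peaks.get(node, 0), count)
    return peaks


def sample(count_fn, nodes, rounds=10, interval=0.5):
    """Call count_fn(node) for every node, `rounds` times, pausing `interval` seconds between rounds."""
    snapshots = []
    for i in range(rounds):
        snapshots.append({node: count_fn(node) for node in nodes})
        if i < rounds - 1:
            time.sleep(interval)
    return snapshots
```

    If the peak for every node reaches 16 over a run even though individual snapshots show fewer, the cores are probably being used and released quickly, not left idle.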

    Daniel Drypczewski

    Friday, August 2, 2013 2:06 AM
  • I ran the commands above and saw what you described, so the setup is working fine. Is there any package that could make HPC behave like a Single System Image?
    Friday, August 2, 2013 4:41 PM