HPC 2012 numnodes:1 problem

  • Question

  • Hi,

    Can anyone point me in the right direction for this problem? We have an HPC 2012 cluster set up with around 50 nodes, and I am submitting a test job while the cluster is almost empty - lots of totally idle nodes.

    When I submit with job submit /scheduler:hpc /numnodes:1 ... the job queues, reporting insufficient resources, even though there are many nodes entirely idle.

    When I submit with job submit /scheduler:hpc /numcores:12 ... (where the nodes are 12-core), the job runs straightaway, using all the cores on a 12-core node.
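
    For illustration, the two command lines in full look something like this, with myapp.exe standing in for the real application - the first one queues, the second one runs straight away:

        job submit /scheduler:hpc /numnodes:1 myapp.exe
        job submit /scheduler:hpc /numcores:12 myapp.exe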

    Any ideas on how I should go about fixing this?

    Thanks,
    Wes

    Monday, April 8, 2013 2:42 PM

Answers

  • SOLVED!

    You have to add /singlenode:false if you are requesting /numnodes:1, as the default for /singlenode is true, and that specifically doesn't work with /numnodes:1.
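
    So, concretely, the submission that now works for me looks something like this (myapp.exe is just a stand-in for the real application):

        job submit /scheduler:hpc /numnodes:1 /singlenode:false myapp.exe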

    Interestingly, if you request /numnodes:2, without saying anything about /singlenode, then you get an error message "A single node job cannot have a minimum requirement of more than 1 node". But you don't get the error message if you have /numnodes:1, without specifying /singlenode.

    I would suggest this is very counter-intuitive, and should be fixed so that /numnodes:1 and /singlenode:true are allowed to work together, since they imply such similar things. Certainly having to specify "false" when you really mean the opposite is not right.

    Thanks for your thoughts, Daniel - for info, the HPC log was clear (as we'd expect), and this behaviour didn't happen on HPC 2008 R2, because that doesn't support /singlenode, so my scripts worked fine on our older 2008 R2 cluster.

    Wes


    • Marked as answer by WesHinsley Friday, April 19, 2013 12:19 PM
    Friday, April 19, 2013 12:19 PM

All replies

  • How about other /numnodes settings, for example /numnodes:1-1 or /numnodes:1-5?

    When you set /numnodes:1, the HPC scheduler picks one node of your cluster and runs your job there. If there is any connectivity problem with that node, your job may end up stuck queueing.
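
    For example, the min-max range forms would be submitted something like this (myapp.exe is just a placeholder for your application):

        job submit /scheduler:hpc /numnodes:1-1 myapp.exe
        job submit /scheduler:hpc /numnodes:1-5 myapp.exe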


    Daniel Drypczewski


    Tuesday, April 16, 2013 1:29 AM
  • No, those /numnodes settings do the same thing - the job gets queued waiting for resources even when there are free nodes. There's no connectivity problem either: if I arrange it so that a single node is completely free and then use /numcores:12-12 /singlenode:true, I get exactly the node I wanted - but I couldn't get it using /numnodes:1.
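
    For reference, the workaround that does land on the free node looks something like this (again, myapp.exe is just a placeholder):

        job submit /scheduler:hpc /numcores:12-12 /singlenode:true myapp.exe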

    It seems to me it specifically affects /numnodes: requests.

    Wes

    Thursday, April 18, 2013 3:27 PM
  • I don't have an HPC 2012 cluster environment, but on an HPC 2008 cluster "job submit /numnodes:1 hostname" works fine. Maybe this is an issue with HPC 2012. I assume in your environment the job gets stuck in the queued state.

    We need somebody with an HPC 2012 cluster to confirm whether /numnodes:1 works or not.

    You may also check the HPC scheduler log (Event Viewer\Applications and Services Logs\Microsoft\HPC\Scheduler) on your head node.
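
    If it's easier from a command prompt on the head node, something like the following should list the HPC event logs and then dump the most recent scheduler entries - the exact log name varies, so take it from the first command's output:

        wevtutil el | findstr /i "HPC"
        wevtutil qe <scheduler-log-name> /c:20 /rd:true /f:text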


    Daniel Drypczewski

    Friday, April 19, 2013 1:49 AM
  • I will probably have the same problem in the future when I upgrade my cluster to the 2012 version.

    Thanks for pointing this out.


    Daniel Drypczewski

    Thursday, April 25, 2013 1:47 AM