HPC Nodes not accepting jobs

  • Question

  • Hello,

    This past weekend, a user submitted upwards of 4,000 jobs to run on a group of 8 nodes. That same evening, another user's nightly job was submitted to a different group of 160 nodes (none of the nodes overlap). The next day, yet another user submitted a weekly job that doesn't request a specific group of nodes and will take any available.

    When I arrived at the office this morning, the weekly job and the nightly job were stuck in a queued state for seemingly no reason. Additionally, about half of the nodes in the group of 160 currently show 8 cores in use in the cluster manager, even though only about a dozen single-node jobs are in the running state. The remaining jobs are queued even though the other half of the group has all cores available.

    We checked the event logs and parsed the binary HPC logs, and nothing really stood out to us.

    What might have caused this? How can we further troubleshoot? What is the best way to prevent these jobs from queuing when there are resources available?

    Thank you

    Monday, July 18, 2016 10:31 PM

Answers

  • 1. You can run the command "node listcores" to check the status of all cores and see which job/task is actually occupying each one (see the PowerShell sketch after this list).

    2. For the nightly job and the weekly job, check their pending reason to see whether it is "not enough resources".

    3. I suppose most of the 4,000 jobs are still in the Queued state? By default the scheduler has a backfilling setting of 1,000 jobs, which means that, for performance reasons, it only searches the first 1,000 jobs in the queue to check whether any of them can be scheduled onto the free resources. You can raise that number, or have the scheduler search the entire queue, through "Job Scheduler Configuration --> Backfilling" (also shown in the sketch below).

    4. A better long-term approach is to educate your users not to flood the scheduler with jobs. We are also going to provide samples that help prevent this from happening.
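
    For reference, here is a minimal PowerShell sketch of the checks in points 1-3. It assumes the HPC Pack PowerShell snap-in (Microsoft.HPC) is available and is run from (or pointed at) the head node; the head node name "HEADNODE01" is a placeholder, and the exact name of the backfill-related cluster property can vary between HPC Pack versions, so filter for it first rather than assuming it.

        # Load the HPC Pack cmdlets (assumes the HPC Pack client or head-node tools are installed).
        Add-PSSnapin Microsoft.HPC

        $headNode = "HEADNODE01"   # placeholder - replace with your head node name

        # Point 1: overview of node state/health; for a per-core view of which job/task
        # holds each core, the CLI command "node listcores" gives the same information.
        Get-HpcNode -Scheduler $headNode |
            Select-Object NetBiosName, NodeState, NodeHealth |
            Format-Table -AutoSize

        # Point 2: list the queued jobs (the nightly and weekly jobs among them) so you
        # can open them in Cluster Manager and read their pending reason.
        Get-HpcJob -State Queued -Scheduler $headNode |
            Format-Table Id, Name, Owner, State -AutoSize

        # Point 3: find the backfilling-related scheduler setting before changing it.
        Get-HpcClusterProperty -Scheduler $headNode |
            Where-Object { $_.Name -like "*Backfill*" }

        # Only after confirming the property name above, raise it (example value), e.g.:
        # Set-HpcClusterProperty -Scheduler $headNode -BackfillLookUp 5000

    The same checks can also be done entirely in the Cluster Manager GUI; the script is just a quicker way to see all queued jobs and the current backfill limit at once.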


    Qiufang Shi

    Tuesday, July 19, 2016 2:46 AM