Drain nodes on task level

    Question

  • How do I tell an HPC node not to take any more tasks?

    If I use Set-HpcNodeState, the nodes apparently wait for running jobs to complete.

    Tuesday, September 23, 2014 10:52 AM

All replies

  • Hi Thomas,

    Thanks for the question. Have you tried the "-Force" option of the cmdlet? It forces the node offline, which means the tasks running on that node will be cancelled and requeued on other nodes. Without "-Force" the node goes into the draining state: no new tasks will be scheduled on it, and it waits for the running tasks to complete so that no data is lost.
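
    To illustrate the difference, a minimal sketch (the node name "NODE01" is a placeholder; Set-HpcNodeState and Get-HpcNode are the HPC Pack cmdlets):

        # Drain: the node stops taking new jobs and waits for running tasks
        Set-HpcNodeState -Name "NODE01" -State Offline

        # Force: running tasks are cancelled and requeued on other nodes
        Set-HpcNodeState -Name "NODE01" -State Offline -Force

        # Check what state the node is in
        Get-HpcNode -Name "NODE01" | Format-List Name, NodeState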

    Qiufang


    Qiufang Shi

    Thursday, September 25, 2014 2:24 AM
  • Hi Qiufang,

    Thanks for taking the time to read and reply to my question.

    I am aware of the "-Force" option which is well documented.

    Without the "-Force" option, Windows HPC Server 2008 R2 does not start any new jobs on the node (as documented). However, it continues to start new tasks from jobs that are already running.

    I am looking for a way to let a node gracefully complete running tasks without starting any new tasks.

    /Thomas

    Thursday, September 25, 2014 6:09 AM
  • Hi

    I am not an expert, but my hunch is that this is by design. The Job Scheduler deals with jobs as its smallest unit and will not reschedule individual tasks to other nodes unless the node goes offline. At that point the job is in jeopardy of failing, so it resubmits that task to another available node, but AFAIK there is no mechanism to move a running job to another node.

    One way to do what you want would be to make your workload one task per job (or the minimum number of tasks needed); see the sketch below.
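
    For example, a minimal sketch of the one-task-per-job approach (assuming the HPC Pack cmdlets New-HpcJob, Add-HpcTask and Submit-HpcJob; worker.exe and inputs.txt are placeholders):

        # Submit each work item as its own single-task job, so a draining
        # node never holds queued tasks from an already-running job
        foreach ($item in Get-Content .\inputs.txt) {
            $job = New-HpcJob -Name "worker $item"
            Add-HpcTask -Job $job -CommandLine "worker.exe $item" | Out-Null
            Submit-HpcJob -Job $job
        }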

    The only other way I can think of to achieve this (and it depends on how much control you have over task creation; if an application is creating the workflow you may be out of luck) is to interleave your worker tasks with 'watchdog' tasks.

    Consider a job with three tasks:

    Task1
    Task2
    Task3

    Now change this chain to be:

    Task1
    WatchdogTask
    Task2
    WatchdogTask
    Task3
    WatchdogTask

    Write the watchdog task so that it checks whether the current node is draining and, if so, offlines the node immediately using -Force; that is, the task offlines the node itself.

    The next task will then be rescheduled to another node. You might have to play about with making the watchdog task wait and do nothing at the end to give the cluster manager time to offline the node. Or, come to think of it, the watchdog could be as simple as:

    - Am I draining?
    - If so, force myself offline
    - Wait

    The cluster manager offlines the node, the currently running task [the watchdog] is rescheduled on another node, and on the new node the 'Am I draining?' check fails, so it goes on to the next task.
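
    Untested, but a sketch of such a watchdog in PowerShell might look like this (assuming the HPC cmdlets are available on the compute node and that Get-HpcNode reports a Draining state; the sleep duration is a guess):

        # Hypothetical watchdog.ps1: offline this node if it is draining
        $name = $env:COMPUTERNAME
        $node = Get-HpcNode -Name $name
        if ($node.NodeState -eq "Draining") {
            # Forcing offline cancels and requeues the running tasks,
            # including this watchdog, which then reruns on another node
            Set-HpcNodeState -Name $name -State Offline -Force
            # Do nothing for a while so the cluster manager has time to act
            Start-Sleep -Seconds 60
        }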

    Caveat: This may mean you need to grant the users submitting jobs the right to offline nodes. But there are workarounds for that too. You can create a scheduled task that runs as an appropriate user on the compute node, and then grant cluster users the right to start it; the watchdog task just kicks off the scheduled task.
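
    For instance (untested; task name, account and node name are hypothetical), register a scheduled task on each compute node that runs the force-offline command under a privileged account, and let the watchdog start it:

        # One-time setup per compute node, run as an administrator;
        # register it with that node's own name in place of NODE01
        schtasks /Create /TN "ForceOffline" /RU DOMAIN\hpcadmin /RP * /SC ONCE /ST 00:00 /TR "powershell.exe -Command Set-HpcNodeState -Name NODE01 -State Offline -Force"

        # The watchdog then only needs the right to start the task:
        schtasks /Run /TN "ForceOffline"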

    I'd be interested to know if anyone manages to make this untested, off-the-top-of-my-head idea work ;)

    Malcolm

    • Proposed as answer by ma.ma Saturday, October 11, 2014 10:34 PM
    Saturday, October 11, 2014 10:25 PM
  • Hi Malcolm,

    Thanks for taking the time to look into this. I acknowledge that one task per job would work as described, but we have on the order of 3000 similar, independent tasks per job, each running for about 30 minutes, and we are keen to keep them organized as jobs because this serves other purposes (better than the project attribute would).

    I do not see how the watchdog workflow would work if more than one task is running on the node. In my situation, with 32 cores per node, wouldn't it force the watchdog task and 31 normal tasks to be rescheduled?

    The behavior that HPC exhibits might be by design, but I do not understand, and thus do not appreciate, why it has to be that way.

    Any other ideas for making my suggestion possible (e.g. behind the scenes, with a database trigger on the Scheduler database or the like)?

    Best regards,
    Thomas

    Monday, October 13, 2014 8:01 AM