Requeue jobs with node preparation tasks

  • Question

  • Hello,

    I am using HPC 2012.
    I am having trouble requeueing a job that contains node preparation tasks.
    Here is the exact scenario.
    1. The first time the job runs, the node preparation task fails on ALL nodes (I made it fail on all nodes deliberately for the purpose of the test).
    2. I requeue the job from the C# client using the following methods:

    IScheduler.ConfigureJob(jobId);

    IScheduler.SubmitJobById(jobId,...);

    3. The requeue fails (before it has a chance to run the failed node prep task again) with the following exception:

    Microsoft.Hpc.Scheduler.Properties.SchedulerException: This job requires at least 1 cores, but the list of candidate nodes that the Job Scheduler service returned for this job contains only 0 cores. The Job Scheduler service determines the candidate node list using the following job properties: NodeGroup, RequestedNodes, MinMemoryPerNode, MaxMemoryPerNode, MinCoresPerNode, MaxCoresPerNode, and ExcludedNodes. Either reduce the number of resources that the job requires, or redefine the relevant job properties, and then submit the job again.

    4. When I check the status of the node prep task in HPC Cluster Manager, I see that it is canceled with the reason:

    This sub-task was canceled because it could not be requeued along with the rest of the job.  Another sub-task will be created to replace it.

    My understanding was that requeueing a job would requeue all failed tasks in that job, so why is the node prep task not run again?

    Thanks


    Wednesday, July 12, 2017 3:00 PM

All replies

  • Hi,

      This is expected. Because the node prep task failed on every node, all of the resources that failed to run it are now in the job's ExcludedNodeList (you can check it with "job view <jobId> /detailed"). Requeueing the job does not clear this list, so the candidate node list for the requeued job is empty, which is why the exception reports 0 cores. You need to run "job modify <jobid> /clearexcludednodes" (or /removeexcludednodes) before you requeue the job.
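
      Since you are requeueing from the C# client, the same fix can probably be applied in code between ConfigureJob and SubmitJobById. The sketch below is untested and makes a few assumptions: that ISchedulerJob.ClearExcludedNodes is available in your HPC Pack version, that the change has to be committed before resubmitting, and that "HEADNODE" and the job ID argument are placeholders for your own values.

```csharp
using System;
using Microsoft.Hpc.Scheduler;

class RequeueWithClearedExclusions
{
    static void Main(string[] args)
    {
        // Placeholders (assumptions): replace with your head node name and job ID.
        string headNode = "HEADNODE";
        int jobId = int.Parse(args[0]);

        using (Scheduler scheduler = new Scheduler())
        {
            scheduler.Connect(headNode);

            // Move the failed job back to the Configuring state, as in the question.
            scheduler.ConfigureJob(jobId);

            // Clear the nodes that were excluded when the node prep task failed.
            // Assumption: the change is persisted by Commit() before resubmission.
            ISchedulerJob job = scheduler.OpenJob(jobId);
            job.ClearExcludedNodes();
            job.Commit();

            // Resubmit; null credentials are assumed to fall back to cached ones.
            scheduler.SubmitJobById(jobId, null, null);
        }
    }
}
```

      If the node prep task still fails for the same reason, the nodes will simply be excluded again, so fix the underlying failure before resubmitting.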

      The message "This sub-task was canceled because it could not be requeued along with the rest of the job.  Another sub-task will be created to replace it." means that the canceled sub-tasks do not need to be requeued: when the job is requeued, new sub-tasks are generated to run on the resources allocated to the job (this is the defined behavior for a "Node Prep" task).

      Hope this explains why.


    Qiufang Shi

    Thursday, July 13, 2017 6:49 AM