Issues with Adjust resources automatically

  • Question

  • Hello,

    We are using Windows HPC Pack 2012 R2. We have a lower-priority job, J1, which has RunUntilCanceled = true, a minimum of 7 cores and a maximum of 8 cores, and is allowed to be preempted. Because this job never ends, we also add tasks to it dynamically. Another job, J2, has the highest priority, requires 1 core, and has a single task.

    In our scenario, we first schedule J1 on a machine, M, that has 8 cores. J1 starts executing and takes all 8 cores. We then schedule J2 on the same machine M. Our scheduling policy is "Immediate pre-emption with Task level pre-emption", so J2 gets one core to run, and for J1 we see the message "Allocation reduced to 7 cores" in the Activity log. J2 soon finishes its execution and releases the core it acquired.

    The problem is that J1 does not get back the core that J2 freed. This does not seem to be correct behavior. We expect that when all the "adjust resources automatically" options are selected, the lower-priority job should be able to grow its resources when those resources are available.

    Please help us here. There are two more questions below.

    1. How does HPC choose which task to cancel from J1 during pre-emption?

    2. I believe that the task canceled in point 1 above will be rerun by HPC automatically. Is that correct?

    Thanks,

    Puneet


    Puneet Sharma

    Tuesday, April 18, 2017 10:13 AM

All replies

  • Hi,

      For the first problem, it should work, as this is a supported scenario in our system. Please check the following:

    1. In the policy settings, you selected "Queued" mode and checked "Increase Resources automatically".

    2. For the job to grow back to 8 cores, make sure that the job has more than 7 active tasks. When we grow a job, we grow it based on the job's current calculated maximum cores. I suppose that in your case the job does not have enough tasks to absorb the additional core.

    3. Note that there is a grow interval; it usually takes a couple of seconds for the job to grow.
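Point 2 above can be pictured with a small sketch. This is only a toy model, not HPC Pack's actual algorithm: the function name `grow_target` and the assumption that each active task needs exactly one core are hypothetical and chosen purely for illustration.

```python
def grow_target(min_cores, max_cores, active_tasks, cores_per_task=1):
    """Toy model (an assumption, not the scheduler's real logic):
    a job is only grown up to the number of cores its active tasks can use,
    bounded by the job's configured minimum and maximum."""
    demand = active_tasks * cores_per_task
    return max(min_cores, min(max_cores, demand))


# With 7 active tasks the job has no use for an 8th core,
# with 8 or more active tasks it can absorb the freed core.
print(grow_target(7, 8, active_tasks=7))
print(grow_target(7, 8, active_tasks=8))
```

Under this model, a job configured for 7 to 8 cores stays at 7 until an eighth task is active, which matches the check suggested in point 2.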

      For the second question (which task is canceled during preemption), I don't believe this is predictable at the moment, but I can check.

      For the third question, yes, the canceled task will be rerun. However, the system will also take into account:

      - whether the task is rerunnable

      - whether the task's "AutoRequeueCount" has reached the system setting "TaskRetryCount" (default: 3)
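The two conditions can be read as a simple predicate. The sketch below is an interpretation rather than HPC Pack code: the function name `may_be_requeued` is hypothetical, and it assumes that "has reached" means the requeue count is greater than or equal to the retry limit.

```python
def may_be_requeued(is_rerunnable, auto_requeue_count, task_retry_count=3):
    """Interpretation of the two conditions above (an assumption):
    a preempted task is rerun only if it is marked rerunnable and its
    automatic requeue count is still below the cluster-wide retry limit."""
    return is_rerunnable and auto_requeue_count < task_retry_count


print(may_be_requeued(True, 0))    # rerunnable, no requeues so far
print(may_be_requeued(True, 3))    # limit of 3 already reached
print(may_be_requeued(False, 0))   # task is not rerunnable
```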


    Qiufang Shi

    Wednesday, April 19, 2017 1:59 AM
  • Hi Qiufang,

    Thanks for the quick response. I tested automatic growing and shrinking of job resources and found that if the 'RunUntilCanceled' flag is set to true, the job's resources do not grow back automatically after they have been reduced by the higher-priority job. However, if the 'RunUntilCanceled' flag is set to false, the job's resources grow and shrink automatically. Is this observation correct? Why can't a job's resources grow and shrink automatically when the 'RunUntilCanceled' flag is set to true?

    Thanks,

    Puneet


    Puneet Sharma


    Wednesday, April 19, 2017 9:49 AM
  • Okay, this might be a bug :).

    RunUntilCanceled is usually used to reserve resources for tasks that are added later, and we did not include this type of job in our auto grow/shrink testing. I'll log a bug in our system.

    For now, if you really need the job to grow, you can try an alternative solution:

    1. Submit the job with a minimum of 7 and a maximum of 8 cores, as in your case, but with RunUntilCanceled set to false

    2. Add a monitoring task so that the job won't finish automatically

    3. Implement your own logic in the monitoring task, such as watching for some event and then exiting gracefully. Alternatively, you can finish the task through the API from an external agent or your own code. Once the monitoring task ends, the job will finish automatically.
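As a rough illustration of steps 2 and 3, the monitoring task could be a small script that simply waits for an external signal and then exits. The sketch below is only one possible approach and does not use any HPC Pack API: the stop-file mechanism, the file name `stop.signal`, and the function names are all hypothetical.

```python
import os
import sys
import time


def wait_for_stop_signal(stop_file, poll_seconds=5.0, timeout_seconds=None):
    """Block until stop_file exists.

    Returns True when the file is seen, or False if the optional
    timeout expires first.
    """
    deadline = None
    if timeout_seconds is not None:
        deadline = time.monotonic() + timeout_seconds
    while True:
        if os.path.exists(stop_file):
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


def main(argv):
    # Hypothetical convention: the first argument is the path of the stop file.
    stop_file = argv[1] if len(argv) > 1 else "stop.signal"
    wait_for_stop_signal(stop_file)
    return 0  # exit code 0 lets the task end gracefully
```

To run it as the monitoring task's command, the script would end with `sys.exit(main(sys.argv))`. An external agent then creates the stop file when the job should end, the monitoring task exits, and the job can finish once its other tasks are done.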


    Qiufang Shi

    Thursday, April 20, 2017 12:25 AM
  • Thanks Qiufang. Your suggested approach worked. 

    Puneet Sharma

    Thursday, April 20, 2017 8:07 AM