Job holding all of its assigned resources even though most tasks are finished?

  • Question

  • We are running 2012 R2 and have a question about resource allocation. We have 320 workers and kicked off a SOA job with MaxUnits set to 310 (unit = node, exclusive = true) and 600 total tasks. 599 of the 600 tasks finished, but when I check the Job monitor and view jobs, it shows all 310 of the tasks (1.1-1.310) as running. While this was happening, we kicked off a second job with the same configuration as the first, and it received only 10 machines (320 total - 310 still assigned to the first job). Is that correct? I would have expected to see only 1 task running in the first job and 319 machines available to the second job (which would take 310 of them, since that is its configured maximum).


    Thursday, December 26, 2019 6:22 PM

All replies

  • Thanks. Grow/shrink does not appear to be configured. I'll get that set and see if that resolves the issue. We do not explicitly have MinUnits set at all. allocationAdjustInterval is set to 15000. Will enabling EnableGrowShrink allow the job to go over MaxUnits, or just allow it to take on additional resources as they become available, up to the configured MaxUnits?

    PS C:\Program Files\Microsoft HPC Pack 2012\Bin> Get-HpcClusterProperty -AutoGrowShrink

    Name                                     Value
    ----                                     -----
    EnableGrowShrink                         False
    TasksPerResourceUnit                     1
    GrowThreshold                            1
    GrowInterval                             5
    ShrinkInterval                           5
    ShrinkIdleTimes                          3
    ExtraNodesGrowRatio                      1
    GrowByMin                                False
    SoaJobGrowThreshold                      50000
    SoaRequestsPerCore                       20000


    Friday, December 27, 2019 12:56 PM
  • Hi Jason,

    By job grow/shrink, I did not mean the Auto Grow Shrink feature of the cluster. It is a job scheduler behavior that increases or decreases the resource allocation of a job according to the job's tasks (or SOA requests). Please check this doc for details. You may simply open the scheduler options in HPC Cluster Manager or run 'cluscfg listparams' to view these configurations.

    If you have already enabled job grow/shrink and set allocationAdjustInterval for the SOA job to 15,000 ms, the SOA job may need 15 seconds before it shrinks its tasks after the requests are done.

    Regards,

    Yutong Sun

    Monday, December 30, 2019 6:43 AM
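
The numbers in this thread can be checked with a small model. The sketch below is only an illustration of the arithmetic, not the HPC Pack scheduler's actual algorithm. It assumes that a job with shrink disabled keeps every node it was granted, and that a job with shrink enabled drops to one node per unit that still has work once the adjust interval has elapsed. The function names and the `busy_units` parameter are hypothetical.

```python
def nodes_held(max_units, busy_units, shrink_enabled,
               elapsed_ms=0, adjust_interval_ms=15000):
    """Nodes the first job keeps allocated under this simplified model.

    Assumption: shrinking only takes effect when it is enabled and at
    least one adjust interval has passed since the work drained.
    """
    if shrink_enabled and elapsed_ms >= adjust_interval_ms:
        return min(max_units, busy_units)
    return max_units


def nodes_for_second_job(total_nodes, first_job_held, max_units):
    """Nodes a second job can get: the free nodes, capped at its MaxUnits."""
    free = total_nodes - first_job_held
    return min(free, max_units)


TOTAL_NODES = 320  # workers in the cluster, from the question
MAX_UNITS = 310    # MaxUnits of both jobs, from the question

for shrink in (False, True):
    # One unit is still busy (599 of 600 tasks finished); 60 s have elapsed.
    held = nodes_held(MAX_UNITS, busy_units=1, shrink_enabled=shrink,
                      elapsed_ms=60000)
    second = nodes_for_second_job(TOTAL_NODES, held, MAX_UNITS)
    print(f"shrink={shrink}: first job holds {held}, second job gets {second}")
```

Under these assumptions, the case without shrink reproduces what was observed (310 nodes held, 10 left for the second job), and the case with shrink matches what the original poster expected (1 node held, 310 given to the second job).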