Jobs blocking when queuing for different resources

  • Question

  • Hi all,

    Another request for ideas with "strange queuing behaviour" in HPC 2012.

    I have two job templates called "24Core" and "GeneralNodes". Jobs submitted with the 24Core job template are forced to run on nodes in a "24Core" nodegroup, whereas jobs submitted with the "GeneralNodes" job template are forced onto the "OtherNodes" nodegroup - there is no overlap between the nodegroups. So basically I have two sets of computers for different tasks that I want to keep separate.

    I then start with a totally empty cluster, and submit 1000 jobs to the "24Core" job template. The first 96 run (as I have 4 x 24-core nodes), and the rest queue nicely. However, if I then submit a job with the "GeneralNodes" job template, it queues, even though there are plenty of free nodes in the "OtherNodes" group to run it on straight away. It seems the jobs submitted via the 24Core job template have reserved cores on nodes in the "OtherNodes" group, even though the "24Core" job template doesn't let the jobs run on those nodes.
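
    For reference, the submissions are along these lines (with myapp.exe standing in for our actual programs):

        job submit /jobtemplate:24Core myapp.exe
        job submit /jobtemplate:GeneralNodes myapp.exe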

    The cumbersome workaround, which proves what's going on, is to manually lower the priority of the jobs queuing on the 24-core nodes (or raise the priority of the jobs submitted to the OtherNodes group).

    But this shouldn't be necessary. It seems to me that even having separate resources for different templates is broken in HPC 2012, which is a fairly *major* problem for general cluster use. We don't really want a separate head node for every different-purposed group of nodes, but I can't see another way.

    Ideas/help gratefully received as always,

    Wes


    Wednesday, May 1, 2013 4:05 PM

All replies

  • 1000 is a bit of a magic number in that it is the default for BackfillLookAhead.

    Review that feature; you might be running into it.
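
    If I remember the syntax right, you can check and change it from the command line, something like this (I believe 0 means no backfilling at all, but check the docs):

        cluscfg listparams
        cluscfg setparams BackfillLookAhead=0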

    Or you could try making only 850 24Core jobs :)

    d

    Wednesday, May 1, 2013 10:33 PM
  • I can reproduce this problem on Windows Server HPC 2008 R2 as well. Additionally, the job.exe application is crashing due to runtime and CLR errors.

    My test:

    1000 jobs -  job submit /nodegroup:group1 calc.exe

    1000 jobs -  job submit /nodegroup:group2 calc.exe

    Settings:

    Each group has 2 different compute nodes; BackfillLookAhead: 1000, BackfillLoadPeriod: 30, no job template.
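
    For the record, I set those parameters roughly like this (syntax from memory, so please double-check):

        cluscfg setparams BackfillLookAhead=1000 BackfillLoadPeriod=30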


    Daniel Drypczewski

    Wednesday, May 8, 2013 6:58 AM
  • Hi Daryl,

    We have backfilling switched off - but the 1000 jobs figure is arbitrary - the simplest example of the problem arises when there are more jobs queuing than there are total cores on the cluster.

    For example, suppose my cluster has 96 cores as 4 nodes in a 24-core group, and, say, another 200 cores in another group of mixed nodes. If I submit 296 jobs, set to only run on the 24-core group, on an empty cluster, then 96 will run and 200 will queue. If I now submit jobs to run on the "mixed node group", they will queue, even though all 200 cores in that group are idle.

    When the first of the 96 running jobs finishes, another starts up. So now I have 96 jobs busy (in the 24-core group) and 199 jobs queuing for the 24-core nodes. And then only *one* of my "run anywhere" jobs starts running. It looks like the startup of the job targeted at the "24-core" group released a lock on some resources in the "mixed group" - resources which would never have been appropriate for that job - thus allowing the first of my "mixed group" jobs to start running.

    Wes
    Thursday, May 30, 2013 8:07 AM
  • Hi all,

    At risk of being repetitive, can I just emphasise how fundamental and critical this problem is to a cluster manager? A simple restatement of the problem: backfilling is switched off entirely. This is absolutely basic queuing we're talking about.

    NodeGroup A has X cores total.
    NodeGroup B has Y cores total.
    Suppose NodeGroup A is entirely busy running jobs, there are Z more jobs queuing for NodeGroup A, and after those, some jobs queuing for NodeGroup B.
    Then the maximum number of jobs that can start up on NodeGroup B is (Y - Z) - the others will queue for resources that are actually idle.
    The workaround is to manually raise the priority of jobs queuing for idle cores on NodeGroup B above the priority of the Z jobs (sketched below), which is extremely tedious.
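
    The manual bump is something like the following, where <jobId> is a job queuing for NodeGroup B, and AboveNormal is just an example level:

        job modify <jobId> /priority:AboveNormal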

    Can anyone think of any better workarounds to handle this really common behaviour - or is there any hope that this might be addressed in a (preferably urgent) service release?

    Thanks,

    Wes


    Wednesday, June 19, 2013 11:09 AM
  • Hi all,

    Could I "bump" this question again - it is still an issue, and becoming more serious as HPC usage increases in my university department. The problem still exists in HPC 2012 R2, v 4.2.4400.0 - there are more recent updates, but the release notes make no mention of addressing this problem, and upgrading over 100 nodes across two clusters is a major job.

    To restate the problem yet again: suppose you have 5 nodes in group A, 5 nodes in group B, and job templates that allow jobs to run exclusively on those two groups. Now we submit some single-node jobs. Suppose a user submits 6 jobs to group A: 5 start running and 1 queues, which looks OK. But if straight after that a user submits 6 jobs to group B, then only 4 of those jobs start running; the other 2 queue, and one node in group B stays idle. It seems that the 6th job submitted to group A stops a node in group B from taking on jobs, even though it can never actually run on that node, because the job template/node group prevents it.

    The workaround in this example is to raise the priority of just ONE job targeting group B - then the idle node takes on that job. But if I raise the priority of more than one job, then I might get nodes idle in group A because of a job that will only run on group B... so I need to raise the priority of jobs only when there are idle nodes ready to take them. This clearly isn't scalable for the 100+ nodes and 100+ users I look after.

    It is a crucial problem to us, and will soon force our whole department away from MS HPC if we can't address it. Any ideas greatly appreciated.
    Thanks,
    Wes

    Thursday, November 12, 2015 12:56 PM
  • I cannot repro this with HPC Pack 2012 R2 Update 1, version 4.3.4652.0. Could you update to this version, or to Update 2, version 4.4.4864.0, and retry?

    Btw, what's the scheduling policy you were using? Queued mode with graceful preemption? Have you enabled the resource pool feature for the cluster?

    BR,

    Yutong Sun

    Thursday, November 12, 2015 3:10 PM
  • OK - many thanks for testing this. I'll schedule an update to 4.4.4864.0 when I can, and see if things improve.

    Scheduling policy for us is Queued, with no pre-emption, and resource pools are disabled. We just wanted the simplest queue possible.
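
    For repro reference, those settings can be checked with a quick filter on the parameter list, e.g.:

        Get-HpcClusterProperty -Parameter | Where-Object { $_.Name -in 'SchedulingMode','PreemptionType','EnablePools' }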

    Thanks,

    Wes


    Thursday, November 12, 2015 3:15 PM
  • FYI. We are releasing Update 3 this week. Please stay tuned.

    BR,

    Yutong Sun

    Friday, November 13, 2015 2:25 AM
  • Hi,

    I've updated the cluster to U3, and I still have the issue, which I can demonstrate right now.

    We have a large set of "General Nodes" and some "Special Purpose" nodes, in distinct node groups, with different job templates to allow sending jobs to those node groups. You cannot launch a job that can potentially run on either group, because we haven't allowed it in the job templates, and there are no nodes that are members of both groups. Basically, we want totally separate access to the two groups.

    Right now, users have submitted several thousand single-core jobs for my "General Nodes", which are all fully busy, but my "Special Purpose" nodes are completely idle. I've launched a single-core job to run on my "Special Purpose" nodes, and it is just sitting there Queuing, even though there is no one using those nodes, and no job earlier in the queue will ever use them either.

    It will sit there until the number of jobs still queuing for the "General Nodes" queue is less than the number of idle cores I have in the "Special Purpose" queue. Then it will start running. If I raise the priority of my "Special Purpose" job, then it will run immediately, but that's a hacky way to solve it. I'd rather HPC handled queues for distinct resources properly.

    Wes


    Saturday, January 9, 2016 9:00 AM
  • Hi Wes,

    Could you try the backfilling option "Allow backfilling from the entire queue" and see if it works for you? If not, could you run 'Get-HpcClusterProperty -Parameter' and post the parameter list for repro reference?
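
    If you prefer the command line, I believe the equivalent is to set BackfillLookAhead to -1 (the "entire queue" setting):

        cluscfg setparams BackfillLookAhead=-1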

    BR,

    Yutong Sun

    • Proposed as answer by WesHinsley2 Wednesday, February 10, 2016 12:53 PM
    Monday, January 11, 2016 12:47 PM
  • Hi,

    I've set the backfilling option - although I should point out we don't have maximum job times on any of the jobs, and we don't want to impose those.

    Below is the output from Get-HpcClusterProperty -Parameter.

    Thanks, W.

    ---

    PS C:\Windows\system32> Get-HpcClusterProperty -Parameter

    Name                                     Value
    ----                                     -----
    SpoolDir                                 \\FI--DIDEMRCHNB\CcpSpoolDir
    AllowNewUserConnections                  True
    InactivityCount                          500
    InactivityCountAzure                     500
    HeartbeatInterval                        600
    SubmissionFilterProgram
    SubmissionFilterTimeout                  15
    ActivationFilterProgram
    ActivationFilterTimeout                  15
    TtlCompletedJobs                         14
    JobRetryCount                            0
    TaskRetryCount                           0
    BackfillLookAhead                        -1
    BackfillLoadPeriod                       30
    NodeReleaseTaskTimeout                   15
    AutomaticGrowthEnabled                   False
    AutomaticShrinkEnabled                   False
    PreemptionType                           None
    AffinityType                             NonExclusiveJobs
    JobCleanUpHour                           2
    SchedulingMode                           Queued
    PreemptionBalancedMode                   Immediate
    TaskCancelGracePeriod                    15
    PriorityBiasLevel                        0
    PriorityBias                             MediumBias
    ReBalancingInterval                      10
    DefaultHoldDuration                      900
    ExcludedNodesLimit                       10
    EmailNotificationEnabled                 True
    EmailSmtpServer                          automail.cc.ic.ac.uk
    EmailFromAddress                         dide-monitoring@imperial.ac.uk
    EmailUseSsl                              False
    DisableCredentialReuse                   False
    HpcSoftCard                              Disabled
    HpcSoftCardTemplate
    SoftCardExpirationWarning                5
    SchedulerWebServicePort                  443
    SchedulerWebServiceEnabled               False
    SchedulerWebServiceThumbprint
    SchedulerWebServiceAuth                  Basic
    EnablePools                              False
    GrowByPreemptionEnabled                  True
    TaskImmediatePreemptionEnabled           True
    NettcpOver443                            True
    AzureLogsToBlob                          Disabled
    AzureLogsToBlobThrottling                1
    AzureLogsToBlobInterval                  5
    GetAzureBatchTaskOutput                  False
    ScanAzureBatchTaskWithoutFilterInterval  60
    CollectCounters                          True
    MinuteCounterRetention                   3
    HourCounterRetention                     30
    DayCounterRetention                      180
    AzureStorageConnectionString
    AzureLoggingEnabled                      False
    AzureMetricsCollectionEnabled            False
    AzureMetricsJobStatisticsDelayMinutes    5
    ClusterId                                7da32208-a327-4bf0-8125-ad25d9556f68
    TtlCompletedRuns                         5
    RunCleanUpHour                           3
    ConcurrencyTestRunNumber                 5
    DataExtensibilityEnabled                 True
    DataExtensibilityTtl                     365
    AllocationHistoryTtl                     5
    OperationArchive                         7
    OperationRetention                       180
    AzureIaaSMetricsCollectionEnabled        True
    ReportingDbSize                          3051.56 MB
    RestoreMode                              False

    Thursday, January 14, 2016 5:35 PM
  • Hi,

    An opportunity to test this came today, and indeed, turning on "Allow backfilling from the entire queue" worked - EVEN THOUGH maximum runtime is not set on any of the job templates.

    More detail: I had over 20,000 jobs - 800 running, the rest queuing - for my "GeneralNodes" group, and an entirely idle "20core" group. I submitted a job to the 20core group. With "Do not allow backfilling" ticked, the job kept waiting, whereas with "Allow backfilling from the entire queue" ticked, it ran, despite all the queued jobs ahead of it (for different resources).

    So, I'm not sure whether this is "expected behaviour" or a "workaround" - I didn't think backfilling was for this purpose, and I didn't expect it to work anyway without the maximum run time set. But as a workaround, it seems to do the trick.

    Thanks,
    Wes

    Wednesday, February 10, 2016 12:53 PM