Setting to set in Job scheduling so that if the job is not getting the machine for x minutes move to a new machine

  • Question

  • Dear All,

    We have enabled auto grow shrink for Azure nodes and it is working very well. However, we have an issue on the Azure infrastructure side: allocation failures sometimes happen because of a resource crunch in the cluster, so the job does not get executed even though a node has been allocated to the HPC cluster.

    Is there a setting in HPC Pack so that, if a job has not started on a machine within x minutes, it can be moved to a different machine? That way, even if an allocation failure sometimes happens on the Azure infrastructure side, the job would wait x minutes and, if the machine has not come online, a different machine would be allocated and the job could run there.
    Thursday, April 12, 2018 6:50 AM

All replies

  • Hi, Chandramohandreddy,

      I don't understand what you mean by "resource crunch". Does it mean the node stays in the provisioning state in the HPC Pack cluster? If so, I suppose the provisioning will eventually time out and the node will then be shrunk by our service.

      There are workarounds for these issues:

    Option 1: Grow extra resources through the auto grow shrink settings. See https://docs.microsoft.com/en-us/azure/virtual-machines/windows/classic/hpcpack-cluster-node-autogrowshrink , which describes the following option:

    ExtraNodesGrowRatio - Additional percentage of nodes to grow for Message Passing Interface (MPI) jobs. The default value is 1, which means that HPC Pack grows nodes 1% for MPI jobs.

    If all nodes provision successfully, the extra nodes will stay idle for 2 minutes and then be reclaimed by the auto grow shrink service.
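To get a feel for how many spare nodes a given ExtraNodesGrowRatio buys, here is a small Python sketch of the arithmetic. It is only an illustration: the function name nodes_to_grow is hypothetical, and rounding the extra nodes up is an assumption, since the documentation quoted above does not say how HPC Pack rounds.

```python
import math


def nodes_to_grow(required_nodes: int, extra_ratio_percent: float = 1.0) -> int:
    """Estimate the total nodes requested for a job.

    required_nodes: nodes the job actually needs.
    extra_ratio_percent: the ExtraNodesGrowRatio value (a percentage).
    ASSUMPTION: the extra node count is rounded up; HPC Pack may round differently.
    """
    if required_nodes < 0 or extra_ratio_percent < 0:
        raise ValueError("inputs must be non-negative")
    extra = math.ceil(required_nodes * extra_ratio_percent / 100)
    return required_nodes + extra


if __name__ == "__main__":
    # Compare the default ratio of 1 with a larger safety margin of 25.
    for ratio in (1, 25):
        print(f"ratio={ratio}% -> request {nodes_to_grow(16, ratio)} nodes for a 16-node job")
```

Under that rounding assumption, a larger ratio gives more headroom against nodes that fail to provision, at the cost of paying for spare nodes until they are reclaimed.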

    Option 2: You mentioned "move the job to a different machine". I suppose you have specified node names for your job? Otherwise, the HPC Pack scheduler will schedule the job on any resource that is available instead of waiting for a particular machine. Since all machines in the cloud should be identical, you should just specify the resources in the job (or auto-auto if you're not running an MPI job), so that no job gets stuck because a machine was not provisioned in time.

    Qiufang Shi

    Tuesday, April 17, 2018 7:51 AM