Windows HPC auto grow and shrink algorithm

  • Question

  • Dear all,

    Could anybody explain the internal algorithm used by the Windows HPC Pack auto grow and shrink feature on Azure (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/classic/hpcpack-cluster-node-autogrowshrink)? Is there any documentation describing this algorithm?

    Thanks,

    Puneet 


    Puneet Sharma


    Tuesday, January 9, 2018 8:49 PM

All replies

  • This doc hasn't been updated to the latest version yet. I'll simply describe the algorithm here; if you have further questions, feel free to reach us at hpcpack@microsoft.com.

    Shrink Logic: Every ShrinkInterval (default 5 minutes), the service checks all Azure nodes, excluding nodes in the "ExcludeNodeGroups" specified by the cluster admin. If no job or task is running on a node, the node is marked as idle. Once a node has been marked as idle ShrinkIdleTimes times (default 3), it is torn down and moved to the "Not-Deployed" state.
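
    As a rough illustration of the shrink check described above, here is a minimal Python sketch of one pass, meant to be run once per ShrinkInterval. It is not the HPC Pack implementation: the `Node` class, the `shrink_pass` function, the "Online" state name, and the choice to reset the idle counter whenever a node is busy are all assumptions made for the example.

```python
from dataclasses import dataclass

SHRINK_IDLE_TIMES = 3  # default ShrinkIdleTimes quoted in the post


@dataclass
class Node:
    # Hypothetical node record, not an HPC Pack type.
    name: str
    group: str
    running_work: int = 0   # jobs/tasks currently running on the node
    idle_count: int = 0     # how many checks have found the node idle
    state: str = "Online"


def shrink_pass(nodes, exclude_groups, idle_times=SHRINK_IDLE_TIMES):
    """Run one shrink check and return the names of nodes torn down."""
    torn_down = []
    for node in nodes:
        # Nodes in excluded groups, or not online, are never considered.
        if node.group in exclude_groups or node.state != "Online":
            continue
        if node.running_work > 0:
            node.idle_count = 0  # assumption: a busy node resets its counter
            continue
        node.idle_count += 1
        if node.idle_count >= idle_times:
            node.state = "Not-Deployed"
            node.idle_count = 0
            torn_down.append(node.name)
    return torn_down
```

    With the defaults, a node in this sketch is torn down on the third check that finds it idle.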

    Grow Logic: The service will:

    1. First, read the current job queue in order (Queued and Running jobs). Every job has a Minimum Required Resource, a Maximum Required Resource, and its currently used resource. (If GrowByMin is set to $false, we read only the Maximum Required Resource so that the job can finish as soon as possible.) From these values we calculate ResourceNeedToGrowForJob (it may be specified with node group info).

    2. Check the "Provisioning" and "Not-Deployed" Azure resources, pick as many of them as the ResourceNeedToGrowForJob calculated in step 1, and mark them to grow.

    3. Repeat steps 1 and 2 until there is no job left to read or there are no more "Provisioning" and "Not-Deployed" resources to mark. Then kick off the "Grow", which starts the "Not-Deployed" Azure nodes if they are already deployed, or provisions them if this is the first time they are started. The nodes are brought online when they are ready.
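
    The three grow steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Job` class and both function names are hypothetical, resources are counted as whole nodes, node group info is ignored, and ResourceNeedToGrowForJob is assumed to be the target (minimum or maximum, depending on GrowByMin) minus the currently used resource.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record, not an HPC Pack type.
    name: str
    min_resource: int
    max_resource: int
    used_resource: int = 0


def resource_need_to_grow(job, grow_by_min=True):
    """Step 1: how many more resources this job still needs (assumed formula)."""
    target = job.min_resource if grow_by_min else job.max_resource
    return max(0, target - job.used_resource)


def plan_grow(jobs, available_nodes, grow_by_min=True):
    """Steps 2 and 3: walk the queue in order and mark nodes to grow.

    `jobs` is the queue in order (on-hold or dependent jobs already removed);
    `available_nodes` lists "Provisioning" / "Not-Deployed" node names.
    """
    pool = list(available_nodes)
    marked = []
    for job in jobs:
        if not pool:
            break  # nothing left to mark
        need = resource_need_to_grow(job, grow_by_min)
        take = min(need, len(pool))
        marked.extend(pool[:take])
        del pool[:take]
    return marked


queue = [Job("render", 2, 4, used_resource=1), Job("mpi", 3, 3)]
print(plan_grow(queue, ["n1", "n2", "n3", "n4", "n5"]))
```

    In this sketch the returned list is what the final "Grow" operation would then start or provision.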

    Please note that:

    1. The scheduler knows the minimum and maximum resources needed for a job. If a job is on hold or depends on other jobs, the service skips it.

    2. Provisioning can fail when you deploy hundreds of nodes. If you also have a big job to run, say an MPI job with both its minimum and maximum nodes set to 100, you may want to configure "ExtraNodesGrowRatio" (the default is 1, meaning that if the job requires 100 nodes, we will try to deploy 101 nodes).
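
    To make the "ExtraNodesGrowRatio" arithmetic concrete, here is a small Python sketch. It assumes the ratio is a whole-number percentage of the required node count and that the extra count is rounded up; the post only confirms the single data point of 100 nodes becoming 101 with the default of 1, so the rounding rule and the `nodes_to_deploy` name are assumptions.

```python
def nodes_to_deploy(required_nodes, extra_nodes_grow_ratio=1):
    """Required nodes plus a percentage of spares (assumed interpretation)."""
    # Integer ceiling of required_nodes * ratio / 100.
    extra = (required_nodes * extra_nodes_grow_ratio + 99) // 100
    return required_nodes + extra


print(nodes_to_deploy(100))  # 101, matching the example in the post
```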


    Qiufang Shi


    Wednesday, January 10, 2018 6:36 AM
  • Worth noting: after you enable auto grow shrink with "Set-HPCClusterProperty -EnableGrowShrink 1", you will see the grow and shrink decision logs in ClusterManagerGUI -> ResourceManagement pane -> Operations -> AzureOptions.

    Qiufang Shi


    Wednesday, January 10, 2018 6:39 AM