Azure HPC Cluster AutoGrowShrink Not Triggering

  • Question

  • I have set up an HPC Pack cluster in Azure with 1 head node and 2 compute nodes, and it is working fine: I can submit a job from Excel and get the results back perfectly.

    Now I am trying to auto-scale the cluster. I have uploaded the management certificate to the Azure subscription and updated the certificate store and registry on the head node, as described here:

    https://azure.microsoft.com/en-in/documentation/articles/virtual-machines-windows-classic-hpcpack-cluster-node-autogrowshrink/

    I have set the AutoGrowShrink properties using PowerShell, and they appear to be configured correctly:

    PS C:\Program Files\Microsoft HPC Pack 2012\Bin> Get-HpcClusterProperty -AutoGrowShrink
    
    Name                                     Value
    ----                                     -----
    EnableGrowShrink                         True
    TasksPerResourceUnit                     1
    GrowThreshold                            1
    GrowInterval                             3
    ShrinkInterval                           5
    ShrinkIdleTimes                          3
    ExtraNodesGrowRatio                      1
    GrowByMin                                True
    SoaJobGrowThreshold                      50000
    SoaRequestsPerCore                       20000
    
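    For reference, a sketch of how these values would typically have been set, using Set-HpcClusterProperty on the head node (the parameter names are assumed to mirror the property names shown above):

    ```powershell
    # Sketch: enabling AutoGrowShrink from HPC PowerShell on the head node.
    # Parameter names are assumed to mirror the property names listed above.
    Set-HpcClusterProperty -EnableGrowShrink 1
    Set-HpcClusterProperty -GrowThreshold 1 -GrowInterval 3
    Set-HpcClusterProperty -ShrinkInterval 5 -ShrinkIdleTimes 3
    ```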

    However, when I submit a job that requires more cores, it stays queued forever and no new compute nodes are added to the HPC cluster (AutoGrowShrink is not triggering). Any idea what I may be missing?

    Wednesday, October 5, 2016 1:20 PM

Answers

  • The issue turned out to be related to the Azure load balancer timing out.

    We finally used an Azure storage queue for communication between the HPC client and the head node; the queue can be specified in the open-session command.
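    A sketch of that change in the Excel VBA client, based on the OpenSession call quoted later in this thread (only useAzureQueue differs; the other argument names come from that quote):

    ```vb
    ' Sketch: route HPC client/broker traffic through an Azure storage queue
    ' instead of the direct, load-balanced TCP connection.
    HPCExcelClient.OpenSession headNode:=HPC_ClusterScheduler, _
        remoteWorkbookPath:=HPCWorkbookPath, _
        useAzureQueue:=True, _
        minResources:=12, maxResources:=16
    ```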

    Thursday, December 8, 2016 10:17 PM

All replies

  • Hi, Sajad,

    HPC AutoGrowShrink cannot automatically add new compute nodes to the cluster. You need to add the nodes to your HPC IaaS cluster first; then, with AutoGrowShrink enabled, it can stop the nodes when there are no jobs and start them again when new jobs arrive.

    When a node is stopped, the Azure VM is in the Stopped (Deallocated) state, so you are no longer charged for the virtual machine.

    How did you deploy your HPC cluster: with an ARM template, or with the deployment script?

    If you used the deployment script, you can add nodes the same way:

    https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-windows-classic-hpcpack-cluster-powershell-script/

    or use the inbox script to add nodes: https://blogs.technet.microsoft.com/windowshpc/2014/12/02/automating-hpc-cluster-deployments-in-azure-iaas-part-ii-azure-vm-nodes-management/

    If you used an ARM template, you can add nodes with the same template.

    Thanks,

    Yongjun

    Saturday, October 8, 2016 1:59 AM
  • Thanks Yongjun!

    I was able to figure this out. However, I am facing a different issue now.

    I have AutoGrowShrink enabled on a 50-node cluster. When I submit the job (Excel offloading), it starts provisioning the nodes, executes the tasks, and so on. On the head node the job reaches 100 % complete, but its state remains Running, and Excel on the client machine never receives an update.

    However, when I run the job while the compute nodes are already provisioned and online, everything works perfectly.

    Initially I was getting a timeout exception, which I resolved by increasing the Client timeout and Client Idle timeout broker settings. Now there are no timeout exceptions, but the client doesn't receive any responses either; it is in a kind of hung state, and the HPC_Merge sub, which processes the server responses, is never called.

    Any idea what the issue could be?


    Monday, October 24, 2016 7:33 AM
  • Hi, Sajad,

    We will try to reproduce this issue in our local environment and investigate further; we will respond when we have some findings.

    Monday, October 24, 2016 12:23 PM
  • Hi, Sajad,

    Maybe the issue is that the node IPs change after a grow. If possible, can you run the following test:

    1. Grow the nodes.

    2. Temporarily disable AutoGrowShrink.

    3. Set a static internal IP for each node. This can be done through Azure PowerShell; what is your deployment, an ARM template or our deployment script?

    For classic VMs, you can refer to https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-reserved-private-ip/

    For ARM VMs, you can refer to https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-static-private-ip-arm-ps/

    You can add a static IP to an existing VM.

    4. Enable AutoGrowShrink again, then verify whether the job can finish after a grow.
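    For step 3 with classic VMs, a sketch using the classic Azure PowerShell module (the cloud service name, VM name, and IP below are placeholders):

    ```powershell
    # Sketch (classic deployment model): pin a static internal IP on an existing VM.
    # "MyService", the VM name, and the IP address are placeholders.
    Get-AzureVM -ServiceName "MyService" -Name "MyComputeNode" |
        Set-AzureStaticVNetIP -IPAddress "192.168.0.147" |
        Update-AzureVM
    ```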

    Any questions, please let me know

    Tuesday, October 25, 2016 8:12 AM
  • I set a static internal IP for all 50 compute nodes, but it didn't help; I am still facing the same issue.

    It may be noteworthy that when I tested with the sample Excel workbook provided by Microsoft, ConvertiblePricing_Complete.xlsb (https://www.microsoft.com/en-us/download/details.aspx?id=2939), I hit exactly the same issue when running on all the cores (200 cores, 50 nodes): the job completed on the HPC cluster, but the results were not returned to the client workbook. However, when I changed it to just 16 cores (4 nodes, as shown below), it worked perfectly. It provisioned the compute nodes, did the calculations, and sent the results to the client workbook successfully.

    HPCExcelClient.OpenSession headNode:=HPC_ClusterScheduler, remoteWorkbookPath:=HPCWorkbookPath, useAzureQueue:=False, minResources:=12, maxResources:=16

    However, when I tried the same number of cores/settings in my own .xlsm file, it didn't work and I encountered the same issue: the job completes 100 percent on the cluster but never finishes, hanging in the Running state.
    Thursday, October 27, 2016 10:09 AM
  • Hi, Sajad,

    Thanks for verifying with static IPs. After a grow, can you open the hosts file (under C:\Windows\System32\drivers\etc) on the head node and check whether the IP of each compute node is correct?

    We will also try to reproduce this issue in our environment using ConvertiblePricing_Complete.xlsb.

    Friday, October 28, 2016 1:23 AM
  • Hi Sajad,

    If I understand correctly, this issue of running jobs without responses happens only during the AutoGrowShrink time period? Once all 50 nodes are started and online, the Excel workbook and the session job complete with all 200 cores?

    We may need the broker worker logs on the head node (which is also the broker node) to investigate why the responses were not sent back to the Excel client. Could you log on to the head node VM and do the following to collect the broker worker logs?

    1. Open the folder %ccp_data%LogFiles\SOA, and copy all the HpcBrokerWorker_*.bin files over to a local folder.

    2. Zip the folder and send it to me via yutongs@microsoft.com

    If there are too many log files, you may delete all the HpcBrokerWorker_*.bin files on the head node, reproduce the issue, and then copy the newly generated HpcBrokerWorker_*.bin files. Last but not least, let me know the session job ID of the problematic session.
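    The collection steps above can be sketched in PowerShell (the destination folder C:\Temp\SOALogs is a placeholder):

    ```powershell
    # Sketch: gather broker worker traces into a local folder and zip them.
    # Compress-Archive requires PowerShell 5.0 or later.
    New-Item -ItemType Directory -Path "C:\Temp\SOALogs" -Force | Out-Null
    Copy-Item "$env:CCP_DATA\LogFiles\SOA\HpcBrokerWorker_*.bin" -Destination "C:\Temp\SOALogs"
    Compress-Archive -Path "C:\Temp\SOALogs\*" -DestinationPath "C:\Temp\SOALogs.zip"
    ```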

    Regards,

    Yutong Sun

    Friday, October 28, 2016 2:25 AM
  • In the hosts file (under C:\Windows\System32\drivers\etc) on the head node, I see that the IP of each compute node is listed twice. Is that an issue?

    ...
    192.168.0.147            Enterprise.TWRS02CN-1047       #HPC
    192.168.0.148            Enterprise.TWRS02CN-1048       #HPC
    192.168.0.149            Enterprise.TWRS02CN-1049       #HPC
    192.168.0.4              Enterprise.TWRS02HN            #HPC
    ...
    192.168.0.147            TWRS02CN-1047                  #HPC
    192.168.0.148            TWRS02CN-1048                  #HPC
    192.168.0.149            TWRS02CN-1049                  #HPC
    192.168.0.4              TWRS02HN                       #HPC

    Friday, October 28, 2016 7:27 AM
  • No, that is not an issue; the hosts file is maintained by the HPC management service.

    I just wanted to check whether it contains the Azure nodes, and whether, after you grow to 50 nodes and the job fails, the IP of each Azure node in the hosts file is still correct.

    Before you set a static IP for each Azure node, the IPs were dynamic, so an IP could change whenever a node was shrunk and then grown again. HPC needs about 30 seconds to update the hosts file, so a job dispatched to an Azure node before that update could hit this issue; but that is just our guess.

    That is why we asked you to set a static IP on each node, to rule out IP changes. Since the issue persists, please send the logs to Yutong, and we will try to reproduce the issue locally and investigate further.

    By the way, what is your HPC Pack version? You can open HPC Cluster Manager and go to Help -> About; it shows both the client and server versions.

    Friday, October 28, 2016 7:45 AM
  • Yes, the issue of not getting responses back occurs only when all nodes are deallocated and AutoGrowShrink has to start them and bring them online. With the sample workbook (ConvertiblePricing_Complete.xlsb) this happens whenever the job runs on more than 16 cores; on 16 cores it works. With our own workbook it doesn't actually matter: it fails every time, even when 16 or fewer cores are specified.

    I have sent the zip of %ccp_data%LogFiles\SOA\HpcBrokerWorker_*.bin files to yutongs@microsoft.com.

    Session Job Id is 34.

    Just FYI,

    When the job got stuck at 100 % (status Running), I cancelled it first and then zipped the HpcBrokerWorker_*.bin files.

    Friday, October 28, 2016 8:08 AM
  • The issue turned out to be related to the Azure load balancer timing out.

    We finally used an Azure storage queue for communication between the HPC client and the head node; the queue can be specified in the open-session command.

    Thursday, December 8, 2016 10:17 PM