Retrieving Windows HPC node health programmatically

  • Question

  • Hello,

    I am currently using the Microsoft HPC API. I can get the node state, but I cannot find any way to retrieve node health programmatically. In particular, I want to know whether a node's health is "Transitional". Is there any way to get this information programmatically?

    Thanks,

    Puneet


    Puneet Sharma

    Wednesday, October 11, 2017 2:24 AM

Answers

  • The "Draining" state during the online-to-offline transition means the node is waiting for the jobs running on it to finish.

    Going from offline to online, there is no intermediate state.

    With your approach, the challenge will be determining which nodes to start and bring online, and how many are needed, especially when jobs request different node groups. For simple batch job scenarios, though, it should just work.

    In my opinion, implementing Start-HPCIaaSNode.ps1 and Stop-HPCIaaSNode.ps1 for Google Cloud nodes is the simplest and fastest way, since you will need that functionality yourself anyway.


    Qiufang Shi

    Friday, October 13, 2017 1:57 AM

All replies

  • Hi Sharma,

      "Node health" is a cluster admin concept. You should use the PowerShell cmdlet Get-HpcNode to get this information. The admin API is currently not publicly documented.

      Since you're using the HPC API (I assume the C# API), you should understand the following node properties:

    - Node State: Online/Offline/Draining  --> When a node is offline, it is not available for job scheduling, for example because it is about to be updated or patched.

    - Node Reachable/Unreachable --> If a node is unreachable while it is online, the scheduler will not schedule jobs on it either.

    Usually, a "Transitional" node health maps to the "Draining" node state. But why do you want to know the "Transitional" health state?


    Qiufang Shi

    Wednesday, October 11, 2017 4:33 AM
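
    The two node properties described in this reply can be combined into a single availability check. The following is only a minimal Python sketch of that rule, not HPC Pack code: the `Node` record and its `state` and `reachable` fields are hypothetical names chosen for illustration, and a real implementation would populate them from whatever the HPC API or Get-HpcNode returns.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    # State names as listed in the reply above.
    ONLINE = "Online"
    OFFLINE = "Offline"
    DRAINING = "Draining"


@dataclass(frozen=True)
class Node:
    # Hypothetical record; field names are illustrative, not HPC Pack API names.
    name: str
    state: NodeState
    reachable: bool


def is_schedulable(node: Node) -> bool:
    """Return True only for nodes that are both online and reachable."""
    return node.state is NodeState.ONLINE and node.reachable


nodes = [
    Node("node01", NodeState.ONLINE, True),
    Node("node02", NodeState.ONLINE, False),   # online but unreachable
    Node("node03", NodeState.DRAINING, True),  # finishing jobs before going offline
    Node("node04", NodeState.OFFLINE, True),
]
schedulable = [n.name for n in nodes if is_schedulable(n)]
print(schedulable)
```

    In this sketch a Draining node is treated as unavailable for new work, on the assumption that it is already on its way to the offline state.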
  • Hi Qiufang,

    Thanks for the quick reply. Let me explain my problem in detail.

    We have a long-running job on the head node that is responsible for starting and stopping Google Cloud nodes based on the job queue size. The algorithm we are designing for this is very basic.

    To achieve this, our job does the following:

        (1) If a node is online and idle, we first take the HPC node offline and then stop the associated Google Cloud VM.

        (2) If a node is offline and selected for job execution, we first bring the node online and then start the associated Google Cloud VM.

    An HPC node doesn't go directly to the offline or online state. There is an intermediate state during the transition: for example, a node is in the Draining state while it moves from online to offline. Unfortunately, when a node goes from offline to online, I did not find any intermediate node state, but I observed that the node's health is Transitional. That's why I want to capture the Transitional health.

    In a nutshell, I want to get the intermediate states of nodes as they go from online to offline or vice versa, so that I can exclude those nodes when considering new jobs. How can I achieve this?

    Thanks,

    Puneet

    Wednesday, October 11, 2017 3:22 PM
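
    The two rules in this post, plus the wish to skip nodes that are mid-transition, can be sketched as a small decision function. This is only an illustrative Python outline under stated assumptions: `NodeInfo`, `plan_action`, `plan`, and the `nodes_needed` counter (how many extra nodes the queue currently calls for) are hypothetical names, and shrinking only when `nodes_needed` is zero is an added assumption rather than something stated in the post. Actually changing node state or starting and stopping VMs is left out.

```python
from enum import Enum
from typing import Dict, List, NamedTuple


class Action(Enum):
    SHRINK = "take node offline, then stop VM"   # rule (1) in the post
    GROW = "bring node online, then start VM"    # rule (2) in the post
    SKIP = "leave node alone"


class NodeInfo(NamedTuple):
    # Hypothetical snapshot of one node; not an HPC Pack type.
    name: str
    state: str    # "Online", "Offline" or "Draining"
    health: str   # e.g. "OK" or "Transitional"
    idle: bool    # True when no job is running on the node


def plan_action(node: NodeInfo, nodes_needed: int) -> Action:
    """Decide what to do with one node given how many more nodes are needed."""
    # Nodes in an intermediate state are never touched.
    if node.state == "Draining" or node.health == "Transitional":
        return Action.SKIP
    if node.state == "Offline" and nodes_needed > 0:
        return Action.GROW
    # Assumption: only shrink when the queue needs no additional nodes.
    if node.state == "Online" and node.idle and nodes_needed == 0:
        return Action.SHRINK
    return Action.SKIP


def plan(nodes: List[NodeInfo], nodes_needed: int) -> Dict[str, Action]:
    """Plan one pass over the cluster, growing at most nodes_needed nodes."""
    actions = {}
    for node in nodes:
        action = plan_action(node, nodes_needed)
        if action is Action.GROW:
            nodes_needed -= 1
        actions[node.name] = action
    return actions


cluster = [
    NodeInfo("n1", "Online", "OK", True),
    NodeInfo("n2", "Offline", "OK", True),
    NodeInfo("n3", "Offline", "Transitional", True),
    NodeInfo("n4", "Draining", "OK", False),
    NodeInfo("n5", "Offline", "OK", True),
]
grow_plan = plan(cluster, nodes_needed=1)
shrink_plan = plan(cluster, nodes_needed=0)
for name, action in grow_plan.items():
    print(name, action.name)
```

    Keeping the decision separate from the side effects makes it easy to check the rules on sample data before wiring them to real nodes.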
  • Hi Sharma,

      I assume you're implementing auto grow shrink for another cloud. To do this, you should:

    - Make use of the built-in auto grow shrink scripts (under %CCP_HOME%bin):

    1. AzureAutoGrowShrink.ps1, which you can modify as needed. This script should call only the *-HPCIaaSNode.ps1 scripts below to manipulate nodes.

    2. Start-HPCIaaSNode.ps1, Stop-HPCIaaSNode.ps1, Remove-HPCIaaSNode.ps1, Add-HPCIaaSNode.ps1

    Once you implement your own *-HPCIaaSNode.ps1 scripts for Google Cloud, you should get auto grow shrink based on the job queue.

    In a PowerShell script, you can also get the node health state through:

    - Add-PSSnapin Microsoft.HPC

    - Get-HpcNode


    Qiufang Shi

    Thursday, October 12, 2017 2:08 AM
  • Hi Qiufang,

    For this project we wanted to implement something faster, so we decided on this job-based approach. Understanding and editing multiple PowerShell scripts might take us some time. We can do that in a future phase, but for now we are going with the job-based approach.

    Could you give your opinion on our approach: is it doable, or do you see challenges? In particular, could frequently marking nodes online/offline have a negative impact on HPC? It would be great if you could highlight any concerns, so that we can either address them in our implementation or work around them for the time being.


    Thanks,

    Puneet


    Thursday, October 12, 2017 1:30 PM
  • Thanks a lot, Qiufang. I agree with you. We will proceed with the simple batch job approach for now and, in parallel, work on the PowerShell script modifications on our end.


    Puneet Sharma

    Friday, October 13, 2017 1:40 PM