HN and ComputeNodes go offline when the servers are restarted

  • Question

  • Is there a cluster property that moves the HN and ComputeNodes from Offline to Online when the corresponding servers are restarted? Currently, these nodes always go into Offline state when the servers are restarted. 
    Tuesday, October 16, 2018 1:16 AM

All replies

  • The cluster should not take nodes offline after they are restarted. Could you reproduce this issue and send us the three HpcManagement_AA_<index>.bin files with the latest indexes? The bin files are in %CCP_DATA%LogFiles\Management\ on the head node.
    Tuesday, October 16, 2018 2:02 AM
  • Here are the bin files

    • Edited by SRIRAM R Tuesday, October 16, 2018 1:43 PM
    Tuesday, October 16, 2018 1:41 PM
  • Thanks for sharing the logs.

    Could you point out which node was automatically taken offline, and when this happened?

    I found that the head node LCO-TV-CTVHPC had been automatically taken offline at 12:23 UTC on Oct. 14. Per the log, this was because the physical memory of the machine had changed. After that, I didn't see any other automatic offline operation.

    Wednesday, October 17, 2018 3:17 AM
  • LCO-TV-CTVHPC is the HN; LXN-AV-CTV1, LCO-TV-TFS and LCO-TV-CTV1 are Compute Nodes.

    I am told that these are all VMs in Hyper-V and that memory allocation to these servers is dynamic. Perhaps when a restart happens, the servers start off with less memory, which then grows based on usage. Is that when the 'memory change' is detected and the nodes are taken offline? Should the nodes not 'adapt' to such changes?

    Thursday, October 18, 2018 2:00 AM
  • That should be the reason. If the hardware (such as CPU or Memory) is changed for an online node, it will be automatically taken offline to apply the changes. 

    Could you use VMs with static memory?
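
    For reference, a minimal sketch of switching a Hyper-V VM from dynamic to static memory. It assumes the Hyper-V PowerShell module on the Hyper-V host; the VM name and the 8GB size are placeholders, not values taken from this cluster. The VM has to be stopped before dynamic memory can be disabled.

    ```powershell
    # Run on the Hyper-V host, not on the HPC head node.
    # "MyComputeNodeVM" and 8GB are hypothetical placeholders.
    $vmName = "MyComputeNodeVM"

    # Dynamic memory can only be turned off while the VM is off
    Stop-VM -Name $vmName

    # Disable dynamic memory and assign a fixed amount of RAM
    Set-VMMemory -VMName $vmName -DynamicMemoryEnabled $false -StartupBytes 8GB

    Start-VM -Name $vmName
    ```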

    Monday, October 22, 2018 2:17 AM
  • I don't think that's possible. We are using one of the memory management features of Hyper-V, another fantastic product from Microsoft. It's a bit surprising that Microsoft HPC Server does not work well with that feature enabled.

    That said, could you possibly take this as feedback (and include a patch from your end in the next QFE/Update)?

    In the interim, if you can provide some private bits that address this issue (dynamic memory), I can patch the nodes.

    Thank you!

    • Edited by SRIRAM R Monday, October 22, 2018 7:19 PM
    Monday, October 22, 2018 5:42 PM
  • Thanks for the feedback. We will add it to the list of new feature requests, but at this stage there is no plan for when it can be done.

    As a workaround, you can run the following PowerShell script on the head node. Every 30 seconds it checks whether there are any offline nodes and brings them online:

    Add-PsSnapin Microsoft.Hpc
    while ($true) {
        # Bring any offline nodes back online, then wait 30 seconds
        Get-HpcNode -State Offline -ErrorAction SilentlyContinue | Set-HpcNodeState -State Online -ErrorAction SilentlyContinue
        Start-Sleep -Seconds 30
    }

    Wednesday, October 24, 2018 1:52 AM