3 HPC head nodes in High Availability with built-in Service Fabric across 2 Data Centers

  • Question

  • Hi All,

    What is the best practice for setting up an HPC grid across 2 data centers for disaster recovery?

    Do we create 3 head nodes in each DC (6 in total) and connect them all to a single remote DB, or should we have a separate remote DB for each cluster?

    Should we be able to continue running jobs in the DR data center after the primary data center powers down?

    Any documentation describing the above would be appreciated.

    Appreciate the help.



    • Edited by juliakir Monday, January 20, 2020 5:27 PM
    Monday, January 20, 2020 5:24 PM

All replies

  • Hi Julia,

    Do you want to have one HA cluster across the two DCs? That is, in the normal situation, HA head nodes in the two DCs manage compute nodes in both DCs; if one DC is down, the remaining HA head nodes manage the remaining compute nodes in the other DC, and clients always see one HA cluster with one cluster name/connection string?


    Yutong Sun

    Friday, January 31, 2020 6:22 PM
  • Hi Yutong,

    That is correct. Per my findings, an HPC cluster with 3 head nodes cannot work when 2 head nodes are down and only 1 is up. So if we totally shut down Datacenter1 (in case of disaster) and 2 of the head nodes are in that data center, then Datacenter2, with only 1 head node, will not work.

    Please advise, as this is delaying our implementation greatly.



    Friday, January 31, 2020 7:15 PM
  • Correct, Julia. So we may need a third site/DC for the third head node to make this HA scenario work.
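
    The behavior Julia observed follows from the majority-quorum rule used by Service Fabric-style clusters: the cluster stays up only while a strict majority of its head nodes survive. A minimal sketch of that rule (illustrative only, not HPC Pack code; `has_quorum` is a hypothetical helper):

    ```python
    def has_quorum(surviving: int, total: int) -> bool:
        """A cluster keeps quorum only if a strict majority of nodes survive."""
        return surviving > total // 2

    # 3 head nodes split 2 (DC1) + 1 (DC2):
    print(has_quorum(1, 3))  # DC1 lost -> only 1 of 3 survives -> False (cluster down)
    print(has_quorum(2, 3))  # DC2 lost -> 2 of 3 survive -> True (cluster up)

    # With a third site holding the third head node (1 + 1 + 1), any single
    # site failure still leaves 2 of 3 nodes, so quorum is preserved.
    ```

    This is why a 2-DC split of 3 head nodes can never survive the loss of the 2-node DC, while a 1+1+1 layout across three sites survives any single-site failure.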

    As you may know, we will release Microsoft HPC Pack 2019 shortly, by the end of this March. In HPC Pack 2019 we will support a built-in lightweight HA model for head nodes. The setup and maintenance should be simple, and only two head nodes are required. The preview version was released last November. Please check the preview info at https://aka.ms/hpcgit


    Yutong Sun

    Thursday, February 6, 2020 10:48 AM