Head node can't see client node in the network

  • Question

  • Hi,

    I am trying to set up an HPC environment on several Windows Server 2016 machines with HPC Pack 2016. The head node works fine, although it is not joined to a domain and uses only its local computer name. The client node and head node can ping each other by either local computer name or IP address.

    On the client node, I followed "Step 4: Add Windows nodes to the cluster" and got no errors while adding the head node host and verifying the cert. After setup completed on the client, I could start the Cluster Manager on the client side, which connects me to the head node.

    The problem is that this client node is not showing anywhere inside the HPC Cluster Manager. When I chose 'Add nodes to your cluster', nothing showed up.

    Please help, thanks.

    Tuesday, June 9, 2020 2:21 AM

Answers

  • Hi Bingyu,

    By client node, do you mean compute node? Did you run setup.exe on the compute node to do the installation and connect to the head node? You may try adding the compute node name to the hosts file on the head node and restarting the HpcManagement service to see if it works.
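
    As a minimal sketch of the hosts-file step, the Python below checks whether hosts-file text already maps a node name and appends an entry if it does not. The node name `CN01` and address `10.0.0.5` are hypothetical placeholders, and the functions work on plain text only; writing the result back to `C:\Windows\System32\drivers\etc\hosts` and restarting the HpcManagement service are left as manual steps.

```python
def hosts_has_entry(hosts_text: str, hostname: str) -> bool:
    """Return True if any non-comment line maps an address to hostname."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (name.lower() for name in fields[1:]):
            return True
    return False


def ensure_entry(hosts_text: str, ip: str, hostname: str) -> str:
    """Return hosts_text with an 'ip hostname' line appended if it is missing."""
    if hosts_has_entry(hosts_text, hostname):
        return hosts_text
    separator = "" if (not hosts_text or hosts_text.endswith("\n")) else "\n"
    return f"{hosts_text}{separator}{ip}\t{hostname}\n"


if __name__ == "__main__":
    # Hypothetical example content; a commented-out entry does not count.
    sample = "127.0.0.1 localhost\n# 10.0.0.5 CN01\n"
    print(ensure_entry(sample, "10.0.0.5", "CN01"))
```

    The comparison is case-insensitive because Windows computer names are not case-sensitive.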

    Note, we only support non-domain-joined clusters on Azure. For on-premises clusters, it is recommended to join the nodes to a domain, although simple scenarios can work without a domain.

    Regards,

    Yutong Sun


    • Marked as answer by Bingyu_Huang Monday, June 22, 2020 11:39 PM
    Tuesday, June 9, 2020 3:15 PM

All replies

  • Hi Yutong,

    Thanks for letting me know. Yes, I was referring to the compute node. I ran setup.exe several times on the compute node and set up the hosts file on both the head node and the compute node. The strange thing was that I could even launch and use the "HPC Cluster Manager" app on the compute node, yet the compute node did not show up under "unapproved" or any other tab.

    I have moved forward with a domain to keep my project going, but it would be helpful if you could explain a little more about the 'simple scenarios' you mentioned. What kind of activities would count as simple?

    BTW, I have been running some cluster tests using mpipingpong.exe by default. The info seems useful, but it seems to be available only after a task is done. Is there a way to get ongoing metrics?

    Thanks

    Monday, June 22, 2020 11:39 PM
  • Hi Bingyu,

    For non-domain-joined compute nodes, you may need to add the head node name/IP to the hosts file on the compute nodes as well and restart the HpcManagement service, so that the compute nodes can report to the head node.
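
    A rough way to confirm that names resolve in both directions is to run a small check on each node. The sketch below is an illustration only: the node names `HEADNODE` and `CN01` are hypothetical, and the resolver is passed in as a parameter so the function can be tried without a real network.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError.
            results[name] = None
    return results


if __name__ == "__main__":
    # Hypothetical node names; replace with the real head and compute node names.
    for name, address in check_resolution(["HEADNODE", "CN01"]).items():
        print(f"{name}: {address if address else 'NOT RESOLVED'}")
```

    Any name reported as not resolved on a given node is a candidate for a hosts-file entry on that node.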

    By simple scenarios, I mean submitting jobs/tasks that can run under a local user account. For mpipingpong, there seem to be only aggregated final results, without intermediate metrics.

    Regards,

    Yutong Sun

    Wednesday, July 8, 2020 9:48 AM
  • Hi Yutong,

    Thanks for replying.

    I am continuing to explore the HPC cluster and am finding it hard to get a clear idea of how certificates should be used as a best practice. I don't have an existing CA that meets the requirements, so I am using a self-signed certificate. It was generated by the UI installer, but I would like to use CreatHpcCertificate.ps1 instead.

    What I am doing now is:

    1. Generate a .pfx for the head node using CreatHpcCertificate.ps1, then run an unattended setup.exe installation, passing the .pfx path as a parameter.

    2. Click "Import a certificate for deployment" in the HPC Manager app (how could I do this using HPC PowerShell?). After that I can get both the public .cert from the head node and HpcCnCommunication.pfx.

    3. Transfer both the .pfx and the .cert to a new compute node, then run an unattended setup.exe installation, passing the .pfx path and .cert path as parameters.

    The cluster seems to be working well this way, but I am not sure if it's the best practice.

    My concern is: why am I supposed to pass HpcCnCommunication.pfx, which contains a private key, from one machine to another? I tried generating a .pfx with CreatHpcCertificate.ps1 on the new compute node itself, instead of using the one from the head node, and installing with the public .cert from the head node. But it does not work properly, with the same result as the original problem in this thread: the compute node can log in to the HPC Manager app purely as a client but does not show up as a node.

    Please help with some explanations, thanks.

    Thursday, July 9, 2020 3:26 PM
  • Hi Bingyu,

    Yes, it is feasible to use the same communication certificate .pfx file on the head node and compute nodes. Check this for details. The certificate is used to secure the connections among the nodes.
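
    One way to check that two nodes really hold the same certificate is to compare thumbprints. The sketch below assumes the public certificate has been exported from each node as a .cer file (DER or PEM); it does not read .pfx files, and the file names are hypothetical. A thumbprint is the SHA-1 hash of the DER-encoded certificate.

```python
import hashlib
import ssl

PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def thumbprint(cert_bytes: bytes) -> str:
    """Return the SHA-1 thumbprint (uppercase hex) of a DER or PEM certificate."""
    if PEM_MARKER in cert_bytes:
        # Strip surrounding whitespace so the PEM header check succeeds.
        der = ssl.PEM_cert_to_DER_cert(cert_bytes.decode("ascii").strip())
    else:
        der = cert_bytes
    return hashlib.sha1(der).hexdigest().upper()


def same_certificate(path_a: str, path_b: str) -> bool:
    """Compare two exported certificate files by thumbprint."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        return thumbprint(a.read()) == thumbprint(b.read())


if __name__ == "__main__":
    # Hypothetical file names for certificates exported from each node.
    print(same_certificate("headnode.cer", "computenode.cer"))
```

    If the thumbprints differ, the compute node is using a different certificate from the head node, which matches the failing setup described above.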

    Regards,

    Yutong Sun

    Monday, July 13, 2020 10:02 AM