HPC Pack 2016 Error connecting to head node remotely through HPC Cluster Manager

  • Question

  • Hello,

    Our cluster runs Windows HPC Pack 2016 Update 2 with patch KB4481650-x64. We are able to run the HPC Cluster Manager from the headnode (using localhost) and from the compute nodes (connecting remotely to the headnode) without any issues. On another computer that is not part of the cluster, the client utilities have been installed, but trying to connect to the headnode through the HPC Cluster Manager throws an error. The HPC Job Manager has no issues connecting.

    The beginning of the error message is provided at the end of this post. The full message is very long; I truncated it to what I thought mattered, since I had to type it by hand.

    I tried reinstalling the client utilities. Initially I installed it selecting 'Skip CN and CA' in the Certificate step of the installation. I have since tried all 3 options (CN&CA, CN, and none) using the self-signed certificate that I generated when creating the cluster.

    Any suggestions on how to solve this issue?

    Thanks,

    -Michael

    The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.

    Tuesday, January 22, 2019 8:31 PM
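
When reading a message like the one quoted above, it helps to remember that .NET prints nested exceptions outermost first, separated by `--->`, so the last entry is the root cause. The following is a minimal sketch for pulling that entry out of a pasted message; the function names are made up for illustration, and the sample string is only the tail of the message from the post.

```python
from typing import List, Tuple

# Tail of the error text from the original post (outermost wrapper omitted).
SAMPLE = (
    "System.Net.Http.HttpRequestException: An error occurred while sending "
    "the request. ---> System.Net.WebException: The underlying connection was "
    "closed: Could not establish trust relationship for the SSL/TLS secure "
    "channel. ---> System.Security.Authentication.AuthenticationException: "
    "The remote certificate is invalid according to the validation procedure."
)


def split_exception_chain(message: str) -> List[str]:
    """Split a .NET exception message into its entries, outermost first."""
    return [part.strip() for part in message.split("--->") if part.strip()]


def root_cause(message: str) -> Tuple[str, str]:
    """Return (exception type, description) for the innermost entry."""
    innermost = split_exception_chain(message)[-1]
    exc_type, _, description = innermost.partition(":")
    return exc_type.strip(), description.strip()


exc_type, description = root_cause(SAMPLE)
print(exc_type)
print(description)
```

Here the innermost entry is the `AuthenticationException`, which points at certificate validation on the client rather than at connectivity or the retry logic.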

Answers

  • Good catch, Michael! This is confirmed as an issue in the QFE KB4481650 (version 5.2.6291). When installed on a client machine, it changes the previous setting from "Skip both CN and CA validation" to "Skip CN validation", so SSL/TLS certificate validation fails. There is a simple workaround: update the following registry value on the client.

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\HPC]
    "CertificateValidationType"=dword:00000000

    Regards,

    Yutong Sun

    • Marked as answer by MichaelEnders Monday, January 28, 2019 1:07 PM
    Friday, January 25, 2019 9:47 AM
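
For anyone who needs to apply the registry change above on several client machines, here is a minimal Python sketch. It is not part of HPC Pack; the function names are hypothetical. The only value mapping it relies on is the one stated in the answer (0 restores "Skip both CN and CA validation"). It assumes an elevated prompt, and it assumes the key lives in the 64-bit registry view, which the x64 patch name suggests but the thread does not confirm.

```python
import sys

HPC_KEY_PATH = r"SOFTWARE\Microsoft\HPC"
VALUE_NAME = "CertificateValidationType"
SKIP_CN_AND_CA = 0  # per the answer above; other values are not documented here


def render_reg_file(value: int = SKIP_CN_AND_CA) -> str:
    """Build .reg file text equivalent to the snippet in the answer."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[HKEY_LOCAL_MACHINE\\" + HPC_KEY_PATH + "]",
        '"{}"=dword:{:08x}'.format(VALUE_NAME, value),
        "",
    ]
    return "\n".join(lines)


def apply_on_this_machine(value: int = SKIP_CN_AND_CA) -> None:
    """Write the value directly; Windows only, needs administrator rights."""
    if sys.platform != "win32":
        raise RuntimeError("the registry is only available on Windows")
    import winreg  # imported lazily so the module loads on other platforms

    # Assumption: the HPC key is in the 64-bit view, so ask for it explicitly
    # in case this runs under a 32-bit Python.
    access = winreg.KEY_SET_VALUE | winreg.KEY_WOW64_64KEY
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, HPC_KEY_PATH, 0, access) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, value)


print(render_reg_file())
```

`render_reg_file()` only produces text, so it can be reviewed before importing; `apply_on_this_machine()` makes the change immediately, and the client tools probably need to be restarted afterwards to pick it up.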

All replies

  • Hi Michael,

    This looks like a certificate issue. Did you follow the steps here to create the self-signed certificate for the cluster? Normally there is no need to import the cert on the client machine if you choose to skip both CN and CA validation.

    Regards,

    Yutong Sun

    Wednesday, January 23, 2019 8:31 AM
  • Normally there is no need to import the cert on the client machine if you choose to skip both CN and CA validations.

    Hi Yutong,

    Does this comment also apply to the HPC Cluster Manager software? As mentioned, the HPC Job Manager works; it is the Cluster Manager that doesn't, and I tried all 3 certificate options during client installation.

    On the server side I followed the steps in the instruction set that you linked to. In step 1.8 I created a self-signed certificate from the installation wizard, since our system is a single-headnode cluster. Then, in step 3.4, I created a self-signed certificate for the compute nodes in the Cluster Manager.

    Any more ideas?

    Thanks,

    -Michael

    Thursday, January 24, 2019 1:18 PM
  • Hi Yutong,

    I tested a few things, and it seems that something got broken between HPC Pack versions 5.2.6277.0 (from "HPCPack2016Update2-Full-v6277") and 5.2.6291.0 (from "KB4481650_x64").

    • With headnode, compute nodes and clients on version 5.2.6291.0, I'm having the issues described in the original post.
    • With headnode and compute nodes on version 5.2.6291.0, and clients on version 5.2.6277.0, I'm able to connect to the headnode through the HPC Cluster Manager without any issues.
    • If I apply the patch on the client and bring everything back to 5.2.6291.0, then the issues in the original post return.

    FYI, I have not tested with everything at 5.2.6277.0, since I don't want to reinstall/reconfigure the headnode and compute nodes. I assume it works the same as the mixed-version case above.

    A 'separate' issue also seems to have been introduced, which is related to this post. Are you able to confirm the issue on 5.2.6291.0?

    Regards,

    -Michael

    Thursday, January 24, 2019 1:49 PM
  • Hello Yutong,

    The registry key modification suggested fixed the issue with my version 5.2.6291.0.

    Thanks for coming up with a solution so quickly.

    Regards,

    -Michael

    Monday, January 28, 2019 1:07 PM
  • Hi there,

    We have a problem similar to Michael's in the first post. Connecting with the HPC Cluster Manager from the headnode, compute nodes and workstation nodes works fine. All these computers are domain joined. From other computers, which are not part of the cluster and not domain joined, the HPC Cluster Manager throws the error from the first post. The installation on these computers was made with "Check both CA and CN", and we imported the certificate created by the installation wizard in step 1.8 of the instruction set. Setting the validation type to 0 in the registry throws a different error. The HPC Job Manager runs fine, but not the HPC Cluster Manager.

    We tried installing both self-signed certificates (HpcHnPublicCert.cer and HpcCnCommunication.pfx) in various places in the local computer and user certificate stores. Nothing helps.

    Any more ideas?

    Regards,

    Thomas


    • Edited by YoDommo Tuesday, March 12, 2019 5:30 PM
    Tuesday, March 12, 2019 5:02 PM
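
One thing that can be checked in a setup like this is whether the certificate imported on the client is the same one the head node actually presents. The sketch below compares SHA-1 thumbprints, which is how Windows identifies certificates in its stores. It is only a diagnostic aid, not a fix; the function names are hypothetical, and the host name and port in the usage note are placeholders, since the thread does not say which endpoint the Cluster Manager connects to.

```python
import hashlib
import ssl

PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def thumbprint_from_der(der_bytes: bytes) -> str:
    """SHA-1 over the DER encoding, upper-case hex, as Windows displays it."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def thumbprint_from_cert_bytes(data: bytes) -> str:
    """Accept a .cer file's content in either Base-64 (PEM) or binary DER form."""
    if data.lstrip().startswith(PEM_MARKER):
        data = ssl.PEM_cert_to_DER_cert(data.decode("ascii").strip())
    return thumbprint_from_der(data)


def thumbprint_from_cert_file(path: str) -> str:
    """Thumbprint of an exported public certificate such as HpcHnPublicCert.cer."""
    with open(path, "rb") as handle:
        return thumbprint_from_cert_bytes(handle.read())


def thumbprint_presented_by(host: str, port: int) -> str:
    """Fetch the certificate a TLS endpoint presents, without validating it."""
    pem = ssl.get_server_certificate((host, port))
    return thumbprint_from_der(ssl.PEM_cert_to_DER_cert(pem))
```

A comparison could then look like `thumbprint_from_cert_file("HpcHnPublicCert.cer") == thumbprint_presented_by("headnode.example.local", 443)`, with the host and port replaced by the real endpoint. A mismatch would suggest that the imported certificate is not the one in use on the head node.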
  • Hi Yutong, I have the same issue, but after checking the registry I get a different error.

    Here is the beginning of it:

    The connection to the scheduler service failed. detail error: System.ArgumentNullException: Value cannot be null.
    Parameter name: findValue
       at System.Security.Cryptography.X509Certificates.X509Certificate2Collection.FindCertInStore(SafeCertStoreHandle safeSourceStoreHandle, X509FindType findType, Object findValue, Boolean validOnly)
       at System.Security.Cryptography.X509Certificates.X509Certificate2Collection.Find(X509FindType findType, Object findValue, Boolean validOnly)

    The client machine isn't a member of the cluster domain.

    On the client machine I installed the client utilities from Update 2 and KB4481650.

    The cluster is on the same version.

    • Edited by daboriginal Thursday, June 20, 2019 6:28 PM
    Thursday, June 20, 2019 6:26 PM