HPC broken?

  • Question

  • I am running a cluster of four servers for development and testing purposes. I have Windows Server 2012R2 installed on all machines and Microsoft HPC Pack 2012R2. After applying Windows updates some time ago, the cluster stopped working. It seems that there is some authentication issue.

    I reinstalled the cluster from scratch, both the OS and HPC Pack, and the issue persists. It seems that the nodes can't access shared resources. My log files contain gigabytes of error messages of the type 'password for MYDOMAIN\myuser is incorrect'.

    I tried installing HPC Pack 2016 instead, but I could not install it at all. Then I reinstalled the head node with Windows Server 2016 and attempted the HPC Pack 2016 installation again; it failed too.

    I get to the point where the installer installs prerequisites. As it starts installing Microsoft Service Fabric Cluster, a PowerShell window appears with the following question:

    "Security warning
    Run only scripts that you trust. While scripts from the internet can be useful, this script can potentially harm your
    computer. If you trust this script, use the Unblock-File cmdlet to allow the script to run without this warning
    message. Do you want to run E:\data\HPC Pack
    2016\ServiceFabric\DeploymentComponents\Microsoft.ServiceFabric.Powershell.Types.ps1xml?
    [D] Do not run  [R] Run once  [S] Suspend  [?] Help (default is "D"):"

    I press R and a similar message appears again. I press R again. Then the installer shows the following error message:

    "Component Microsoft Service Fabric Cluster cannot be installed with error code 1. If a different version of Service Fabric runtime or SDK has already been installed in this machine, uninstall them and run setup. For more details, please check the Service Fabric deployment logs in the folders setup\Deployment Traces and C:\ProgramData\SF\Log\Traces."

    I don't have a prior installation of Service Fabric runtime or SDK and the above-mentioned directories don't even exist. If I install the service runtime and SDK manually from outside the HPC Pack installer, it complains that the fabric service is not running. I start it manually but the installer is still complaining that the service is not running.

    Does anybody here know what's wrong with the HPC packs recently and how to work around the issue?

    Tuesday, January 24, 2017 2:58 PM

All replies

  • Hi,

    Have you checked our release notes for HPC Pack 2016? You need to unblock the downloaded file first; otherwise the Service Fabric installation will fail.


    Qiufang Shi

    Wednesday, January 25, 2017 2:54 AM
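    For readers who want to check whether a download is still blocked before running setup, here is a minimal Python sketch. It relies on the fact that Windows marks downloaded files with an NTFS alternate data stream named Zone.Identifier, which Unblock-File (or the Unblock checkbox in the file's Properties dialog) removes. The function names are hypothetical, and treating zones 3 (Internet) and 4 (Restricted) as "blocked" is an assumption of this sketch, not something stated in the release notes.

```python
import configparser
from typing import Optional


def parse_zone_id(stream_text: str) -> Optional[int]:
    """Return the ZoneId found in Zone.Identifier stream text, or None."""
    # interpolation=None so '%' characters in URL fields are left alone
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(stream_text)
        return parser.getint("ZoneTransfer", "ZoneId")
    except (configparser.Error, ValueError):
        return None


def read_zone_id(path: str) -> Optional[int]:
    """Read the Zone.Identifier alternate data stream of a file (NTFS only)."""
    try:
        with open(path + ":Zone.Identifier", "r", errors="replace") as stream:
            return parse_zone_id(stream.read())
    except OSError:
        # No such stream (file already unblocked) or not an NTFS volume
        return None


def is_blocked(zone_id: Optional[int]) -> bool:
    """Assumption: zone 3 (Internet) and zone 4 (Restricted) count as blocked."""
    return zone_id is not None and zone_id >= 3
```

    Calling is_blocked(read_zone_id(path_to_zip)) on the downloaded ZIP before extracting it shows whether the file still needs to be unblocked. Files extracted from a blocked ZIP can inherit the mark, which would be consistent with the .ps1xml prompt quoted in the question.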
  • Thanks for helping! Once I downloaded the ZIP file to a local disk that is not shared and checked Unblock, I was able to proceed with the installation. However, after installation I could not start HPC Cluster Manager. It gives me the following error when I try to select a head node:

    The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> Microsoft.SystemDefinitionModel.InstanceCacheLoadException: The instance 00000000-0000-0000-0000-000000000000 cannot be resolved in the current instance view.
       at Microsoft.SystemDefinitionModel.InstanceSpace.CommittedInstancesView.ResolveToFullId(Guid id)
       at Microsoft.SystemDefinitionModel.ModelQuery.ResolveToFullInstanceId(Guid instanceId)
       at Microsoft.SystemDefinitionModel.ModelQuery.GetInstance(Guid instanceId)
       at Microsoft.SystemDefinitionModel.ModelQuery.GetRootInstance(Boolean createIfMissing)
       at Microsoft.SystemDefinitionModel.ModelQuery.FindInstance(String xpath)
       at Microsoft.ComputeCluster.Management.ClusterModel.HeadNodeConnectionManager.PingNodes()
       at Microsoft.ComputeCluster.Management.ClusterModel.HeadNodeConnectionManager.InitPingNodes()
       at Microsoft.ComputeCluster.Management.ClusterManager.Initialize()
       at Microsoft.ComputeCluster.Management.ClusterManager.ConnectCoreAsync()
       at Microsoft.Hpc.RetryManager.<>c__DisplayClass34_0.<<InvokeWithRetryAsync>b__0>d.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
       --- End of inner exception stack trace ---
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__34.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__87.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)


    If I try reinstalling the HPC pack, the installer says that fabric service is installed but not running. Any clues what could be wrong this time?

    Wednesday, January 25, 2017 9:44 AM
  • Hi, 

    If you open HPC Cluster Manager immediately after the installation completes, you may encounter this error, because the services need some time to initialize. Try again later; it should succeed.

    By the way, you have only a single head node, right? With a single head node, initialization may take longer.

    Monday, February 6, 2017 1:59 AM
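    Rather than retrying by hand, the "try again later" advice can be given a time limit. The Python sketch below polls a TCP port until it accepts a connection or a deadline passes. It is only a reachability check: an open port does not prove that the management service has finished initializing, and the function name, the timeouts, and the choice of port are assumptions of this sketch.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0,
                  interval_s: float = 5.0, connect_timeout_s: float = 3.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            pass  # refused, unreachable, or timed out: not ready yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

    For example, wait_for_port("localhost", 10400) would wait up to ten minutes for the Service Fabric management portal port mentioned later in this thread. If the call returns False, waiting longer is unlikely to help and the logs are the better place to look.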
  • This thread may be a bit old, but the issue is still very much present. I too have experienced this exact problem.

    I let HPC Cluster Manager initialize for a day, but it still cannot start up. Additionally, HPC Pack 2016 is incapable of uninstalling itself correctly from a head node running Windows Server 2012 R2. The installer complains that Service Fabric is installed but not running, and any attempt to stop, delete, or uninstall Service Fabric components or services only breaks the HPC Pack installer further (I am now at the point where the installer displays cryptic "fatal error" messages and an error code with no online documentation to accompany it).

    Currently the only solution to reinstalling HPC Pack 2016 on a head node that I know of is a fresh installation of the entire OS.

    If anyone is aware of a way to 1) properly uninstall HPC Pack 2016 on a head node; and 2) fix HPC Pack 2016 when it fails to start after installation and displays the above error message (The connection to the management service failed. detail error:...); please let me know!

    Many thanks in advance to anyone who alleviates my frustrations with this program.

    Friday, June 2, 2017 6:06 PM
  • Hi Hunter,

    Besides uninstalling the HPC service component from Control Panel, you also need to remove hpcApplication and HPCApplicationType from the Service Fabric management portal. You can reach the portal from the head node at https://localhost:10400

    By the way, could you send me the following logs from the head node: HpcManagementHN*.bin under C:\Program Files\Microsoft HPC Pack 2016\Data\logfiles\management.

    You can email them to me at yongtia@microsoft.com

    • Proposed as answer by Hunter02 Wednesday, June 7, 2017 6:14 PM
    Monday, June 5, 2017 1:59 AM
  • I was able to get HPC Pack 2016 successfully installed. However, I had to look at the error logs in Windows Event Viewer to figure out what was wrong. It turns out that HPC Pack did not like the DNS name I provided when creating my SSL certificate; creating a new certificate with this option left unspecified fixed the issue.

    From a user perspective, I think it would be helpful to display error messages more specific to the issue within HPC Pack Manager itself. The software appears to have correctly identified that the certificate was trying to use the wrong DNS name, but the need to look in other places to find out what the problem is makes solving it more difficult. It seems that users receive the above generic error message regardless of the reason why HPC Manager cannot start.

    Regarding properly uninstalling HPC Pack 2016, I encounter an error when I try to access https://localhost:10400

    Both Internet Explorer and Google Chrome report that access to localhost:10400 is denied. Would you happen to have any suggestions?

    I greatly appreciate your help and patience, many thanks.

    Monday, June 5, 2017 8:13 PM
  • Hi Hunter,

    Thanks for your suggestion; we will consider improving this in HPC 2016 Update 1.

    Regarding the access-denied error on localhost:10400: you need to install that certificate into the personal certificate store of the current user, because the Service Fabric portal requires client certificate authentication.

    Tuesday, June 6, 2017 6:37 AM
  • I was able to get as far as installing the very last HPC component and then it failed. However, after performing a combination of deactivating the nodes on service fabric management portal, deleting Microsoft HPC Pack 2016 from Program Files, and restarting the computer, I successfully reinstalled HPC Pack 2016 on the head node. I then re-entered service fabric management portal and reactivated the nodes.

    I'm not clear on which of the steps worked, but at least it worked. If I figure out exactly what fixed it in the end, I'll post it.

    Thanks again for your help, greatly appreciated!

    Wednesday, June 7, 2017 6:14 PM
  • Hello Yongjun,

    Please provide a way to replace a certificate or how to fix the problem by explicitly specifying a DNS identity.

    It is a hell of a job to reinstall the complete package when you make a mistake with the certificate.  And it is not clear what is used from the certificate to determine the DNS name. Is it the CN or the SAN?

    Thanks  

    Friday, July 28, 2017 12:43 PM
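    The thread does not say whether HPC Pack reads the CN or the SAN, so the sketch below does not answer that question. It only shows one way to list both sets of names from a certificate and see which of them cover a given host name, which may help spot a mismatch before installing. It assumes the certificate is available as a dictionary in the shape returned by Python's ssl.SSLSocket.getpeercert(); the function names are hypothetical.

```python
from typing import Any, Dict, List


def cert_names(cert: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect subject CN values and SAN DNS entries from a getpeercert()-style dict."""
    cn = [value
          for rdn in cert.get("subject", ())
          for (key, value) in rdn
          if key == "commonName"]
    san = [value
           for (kind, value) in cert.get("subjectAltName", ())
           if kind == "DNS"]
    return {"CN": cn, "SAN": san}


def name_matches(pattern: str, host: str) -> bool:
    """Case-insensitive comparison; a leading '*.' covers exactly one label."""
    pattern, host = pattern.lower(), host.lower()
    if pattern.startswith("*."):
        head, _, rest = host.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == host


def covering_fields(cert: Dict[str, Any], host: str) -> List[str]:
    """Return which of 'CN' and 'SAN' contain a name that matches host."""
    names = cert_names(cert)
    return [field for field in ("CN", "SAN")
            if any(name_matches(name, host) for name in names[field])]
```

    If covering_fields returns an empty list for the name the cluster uses to reach the head node, the certificate names that host in neither field, which is one plausible reading of the DNS-name error reported earlier in this thread.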
  • Were you able to find a solution?
    Friday, November 10, 2017 1:15 AM