New HPC Pack 2016 Deployment

  • Question

  • We've been running HPC Pack 2008 on a 2008R2 blade cluster for a few years, and it was a doddle to set up, and worked pretty much faultlessly.

    We're now upgrading everything to Server 2016/HPC Pack 2016, and there are problems from the off. We have a single head node and will have multiple compute nodes. I'm building the head node now, and have read through the MS docs and walkthrough on setup.

    The head node installation went through ok with no errors. I had concerns about the cert (we're going to use a single cert for everything), but the installer detected the one I'd set up and enrolled, and all looked ok.

    All services started ok, but when I try to run Cluster Manager, it times out and eventually reports:

    The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> Microsoft.SystemDefinitionModel.InstanceCacheLoadException: The instance 00000000-0000-0000-0000-000000000000 cannot be resolved in the current instance view.
       at Microsoft.SystemDefinitionModel.InstanceSpace.CommittedInstancesView.ResolveToFullId(Guid id)
       at Microsoft.SystemDefinitionModel.ModelQuery.ResolveToFullInstanceId(Guid instanceId)
       at Microsoft.SystemDefinitionModel.ModelQuery.GetInstance(Guid instanceId)
       at Microsoft.SystemDefinitionModel.ModelQuery.GetRootInstance(Boolean createIfMissing)
       at Microsoft.SystemDefinitionModel.ModelQuery.FindInstance(String xpath)
       at Microsoft.ComputeCluster.Management.ClusterModel.HeadNodeConnectionManager.PingNodes()
       at Microsoft.ComputeCluster.Management.ClusterModel.HeadNodeConnectionManager.InitPingNodes()
       at Microsoft.ComputeCluster.Management.ClusterManager.Initialize()
       at Microsoft.ComputeCluster.Management.ClusterManager.ConnectCoreAsync()
       at Microsoft.Hpc.RetryManager.<>c__DisplayClass34_0.<<InvokeWithRetryAsync>b__0>d.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
       --- End of inner exception stack trace ---
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__34.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__87.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__86.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__85.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Admin.ConnectionManager.<ConnectManagementService>b__10_0(Task`1 t)

    There are various event log errors such as:

    Application: HpcSessionStateful.exe
    Framework Version: v4.0.30319
    Description: The application requested process termination through System.Environment.FailFast(string message).
    Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.ArgumentNullException: Value cannot be null.
    Parameter name: dnsName
       at Microsoft.Hpc.Scheduler.Session.Internal.Common.CommonSchedulerHelper.<GetScheduler>d__0.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Session.Internal.SessionLauncher.BrokerNodesManager..ctor()
       at Microsoft.Hpc.Scheduler.Session.Internal.LauncherHostService.LauncherHostService.<OpenService>d__9.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Session.Internal.SessionLauncher.SessionLauncherStatefulService.<RunAsync>d__2.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.<ExecuteRunAsync>d__f.MoveNext()
    Stack:
       at System.Environment.FailFast(System.String)
       at Microsoft.ServiceFabric.Services.Runtime.ServiceHelper.HandleRunAsyncUnexpectedException(System.Fabric.IServicePartition, System.Exception)
       at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter+<ExecuteRunAsync>d__f.MoveNext()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
       at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
       at System.Threading.Tasks.Task.FinishContinuations()
       at System.Threading.Tasks.Task`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Boolean)
       at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(Boolean)
       at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter+<WaitForWriteStatusAsync>d__18.MoveNext()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
       at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
       at System.Threading.Tasks.Task.FinishContinuations()
       at System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(System.Threading.Tasks.VoidTaskResult)
       at System.Threading.Tasks.Task+DelayPromise.Complete()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.TimerQueueTimer.CallCallback()
       at System.Threading.TimerQueueTimer.Fire()
       at System.Threading.TimerQueue.FireNextTimers()
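
    Most of the lines in traces like the ones above are async plumbing; the informative part is usually the chain of exception types and their messages. As a rough aid for reading them, here is a minimal Python sketch that pulls those out of pasted log text. The function name `summarize_exceptions` and the regular expression are illustrative assumptions, not part of HPC Pack or any Microsoft tooling, and the pattern only covers the "Namespace.SomeException: message" shape seen in this thread.

```python
import re

# Assumed pattern: a dotted .NET type name ending in "Exception", then ": message".
# The message stops at an inner-exception marker (" ---> ") or at the end of the line.
EXCEPTION_RE = re.compile(
    r"\b((?:[A-Z]\w*\.)+\w*Exception): ([^\r\n]*?)(?= ---> |[\r\n]|$)"
)


def summarize_exceptions(log_text):
    """Return (exception_type, message) pairs in order of appearance, without repeats."""
    seen = []
    for match in EXCEPTION_RE.finditer(log_text):
        pair = (match.group(1), match.group(2).strip())
        if pair not in seen:
            seen.append(pair)
    return seen


# A shortened excerpt of the event log entry quoted above.
sample = (
    "Message: RunAsync failed due to an unhandled exception causing the host "
    "process to crash: System.ArgumentNullException: Value cannot be null.\n"
    "Parameter name: dnsName\n"
    "   at Microsoft.Hpc.Scheduler.Session.Internal.Common."
    "CommonSchedulerHelper.<GetScheduler>d__0.MoveNext()\n"
)

for exc_type, message in summarize_exceptions(sample):
    print(f"{exc_type}: {message}")
```

    Run against the full text, this kind of filter reduces the Cluster Manager error to RetryCountExhaustException wrapping InstanceCacheLoadException, and the event log entry to the ArgumentNullException on dnsName.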

    Everything seems a lot more complicated in HPC Pack 2016. I haven't even installed the HPC updates yet; I just want to get the basics working. Can anyone advise? I really have to get the cluster back up and running asap.

    Thanks

    Friday, May 10, 2019 3:26 PM

All replies

  • Well, no response at all so far. As this is a brand new install on newly commissioned Server 2016 machines, I'm surprised that I've had this problem. All these nodes are domain joined and freshly built, with the plan being to have one head node using one cert. I could really do with some help here. If MS are serious about enterprises using Windows HPC, they need to provide some useful support when things don't work as the docs say they should!
    Monday, May 13, 2019 2:50 PM
  • Hi, 

    It looks like you are installing the HPC Pack 2016 RTM version, which is out of date.

    Could you uninstall it and install the latest version, i.e. HPC Pack 2016 Update 2?

    You can download from the following link:

    https://www.microsoft.com/en-us/download/details.aspx?id=57344

    Also, from HPC Pack 2016 Update 1 onwards, a Service Fabric cluster is no longer needed for a single head node, so you can use the PowerShell script in the ServiceFabric folder of the installation package to clean up the Service Fabric cluster on your head node:

    .\CleanFabric.ps1

    Best Regards,

    Sunbin

    Tuesday, May 14, 2019 1:57 AM