HPC Cluster Manager -- Select Head Node fails after reboot

  • Question

  • I just had a clean install of HPC Pack for a cluster of Windows Server 2016 Essentials computers up and running this morning. The domain controller installed some Windows updates. Then, after the reboot, I can't open HPC Cluster Manager. I get a dialog asking me to select the head node. When I choose the name of the domain controller (head node), I get the error message shown at the end of this post.

    In the Windows HPC Server logs, I see warnings "Cannot connect to HPC Scheduler Service. Will try later."

    However, in Services, the HPC Job Scheduler Service is running. At command prompt, "net start hpcscheduler" says the service is already started.

    Any help appreciated.

    The connection to the scheduler service failed. detail error: System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://mtc-dc:5802/SchedulerStoreService. The connection attempt lasted for a time span of 00:00:04.0631206. TCP error code 10061: No connection could be made because the target machine actively refused it [fc00::f145:b28f:c5bf:7e43]:5802.  ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it [fc00::f145:b28f:c5bf:7e43]:5802
       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
       at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
       --- End of inner exception stack trace ---

    Server stack trace: 
       at System.ServiceModel.Channels.SocketConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
       at System.ServiceModel.Channels.BufferedConnectionInitiator.Connect(Uri uri, TimeSpan timeout)
       at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)
       at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpen(TimeSpan timeout)
       at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannel.OnOpen(TimeSpan timeout)
       at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannel.CallOpenOnce.System.ServiceModel.Channels.ServiceChannel.ICallOnce.Call(ServiceChannel channel, TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)
       at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
       at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

    Exception rethrown at [0]: 
       at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
       at Microsoft.Hpc.Scheduler.Store.ISchedulerStoreInternal.Register(String clientSource, String userName, ConnectionRole role, Version clientVersion, ConnectionToken& token, UserPrivilege& privilege, Version& serverVersion, Dictionary`2& serverProps)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.RegisterWithServer()
       at Microsoft.Hpc.Scheduler.Store.StoreServer.RegisterEvent(String schedulerNode)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.<ConnectWcfAsync>d__39.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.<InternalConnectAsync>d__38.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.<ConnectAsync>d__34.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.<InitializeAsync>d__41.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.<RemoteConnectAsync>d__4.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStore.<ConnectAsync>d__1.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStore.<ConnectAsync>d__0.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Admin.ConnectionManager.ConnectScheduler(Object sender)


    Sunday, March 3, 2019 9:52 PM

All replies

  • After another reboot, the boot sector of the operating system SSD is corrupted and the domain controller no longer boots. Brand new SSD. Can't understand what would have corrupted the boot sector. It appears the automatic updates downloaded something that really upset the system configuration. Unfortunately, right now it appears the whole system needs to be reinstalled from scratch and we will be replacing the hard drive just to be safe.

    Monday, March 4, 2019 1:37 PM
  • Hi Nate Hayes,

    Generally, we don't recommend using an AD domain controller as the head node in production. The head node can be heavily loaded, which could affect AD functions.

    From the error, it looks like the scheduler service did not open its service endpoints successfully even though it is running. The scheduler service logs need to be checked to investigate further.

    Regards,

    Yutong Sun

    Tuesday, March 5, 2019 7:55 AM
    Moderator
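
    The "actively refused" error (TCP 10061) in the question, combined with a service that reports as running, is consistent with nothing listening on port 5802. As a rough illustration only, and not part of HPC Pack, a small Python sketch like the one below could tell a refused connection apart from a timeout or a name-resolution failure. The function name `probe_tcp` is hypothetical; the host name and port in the trailing comment come from the error message in the question.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify the outcome of a single TCP connection attempt to host:port."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on the port (WinSock 10061).
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets.
        return "timeout"
    except socket.gaierror:
        # The host name could not be resolved.
        return "unresolved"
    except OSError as exc:
        # Any other network-level failure.
        return f"error: {exc}"


# Example (host and port taken from the error message in the question):
# print(probe_tcp("mtc-dc", 5802))
```

    A result of "refused" while the service shows as started would point toward the endpoint not being opened, which is where the scheduler service logs come in. A "timeout" would instead suggest looking at firewall rules.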
  • Hi Yutong,

    Thanks for the reply. It's a development lab environment with high security clearance for just a few administrators/users. Further diagnostics confirmed that the SSD the OS was installed on had failed. We're adding four new drives configured in RAID 1 for redundancy and then going to do a clean install of the OS and HPC Pack. We'll see if there are any further troubles at that time, but right now it appears this was definitely a hardware failure.

    Wednesday, March 6, 2019 2:42 PM
  • Yutong, one question: we had already activated the license keys before the drive failure. After we do the clean install, are there going to be any issues re-activating Windows Server 2016 using the same keys?
    Wednesday, March 6, 2019 7:36 PM
  • This is a Windows license activation question. I suppose you can reuse the keys; if you can't, you can call Microsoft Support about it.

    Qiufang Shi

    Friday, March 8, 2019 7:48 AM