Unable to connect to the HPC Manager or job scheduler

Question
-
I am unable to open the manager or the job scheduler, either as a client or on one of the 3 servers. I am not sure what is going on, but it seems like the entire HPC cluster is down.
In the event logs I am getting a lot of warnings with EventID 23041.
help!!!!
Darren
Wednesday, March 27, 2019 8:19 PM
All replies
-
I also see an error with EventID 23040: applicationHostProxy: openProcess:errorcode=E_fail,parentID=6704. fabric will not monitor apphost {some GUID}
Wednesday, March 27, 2019 8:21 PM
-
Could you open a PowerShell console on the head node as administrator, and run the following commands?
Connect-ServiceFabricCluster
Get-ServiceFabricClusterHealth
Get-ServiceFabricApplicationHealth -ApplicationName fabric:/HpcApplication
Thursday, March 28, 2019 5:38 AM -
Well, here is my log info, but now I can connect for some reason. I did run Windows updates and rebooted all servers, then had to do something else and forgot about it until now.
AggregatedHealthState : Warning
UnhealthyEvaluations :
Unhealthy event: SourceId='System.UpgradeOrchestrationService',
Property='ClusterVersionSupport', HealthState='Warning', ConsiderWarningAsError=false.
The current cluster version 6.3.176.9494 support ends 3/31/2019 12:00:00 AM. Please view
available upgrades using Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
Start-ServiceFabricClusterUpgrade.
NodeHealthStates :
NodeName : Thor3
AggregatedHealthState : Ok
NodeName : Thor2
AggregatedHealthState : Ok
NodeName : Thor1
AggregatedHealthState : Ok
ApplicationHealthStates :
ApplicationName : fabric:/HpcApplication
AggregatedHealthState : Ok
ApplicationName : fabric:/System
AggregatedHealthState : Ok
HealthEvents :
SourceId : System.UpgradeOrchestrationService
Property : ClusterVersionSupport
HealthState : Warning
SequenceNumber : 131982156011361433
SentAt : 3/28/2019 3:00:01 AM
ReceivedAt : 3/28/2019 3:00:31 AM
TTL : 2.00:00:00
Description : The current cluster version 6.3.176.9494 support ends 3/31/2019
12:00:00 AM. Please view available upgrades using
Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
Start-ServiceFabricClusterUpgrade.
RemoveWhenExpired : True
IsExpired : False
Transitions : Ok->Warning = 3/27/2019 7:25:52 PM, LastError = 1/1/0001 12:00:00 AM
HealthStatistics :
Node : 3 Ok, 0 Warning, 0 Error
Replica : 42 Ok, 0 Warning, 0 Error
Partition : 14 Ok, 0 Warning, 0 Error
Service : 12 Ok, 0 Warning, 0 Error
DeployedServicePackage : 36 Ok, 0 Warning, 0 Error
DeployedApplication : 3 Ok, 0 Warning, 0 Error
Application : 1 Ok, 0 Warning, 0 Error
ApplicationName : fabric:/HpcApplication
AggregatedHealthState : Ok
ServiceHealthStates :
ServiceName : fabric:/HpcApplication/HpcNamingService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/HpcReportingStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/SessionLauncherStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/ManagementStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/ManagementStatelessService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/SdmStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/MonitoringStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/SchedulerStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/WebStatelessService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/DiagnosticsStatefulService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/FrontendStatelessService
AggregatedHealthState : Ok
ServiceName : fabric:/HpcApplication/AzureCommunicatorStatefulService
AggregatedHealthState : Ok
DeployedApplicationHealthStates :
ApplicationName : fabric:/HpcApplication
NodeName : Thor3
AggregatedHealthState : Ok
ApplicationName : fabric:/HpcApplication
NodeName : Thor2
AggregatedHealthState : Ok
ApplicationName : fabric:/HpcApplication
NodeName : Thor1
AggregatedHealthState : Ok
HealthEvents :
SourceId : System.CM
Property : State
HealthState : Ok
SequenceNumber : 130
SentAt : 10/14/2018 1:37:15 AM
ReceivedAt : 3/27/2019 7:43:30 PM
TTL : Infinite
Description : Application has been created.
RemoveWhenExpired : False
IsExpired : False
Transitions : Warning->Ok = 10/14/2018 1:37:15 AM, LastError = 1/1/0001
12:00:00 AM
HealthStatistics :
Replica : 42 Ok, 0 Warning, 0 Error
Partition : 14 Ok, 0 Warning, 0 Error
Service : 12 Ok, 0 Warning, 0 Error
DeployedServicePackage : 36 Ok, 0 Warning, 0 Error
DeployedApplication : 3 Ok, 0 Warning, 0 Error
Thursday, March 28, 2019 11:51 AM -
All HPC services look good. What error message do you get when you fail to connect with Cluster Manager or Job Manager?
Friday, March 29, 2019 2:19 AM
-
Only one time did it pop up with an error, and it was a long one. I did not write that one down. The rest of the time it just came up with the screen to enter the server's name, like it could not find the servers.
Friday, March 29, 2019 1:20 PM
-
First, kill any existing HpcClusterManager.exe or HpcJobManager.exe process with Task Manager. Then reopen it and enter the server names; if it fails to connect, it will eventually show an error message.
You can also share the log files at %localappdata%\Microsoft\Hpc\LogFiles\ClusterManager\HpcClusterManager_<index>.bin
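To find which log files to share, the log folder can be listed newest first. This is only a minimal sketch, assuming the files follow the HpcClusterManager_<index>.bin pattern given above; the function name `cluster_manager_logs` is hypothetical and not part of HPC Pack.

```python
import os
from pathlib import Path


def cluster_manager_logs(local_appdata=None):
    """Return HpcClusterManager_*.bin files, newest first."""
    # Default to the LOCALAPPDATA environment variable (%localappdata%).
    base = Path(local_appdata or os.environ.get("LOCALAPPDATA", ""))
    log_dir = base / "Microsoft" / "Hpc" / "LogFiles" / "ClusterManager"
    if not log_dir.is_dir():
        return []
    logs = log_dir.glob("HpcClusterManager_*.bin")
    # Sort by last-modified time so the most recent log comes first.
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in cluster_manager_logs():
        print(path)
```

Run it on the machine where the connection failed; the first path printed is the most recently written log.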
Wednesday, April 3, 2019 8:49 AM -
So it finally happened again.
Here is the error info:
The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.ServiceModel.FaultException: The attempt to read or update the store failed. Login failed for user 'UAH\THOR1$'. , inner exception: Login failed for user 'UAH\THOR1$'.
Server stack trace:
at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc)
at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.SystemDefinitionModel.Store.IRemoteSdmStore.QueryDocuments(String filter)
at Microsoft.SystemDefinitionModel.Store.SdmRemoteStore.<>c__DisplayClass45_0.<QueryDocuments>b__0()
at Microsoft.SystemDefinitionModel.Store.SdmRetry.<>c__DisplayClass4_0`1.<InvokeWithRetry>b__0()
at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
--- End of inner exception stack trace ---
at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.SystemDefinitionModel.Store.SdmRetry.InvokeWithRetry[T](Func`1 function, CancellationToken cancellationToken)
at Microsoft.SystemDefinitionModel.ModelDocuments.LoadAllDocuments()
at Microsoft.SystemDefinitionModel.DefinitionSpaceView.TryResolve(String simpleName)
at Microsoft.SystemDefinitionModel.ModelDocuments..ctor(Model model)
at Microsoft.SystemDefinitionModel.Model.InitializeModel()
at Microsoft.ComputeCluster.Management.ClusterManager.ConnectCore(Model model)
at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Hpc.RetryManager.<>c__DisplayClass34_0.<<InvokeWithRetryAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__34.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__90.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsyncWithModel>d__89.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__87.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ComputeCluster.Admin.ConnectionManager.<ConnectManagementService>b__10_0(Task`1 t)
Now that I know a little bit about the commands you suggested, I am now seeing a health of:
AggregatedHealthState : Error
UnhealthyEvaluations :
Unhealthy applications: 100% (1/1), MaxPercentUnhealthyApplications=0%.
Unhealthy application: ApplicationName='fabric:/HpcApplication',
AggregatedHealthState='Error'.
Unhealthy services: 100% (1/1), ServiceType='AzureCommunicatorStatefulServiceType',
MaxPercentUnhealthyServices=0%.
Unhealthy service: ServiceName='fabric:/HpcApplication/AzureCommunicatorStatefulService',
AggregatedHealthState='Error'.
Unhealthy partitions: 100% (3/3), MaxPercentUnhealthyPartitionsPerService=0%.
Unhealthy partition: PartitionId='3615f6ba-cb42-4520-b1b5-7ed7f6ac67a3',
AggregatedHealthState='Error'.
Error event: SourceId='System.FM', Property='State'.
Partition is in quorum loss.
fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2
3615f6ba-cb42-4520-b1b5-7ed7f6ac67a3
S/S Down Thor1 131987888679651288
I/I Down Thor3 131987790403358949
P/P InBuild Thor2 131981801607562483
For more information see: http://aka.ms/sfhealth
Unhealthy partition: PartitionId='2f3b895a-cdb8-4100-a3e9-937074d7867a',
AggregatedHealthState='Error'.
Error event: SourceId='System.FM', Property='State'.
Partition is in quorum loss.
fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2
2f3b895a-cdb8-4100-a3e9-937074d7867a
I/I Standby Thor2 131981815614911389
S/P Down Thor3 131987790403358949
P/S Down Thor1 131981676678782410
For more information see: http://aka.ms/sfhealth
Unhealthy partition: PartitionId='e37fa45f-99cb-4935-a895-f1066af35a43',
AggregatedHealthState='Error'.
Error event: SourceId='System.FM', Property='State'.
Partition is in quorum loss.
fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2
e37fa45f-99cb-4935-a895-f1066af35a43
P/S Down Thor3 131987729059072931
S/P Ready Thor2 131981815614911389
I/I Down Thor1 131987888679495318
For more information see: http://aka.ms/sfhealth
NodeHealthStates :
NodeName : Thor3
AggregatedHealthState : Ok
NodeName : Thor2
AggregatedHealthState : Ok
NodeName : Thor1
AggregatedHealthState : Ok
ApplicationHealthStates :
ApplicationName : fabric:/HpcApplication
AggregatedHealthState : Error
ApplicationName : fabric:/System
AggregatedHealthState : Ok
HealthEvents :
SourceId : System.UpgradeOrchestrationService
Property : ClusterVersionSupport
HealthState : Warning
SequenceNumber : 131987340045440695
SentAt : 4/3/2019 3:00:04 AM
ReceivedAt : 4/3/2019 3:00:34 AM
TTL : 2.00:00:00
Description : The current cluster version 6.3.176.9494 support ends 3/31/2019
12:00:00 AM. Please view available upgrades using
Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
Start-ServiceFabricClusterUpgrade.
RemoveWhenExpired : True
IsExpired : False
Transitions : Ok->Warning = 3/27/2019 7:25:52 PM, LastError = 1/1/0001 12:00:00 AM
HealthStatistics :
Node : 3 Ok, 0 Warning, 0 Error
Replica : 39 Ok, 0 Warning, 0 Error
Partition : 11 Ok, 0 Warning, 3 Error
Service : 11 Ok, 0 Warning, 1 Error
DeployedServicePackage : 33 Ok, 1 Warning, 2 Error
DeployedApplication : 0 Ok, 1 Warning, 2 Error
Application : 0 Ok, 0 Warning, 1 Error
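The replica lines in the quorum-loss report above can be tallied per node to see where the Down replicas sit. A minimal sketch, assuming each replica line has the shape shown in the output (a role pair, a state, a node name, and a numeric ID); the function name `replica_states_by_node` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches replica lines such as "S/S Down Thor1 131987888679651288".
REPLICA_LINE = re.compile(r"^\s*([A-Z])/([A-Z])\s+(\w+)\s+(\S+)\s+(\d+)\s*$")


def replica_states_by_node(health_text):
    """Group the replica states found in a health report by node name."""
    states = defaultdict(list)
    for line in health_text.splitlines():
        match = REPLICA_LINE.match(line)
        if match:
            _role_a, _role_b, state, node, _replica_id = match.groups()
            states[node].append(state)
    return dict(states)


# Replica lines copied from the three unhealthy partitions above.
sample = """\
S/S Down Thor1 131987888679651288
I/I Down Thor3 131987790403358949
P/P InBuild Thor2 131981801607562483
I/I Standby Thor2 131981815614911389
S/P Down Thor3 131987790403358949
P/S Down Thor1 131981676678782410
P/S Down Thor3 131987729059072931
S/P Ready Thor2 131981815614911389
I/I Down Thor1 131987888679495318
"""

for node, states in sorted(replica_states_by_node(sample).items()):
    print(node, states)
```

On this data, every AzureCommunicatorStatefulService replica on Thor1 and Thor3 is Down, and only Thor2 has replicas in other states (InBuild, Standby, Ready).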
Wednesday, April 3, 2019 9:26 PM -
Now how do I share the .bin file with you?
Wednesday, April 3, 2019 9:26 PM
-
I saw the error message "Login failed for user 'UAH\THOR1$'." Could you check whether this user is authorized to access the HpcManagement database?
You can send the log files to hpcpack@microsoft.com
- Edited by Sunbin Zhu Thursday, April 4, 2019 3:29 AM
- Marked as answer by DCSpooner Thursday, April 4, 2019 11:27 AM
Thursday, April 4, 2019 3:21 AM -
Bingo,
I think I have found it. This environment is set up on SQL Server Always On, and the second SQL server does not have the logins.
So I have to find a way to sync the logins between the servers. Do you have any suggestions?
Thank you so much for your help!!!
Thursday, April 4, 2019 11:26 AM -
We are not SQL Server experts. I just searched online, and here are some links that may help (note: I have not tried them; they are just for your reference):
https://social.msdn.microsoft.com/Forums/en-US/f32c1ed2-f3f1-4bb4-9764-b4760a3f7422/sync-users-among-databases-in-always-on?forum=sqldisasterrecovery
https://blog.sqlauthority.com/2017/11/30/sql-server-alwayson-availability-groups-script-sync-logins-replicas/
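For Windows logins such as the machine account UAH\THOR1$, one possible approach is to compare the login lists of the two replicas and generate CREATE LOGIN ... FROM WINDOWS statements for whatever the secondary lacks. The sketch below is untested against a real availability group. The helper names are hypothetical, and the THOR2$ and THOR3$ accounts are assumed for illustration; only UAH\THOR1$ appears in the error above. SQL-authenticated logins are not covered, since they need their SID and password hash carried over, which is what the linked scripts deal with.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value):
    """Quote a Unicode string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def missing_windows_logins(primary, secondary):
    """Return logins on the primary that the secondary lacks (case-insensitive)."""
    have = {name.lower() for name in secondary}
    return sorted(name for name in set(primary) if name.lower() not in have)


def create_login_script(logins):
    """Build a T-SQL script that creates each Windows login if it is absent."""
    statements = []
    for name in logins:
        statements.append(
            "IF NOT EXISTS (SELECT 1 FROM sys.server_principals WHERE name = "
            + quote_literal(name) + ")\n"
            + "    CREATE LOGIN " + quote_ident(name) + " FROM WINDOWS;"
        )
    return "\n".join(statements)


# Hypothetical login lists; in practice, read sys.server_principals on each replica.
primary = ["UAH\\THOR1$", "UAH\\THOR2$", "UAH\\THOR3$"]
secondary = ["UAH\\THOR2$"]

print(create_login_script(missing_windows_logins(primary, secondary)))
```

The generated script would be reviewed and then run on the secondary replica. It only creates the logins; any server-level roles or permissions the HPC services need would still have to be granted separately.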
- Edited by Sunbin Zhu Monday, April 8, 2019 1:39 AM
Monday, April 8, 2019 1:38 AM