Unable to connect to the HPC Manager or job scheduler

  • Question

  • I am unable to open the manager or the job scheduler, either from a client or on any of the 3 servers. I am not sure what is going on, but it seems like the entire HPC cluster is down.

    In the event logs I am getting a lot of warnings with EventID 23041:

    help!!!!

    Darren

    Wednesday, March 27, 2019 8:19 PM

All replies

  • I see an error of EventID 23040: applicationHostProxy: openProcess:errorcode=E_fail,parentID=6704. fabric will not monitor apphost {some GUID} 
    Wednesday, March 27, 2019 8:21 PM
  • Could you open a PowerShell console on the head node as administrator, and run the following commands?

    Connect-ServiceFabricCluster

    Get-ServiceFabricClusterHealth

    Get-ServiceFabricApplicationHealth -ApplicationName fabric:/HpcApplication

    Thursday, March 28, 2019 5:38 AM
  • Well, here is my log info, but now I can connect for some reason. I ran Windows updates and rebooted all the servers, then had to do something else and forgot about this until now.

    AggregatedHealthState   : Warning
    UnhealthyEvaluations    :
                              Unhealthy event: SourceId='System.UpgradeOrchestrationService',
                              Property='ClusterVersionSupport', HealthState='Warning', ConsiderWarningAsError=false.
                              The current cluster version 6.3.176.9494 support ends 3/31/2019 12:00:00 AM. Please view
                              available upgrades using Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
                              Start-ServiceFabricClusterUpgrade.
                             
    NodeHealthStates        :
                              NodeName              : Thor3
                              AggregatedHealthState : Ok
                             
                              NodeName              : Thor2
                              AggregatedHealthState : Ok
                             
                              NodeName              : Thor1
                              AggregatedHealthState : Ok
                             
    ApplicationHealthStates :
                              ApplicationName       : fabric:/HpcApplication
                              AggregatedHealthState : Ok
                             
                              ApplicationName       : fabric:/System
                              AggregatedHealthState : Ok
                             
    HealthEvents            :
                              SourceId              : System.UpgradeOrchestrationService
                              Property              : ClusterVersionSupport
                              HealthState           : Warning
                              SequenceNumber        : 131982156011361433
                              SentAt                : 3/28/2019 3:00:01 AM
                              ReceivedAt            : 3/28/2019 3:00:31 AM
                              TTL                   : 2.00:00:00
                              Description           : The current cluster version 6.3.176.9494 support ends 3/31/2019
                              12:00:00 AM. Please view available upgrades using
                              Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
                              Start-ServiceFabricClusterUpgrade.
                              RemoveWhenExpired     : True
                              IsExpired             : False
                              Transitions           : Ok->Warning = 3/27/2019 7:25:52 PM, LastError = 1/1/0001 12:00:00 AM
                             
    HealthStatistics        :
                              Node                  : 3 Ok, 0 Warning, 0 Error
                              Replica               : 42 Ok, 0 Warning, 0 Error
                              Partition             : 14 Ok, 0 Warning, 0 Error
                              Service               : 12 Ok, 0 Warning, 0 Error
                              DeployedServicePackage : 36 Ok, 0 Warning, 0 Error
                              DeployedApplication   : 3 Ok, 0 Warning, 0 Error
                              Application           : 1 Ok, 0 Warning, 0 Error
                             
    ApplicationName                 : fabric:/HpcApplication
    AggregatedHealthState           : Ok
    ServiceHealthStates             :
                                      ServiceName           : fabric:/HpcApplication/HpcNamingService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/HpcReportingStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/SessionLauncherStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/ManagementStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/ManagementStatelessService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/SdmStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/MonitoringStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/SchedulerStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/WebStatelessService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/DiagnosticsStatefulService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/FrontendStatelessService
                                      AggregatedHealthState : Ok
                                     
                                      ServiceName           : fabric:/HpcApplication/AzureCommunicatorStatefulService
                                      AggregatedHealthState : Ok
                                     
    DeployedApplicationHealthStates :
                                      ApplicationName       : fabric:/HpcApplication
                                      NodeName              : Thor3
                                      AggregatedHealthState : Ok
                                     
                                      ApplicationName       : fabric:/HpcApplication
                                      NodeName              : Thor2
                                      AggregatedHealthState : Ok
                                     
                                      ApplicationName       : fabric:/HpcApplication
                                      NodeName              : Thor1
                                      AggregatedHealthState : Ok
                                     
    HealthEvents                    :
                                      SourceId              : System.CM
                                      Property              : State
                                      HealthState           : Ok
                                      SequenceNumber        : 130
                                      SentAt                : 10/14/2018 1:37:15 AM
                                      ReceivedAt            : 3/27/2019 7:43:30 PM
                                      TTL                   : Infinite
                                      Description           : Application has been created.
                                      RemoveWhenExpired     : False
                                      IsExpired             : False
                                      Transitions           : Warning->Ok = 10/14/2018 1:37:15 AM, LastError = 1/1/0001
                                      12:00:00 AM
                                     
    HealthStatistics                :
                                      Replica               : 42 Ok, 0 Warning, 0 Error
                                      Partition             : 14 Ok, 0 Warning, 0 Error
                                      Service               : 12 Ok, 0 Warning, 0 Error
                                      DeployedServicePackage : 36 Ok, 0 Warning, 0 Error
                                      DeployedApplication   : 3 Ok, 0 Warning, 0 Error
                                     

    Thursday, March 28, 2019 11:51 AM
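
A report this long is easier to check mechanically than by eye. The following is a minimal Python sketch, not part of HPC Pack or Service Fabric, that scans text in the layout shown above (as printed by `Get-ServiceFabricClusterHealth` and `Get-ServiceFabricApplicationHealth`) and lists every named entity whose `AggregatedHealthState` is not `Ok`. The function name `non_ok_entries` is hypothetical, and the parsing assumes the `Name : value` layout seen in this thread.

```python
import re

# Lines that name an entity, e.g. "NodeName              : Thor3"
NAME_LINE = re.compile(r"\s*(ServiceName|NodeName|ApplicationName)\s*:\s*(\S+)")
# Lines that give its state, e.g. "AggregatedHealthState : Ok"
STATE_LINE = re.compile(r"\s*AggregatedHealthState\s*:\s*(\w+)")


def non_ok_entries(report):
    """Return (names, state) for each named entity that is not Ok.

    `names` is a tuple because a deployed application is identified by
    both an ApplicationName and a NodeName line before its state line.
    """
    results = []
    names = []  # name lines seen since the last state line
    for line in report.splitlines():
        name = NAME_LINE.match(line)
        if name:
            names.append(name.group(2))
            continue
        state = STATE_LINE.match(line)
        if state:
            # A state line with no preceding name is the top-level
            # summary; it is skipped here.
            if names and state.group(1) != "Ok":
                results.append((tuple(names), state.group(1)))
            names = []
    return results
```

Saving the cmdlet output to a text file and passing its contents to `non_ok_entries` would return an empty list for the report above, since the only non-Ok state there is the top-level cluster warning, which has no name line in front of it.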
  • All HPC services look good. What error message do you get when you fail to connect with Cluster Manager or Job Manager?
    Friday, March 29, 2019 2:19 AM
  • Only one time did it pop up with an error, and it was a long one; I did not write it down. The rest of the time it just came up with the screen to enter the server's name, as if it could not find the servers.
    Friday, March 29, 2019 1:20 PM
  • First, kill any existing HpcClusterManager.exe/HpcJobManager.exe processes with Task Manager. Then open it again and enter the server names; if it fails to connect, it will eventually pop up an error message.

    And you can share the log files at %localappdata%\Microsoft\Hpc\LogFiles\ClusterManager\HpcClusterManager_<index>.bin

    Wednesday, April 3, 2019 8:49 AM
  • So it finally happened again.

    Here is the error info:

    The connection to the management service failed. detail error: Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.ServiceModel.FaultException: The attempt to read or update the store failed. Login failed for user 'UAH\THOR1$'. , inner exception: Login failed for user 'UAH\THOR1$'.

    Server stack trace: 
       at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc)
       at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
       at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
       at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

    Exception rethrown at [0]: 
       at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
       at Microsoft.SystemDefinitionModel.Store.IRemoteSdmStore.QueryDocuments(String filter)
       at Microsoft.SystemDefinitionModel.Store.SdmRemoteStore.<>c__DisplayClass45_0.<QueryDocuments>b__0()
       at Microsoft.SystemDefinitionModel.Store.SdmRetry.<>c__DisplayClass4_0`1.<InvokeWithRetry>b__0()
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
       --- End of inner exception stack trace ---
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.SystemDefinitionModel.Store.SdmRetry.InvokeWithRetry[T](Func`1 function, CancellationToken cancellationToken)
       at Microsoft.SystemDefinitionModel.ModelDocuments.LoadAllDocuments()
       at Microsoft.SystemDefinitionModel.DefinitionSpaceView.TryResolve(String simpleName)
       at Microsoft.SystemDefinitionModel.ModelDocuments..ctor(Model model)
       at Microsoft.SystemDefinitionModel.Model.InitializeModel()
       at Microsoft.ComputeCluster.Management.ClusterManager.ConnectCore(Model model)
       at System.Threading.Tasks.Task.Execute()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<>c__DisplayClass34_0.<<InvokeWithRetryAsync>b__0>d.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__33`1.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.RetryManager.<InvokeWithRetryAsync>d__34.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__90.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsyncWithModel>d__89.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ClusterManager.<ConnectAsync>d__87.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Admin.ConnectionManager.<ConnectManagementService>b__10_0(Task`1 t)

    Now that I know a little about the commands you suggested, I am seeing this health state:



    AggregatedHealthState   : Error
    UnhealthyEvaluations    : 
                              Unhealthy applications: 100% (1/1), MaxPercentUnhealthyApplications=0%.
                              
                              Unhealthy application: ApplicationName='fabric:/HpcApplication', 
                              AggregatedHealthState='Error'.
                              
                              Unhealthy services: 100% (1/1), ServiceType='AzureCommunicatorStatefulServiceType', 
                              MaxPercentUnhealthyServices=0%.
                              
                              Unhealthy service: ServiceName='fabric:/HpcApplication/AzureCommunicatorStatefulService', 
                              AggregatedHealthState='Error'.
                              
                              Unhealthy partitions: 100% (3/3), MaxPercentUnhealthyPartitionsPerService=0%.
                              
                              Unhealthy partition: PartitionId='3615f6ba-cb42-4520-b1b5-7ed7f6ac67a3', 
                              AggregatedHealthState='Error'.
                              
                              Error event: SourceId='System.FM', Property='State'.
                              Partition is in quorum loss.
                              fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2 
                              3615f6ba-cb42-4520-b1b5-7ed7f6ac67a3
                                S/S Down Thor1 131987888679651288
                                I/I Down Thor3 131987790403358949
                                P/P InBuild Thor2 131981801607562483
                              
                              For more information see: http://aka.ms/sfhealth
                              
                              Unhealthy partition: PartitionId='2f3b895a-cdb8-4100-a3e9-937074d7867a', 
                              AggregatedHealthState='Error'.
                              
                              Error event: SourceId='System.FM', Property='State'.
                              Partition is in quorum loss.
                              fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2 
                              2f3b895a-cdb8-4100-a3e9-937074d7867a
                                I/I Standby Thor2 131981815614911389
                                S/P Down Thor3 131987790403358949
                                P/S Down Thor1 131981676678782410
                              
                              For more information see: http://aka.ms/sfhealth
                              
                              Unhealthy partition: PartitionId='e37fa45f-99cb-4935-a895-f1066af35a43', 
                              AggregatedHealthState='Error'.
                              
                              Error event: SourceId='System.FM', Property='State'.
                              Partition is in quorum loss.
                              fabric:/HpcApplication/AzureCommunicatorStatefulService 3 2 
                              e37fa45f-99cb-4935-a895-f1066af35a43
                                P/S Down Thor3 131987729059072931
                                S/P Ready Thor2 131981815614911389
                                I/I Down Thor1 131987888679495318
                              
                              For more information see: http://aka.ms/sfhealth
                              
                              
    NodeHealthStates        : 
                              NodeName              : Thor3
                              AggregatedHealthState : Ok
                              
                              NodeName              : Thor2
                              AggregatedHealthState : Ok
                              
                              NodeName              : Thor1
                              AggregatedHealthState : Ok
                              
    ApplicationHealthStates : 
                              ApplicationName       : fabric:/HpcApplication
                              AggregatedHealthState : Error
                              
                              ApplicationName       : fabric:/System
                              AggregatedHealthState : Ok
                              
    HealthEvents            : 
                              SourceId              : System.UpgradeOrchestrationService
                              Property              : ClusterVersionSupport
                              HealthState           : Warning
                              SequenceNumber        : 131987340045440695
                              SentAt                : 4/3/2019 3:00:04 AM
                              ReceivedAt            : 4/3/2019 3:00:34 AM
                              TTL                   : 2.00:00:00
                              Description           : The current cluster version 6.3.176.9494 support ends 3/31/2019 
                              12:00:00 AM. Please view available upgrades using 
                              Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using 
                              Start-ServiceFabricClusterUpgrade.
                              RemoveWhenExpired     : True
                              IsExpired             : False
                              Transitions           : Ok->Warning = 3/27/2019 7:25:52 PM, LastError = 1/1/0001 12:00:00 AM
                              
    HealthStatistics        : 
                              Node                  : 3 Ok, 0 Warning, 0 Error
                              Replica               : 39 Ok, 0 Warning, 0 Error
                              Partition             : 11 Ok, 0 Warning, 3 Error
                              Service               : 11 Ok, 0 Warning, 1 Error
                              DeployedServicePackage : 33 Ok, 1 Warning, 2 Error
                              DeployedApplication   : 0 Ok, 1 Warning, 2 Error
                              Application           : 0 Ok, 0 Warning, 1 Error
                              


    Wednesday, April 3, 2019 9:26 PM
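
The "Partition is in quorum loss" events above list each replica with its state and node. As a rough illustration, the Python sketch below groups those replica lines by partition and counts how many are `Ready`. It assumes each replica line has the form `previous role/current role, state, node, replica id`, and it uses a simple majority of the listed replicas as the threshold. Service Fabric's real quorum calculation depends on the replica set configuration, so treat this only as a reading aid. All function names are hypothetical.

```python
import re
from collections import defaultdict

# A partition id on a line of its own
GUID = re.compile(r"^\s*([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s*$")
# e.g. "S/S Down Thor1 131987888679651288"
REPLICA = re.compile(r"^\s*([A-Z])/([A-Z])\s+(\w+)\s+(\S+)\s+(\d+)\s*$")

# Replica lines copied from the health report in this thread
SAMPLE = """
3615f6ba-cb42-4520-b1b5-7ed7f6ac67a3
  S/S Down Thor1 131987888679651288
  I/I Down Thor3 131987790403358949
  P/P InBuild Thor2 131981801607562483
2f3b895a-cdb8-4100-a3e9-937074d7867a
  I/I Standby Thor2 131981815614911389
  S/P Down Thor3 131987790403358949
  P/S Down Thor1 131981676678782410
e37fa45f-99cb-4935-a895-f1066af35a43
  P/S Down Thor3 131987729059072931
  S/P Ready Thor2 131981815614911389
  I/I Down Thor1 131987888679495318
"""


def replicas_by_partition(report):
    """Map partition id -> list of (node, current_role, state)."""
    partitions = defaultdict(list)
    current = None
    for line in report.splitlines():
        guid = GUID.match(line)
        if guid:
            current = guid.group(1)
            continue
        replica = REPLICA.match(line)
        if replica and current:
            _prev_role, cur_role, state, node, _replica_id = replica.groups()
            partitions[current].append((node, cur_role, state))
    return dict(partitions)


def ready_count(replicas):
    """Number of replicas whose state is Ready."""
    return sum(1 for _node, _role, state in replicas if state == "Ready")


def majority(n):
    """Simple majority of n replicas (an assumption, see lead-in)."""
    return n // 2 + 1


for pid, replicas in replicas_by_partition(SAMPLE).items():
    print(pid, ready_count(replicas), "ready of", len(replicas),
          "need", majority(len(replicas)))
```

For the three partitions reported here, at most one replica is `Ready` in any partition, which is below a majority of three and consistent with the quorum-loss errors.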
  • Now, how do I share the .bin file with you?
    Wednesday, April 3, 2019 9:26 PM
  • I saw the error message "Login failed for user 'UAH\THOR1$'." Could you check whether this user is authorized to access the HpcManagement database?

    For the log files, you can send them to hpcpack@microsoft.com

    • Edited by Sunbin Zhu Thursday, April 4, 2019 3:29 AM
    • Marked as answer by DCSpooner Thursday, April 4, 2019 11:27 AM
    Thursday, April 4, 2019 3:21 AM
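
The key detail in that long exception is the account name. A small Python sketch, with a hypothetical helper name, shows how to pull it out of the message and check whether it is a machine account. By Active Directory convention, computer account names end in `$`, so `UAH\THOR1$` is the account of the machine Thor1 in the domain `UAH` rather than an interactive user.

```python
import re


def failed_login(message):
    """Extract the account from a SQL Server 'Login failed' message.

    Returns None when the message does not contain such an error.
    """
    match = re.search(r"Login failed for user '([^']+)'", message)
    if not match:
        return None
    account = match.group(1)
    # Split DOMAIN\name; domain is empty if there is no backslash.
    domain, _, name = account.rpartition("\\")
    return {
        "account": account,
        "domain": domain,
        "name": name,
        # Computer accounts end with '$' by convention.
        "is_machine_account": name.endswith("$"),
    }


# Message text copied from the error in this thread
error_text = (
    r"The attempt to read or update the store failed. "
    r"Login failed for user 'UAH\THOR1$'."
)
print(failed_login(error_text))
```

If the result shows a machine account, the login to look for on the SQL Server side is that computer account, not the account of the person running HPC Cluster Manager.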
  • Bingo, 

    I think I have found it. I have this environment set up in an Always On SQL environment, and the second SQL server does not have the logins in the security DB.

    So I have to find a way to sync the security DB between the servers. Do you have any suggestions?

    Thank you so much for your help!!!

    Thursday, April 4, 2019 11:26 AM
  • We are not SQL Server experts. I just searched online; here are some links that may help (note: I didn't try them, they are just for your reference).

    https://social.msdn.microsoft.com/Forums/en-US/f32c1ed2-f3f1-4bb4-9764-b4760a3f7422/sync-users-among-databases-in-always-on?forum=sqldisasterrecovery

    https://blog.sqlauthority.com/2017/11/30/sql-server-alwayson-availability-groups-script-sync-logins-replicas/


    • Edited by Sunbin Zhu Monday, April 8, 2019 1:39 AM
    Monday, April 8, 2019 1:38 AM
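
Logins are server-level principals stored on each SQL Server instance, so they do not travel with the databases in an availability group and have to be created on every replica. For Windows accounts this is usually simple, because the SID comes from the domain and is the same on every server. The Python sketch below only generates T-SQL text for review; it does not connect to any server. `UAH\THOR1$` is taken from the error above, while `UAH\THOR2$` and `UAH\THOR3$` are assumed names for the other two head nodes. Database roles and permissions in the HPC databases are not covered.

```python
def create_login_statements(accounts):
    """Build idempotent CREATE LOGIN statements for Windows accounts.

    The returned T-SQL is meant to be reviewed and then run on the
    replica that is missing the logins.
    """
    statements = []
    for account in accounts:
        # Escape for a bracketed identifier and for a string literal.
        quoted = "[" + account.replace("]", "]]") + "]"
        literal = account.replace("'", "''")
        statements.append(
            "IF NOT EXISTS (SELECT 1 FROM sys.server_principals "
            f"WHERE name = N'{literal}')\n"
            f"    CREATE LOGIN {quoted} FROM WINDOWS;"
        )
    return statements


# THOR1$ appears in the thread; THOR2$ and THOR3$ are assumptions.
head_node_accounts = [r"UAH\THOR1$", r"UAH\THOR2$", r"UAH\THOR3$"]

for statement in create_login_statements(head_node_accounts):
    print(statement)
    print("GO")
```

The `IF NOT EXISTS` guard makes the script safe to run on a replica that already has some of the logins.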