Responsibilities of HPC diagnostic service

    Question

  • Dear all,

    What are the responsibilities of the HPC diagnostic service? If this service fails, what issues should we expect to encounter? I am asking because the service fails in one of our environments, and we are seeing some odd behavior: jobs are shown in the Executing state even though they are not actually executing. I will follow up on the diagnostic service failure itself in a separate thread.

    Thanks,

    Puneet


    Puneet Sharma

    Monday, March 12, 2018 14:39

All replies

  • Hi Sharma,

      A failure of the diagnostics service won't impact the rest of the system; it just means that you can't run HPC diagnostics jobs such as pingpong.

      Jobs showing the Executing state when they are not actually executing sounds like a critical issue. Please examine the logs to check what happened.


    Qiufang Shi

    Tuesday, March 13, 2018 02:16
  • Dear Qiufang,

    I ran the following test and observed that the diagnostic service crashes, after which Cluster Manager is not able to connect to it.

    HPC version: HPC Pack 2016 Update 1 (version 5.1.6086.0)

    Test Description

    • Install an HPC head node and one compute node. All HPC-related databases are created on a remote SQL Server. HPC accesses the SQL Server via SQL Server authentication. Let's call this SQL Server login "hpctest".
    • Verify that the system works fine.
    • Now go to the SQL Server and change the password of the "hpctest" login.
    • Wait 5-10 minutes for the diagnostic service to crash.
    • The diagnostic service eventually crashes with the event log entry below.

    Application: HpcDiagnostics.exe

    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: Microsoft.Hpc.Diagnostics.DiagnosticException
       at Microsoft.Hpc.Diagnostics.RemoteCallResult.ThrowIfFailed()
       at Microsoft.Hpc.Diagnostics.DiagnosticRunCollection.Init(Microsoft.Hpc.Diagnostics.DiagnosticStore, System.Collections.Generic.IEnumerable`1<Microsoft.Hpc.Diagnostics.FilterItem>, System.Collections.Generic.IEnumerable`1<Microsoft.Hpc.Diagnostics.SortItem>, Boolean)
       at Microsoft.Hpc.Diagnostics.DiagnosticStore.GetTestRunCollection(System.Collections.Generic.IEnumerable`1<Microsoft.Hpc.Diagnostics.FilterItem>, System.Collections.Generic.IEnumerable`1<Microsoft.Hpc.Diagnostics.SortItem>)
       at Microsoft.Hpc.Diagnostics.Controller.StateHandlerBase.Execute()
       at Microsoft.Hpc.Diagnostics.Controller.DiagnosticsController.RunStateHandlers(System.Object, System.Threading.CancellationToken)
       at Microsoft.Hpc.Diagnostics.Controller.DiagnosticsController+<Start>d__18.MoveNext()
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
       at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+<StartSvc>d__8.MoveNext()
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
       at DiagnosticsWinService.DiagnosticsWinService+<<OnStart>b__2_0>d.MoveNext()
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
       at System.Threading.ThreadPoolWorkQueue.Dispatch()
    • Now revert the "hpctest" password to its original, correct value.
    • Now open the Cluster Manager UI and observe the error message below.


    • Now manually start the diagnostic service and verify that the system works. 
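
    For anyone repeating the last step, here is a minimal sketch of how the service state could be checked and the service restarted from a script instead of by hand. It assumes the Windows service is named "HpcDiagnostics" (inferred from HpcDiagnostics.exe in the event log, so please confirm the name on your head node) and it relies on the standard Windows `sc query` and `sc start` commands. The sample output in the code is illustrative and was not captured from the cluster in this thread.

```python
import re
import subprocess

# Assumption: "HpcDiagnostics" is the Windows service name; confirm on your head node.
SERVICE_NAME = "HpcDiagnostics"


def parse_sc_state(sc_output):
    """Return the state word (e.g. 'RUNNING') from `sc query` output, or None."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(name=SERVICE_NAME):
    """Run `sc query <name>` (Windows only) and return the parsed state."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return parse_sc_state(result.stdout)


def ensure_running(name=SERVICE_NAME):
    """Start the service if it is not running; return the state seen before acting."""
    state = query_service_state(name)
    if state != "RUNNING":
        subprocess.run(["sc", "start", name], check=False)
    return state


# Illustrative sample of `sc query` output, not taken from the cluster in this thread.
SAMPLE = """
SERVICE_NAME: HpcDiagnostics
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 0  (0x0)
"""
```

    Calling `ensure_running()` on the head node from an elevated prompt would start the service only when `sc query` does not report it as RUNNING. The parsing is kept in its own function so it can be tried against saved output on any machine.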

    Expected result: Diagnostic service should never crash.
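
    To illustrate what I mean by "should never crash": a rough sketch of retrying a failed store query with exponential backoff, rather than letting the exception go unhandled and terminate the process. This is not how HPC Pack is implemented. `TransientStoreError`, `run_with_retry` and `make_flaky_query` are hypothetical names made up for this example, and the fake query only simulates a login that is rejected a few times before it works again.

```python
import time


class TransientStoreError(Exception):
    """Hypothetical stand-in for a recoverable failure such as a rejected SQL login."""


def run_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientStoreError wait and retry with exponential backoff.

    Returns the operation's result, or re-raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientStoreError:
            if attempt == max_attempts:
                raise
            # Delay doubles after each failed attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_query(failures_before_success):
    """Build a fake query that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def query():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientStoreError("login failed for user 'hpctest'")
        return "test runs loaded"

    return query
```

    With something along these lines, a temporarily wrong password would surface as repeated logged failures, and the service could pick up again once the login is corrected instead of needing a manual restart.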

    Thanks,

    Puneet


    Puneet Sharma

    Tuesday, March 13, 2018 20:23
  • Cluster Manager should still be able to launch even when the diagnostics service is down, with only the diagnostics pane greyed out. We will look into this improvement.

    Qiufang Shi

    Friday, March 16, 2018 05:38
  • Thanks, Qiufang. Do you have a tentative timeline for this fix? Please let me know.

    Puneet Sharma

    Friday, March 16, 2018 20:06