HPC Pack 2016-3 Headnode services will not start

  • Question

  • I cannot figure out why my services are not starting.

    HPC Web Service is stuck in the 'Starting' state.

    HPC Management Service is stopped and fails to start.

    HPC Session Service is stopped and fails to start.

    Event Log shows these errors:
    Error    2/12/2020 7:13:31 PM    Management    6106    Initialization

    Log Name:      Microsoft-HPC-Management/Admin
    Source:        Microsoft-HPC-Management
    Date:          2/12/2020 7:13:31 PM
    Event ID:      6106
    Task Category: Initialization
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      sv-cube.domain.com
    Description:
    The HPC Management Service failed to initialize correctly: Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine, inner exception: Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-HPC-Management" Guid="{C05330A2-E7F0-4730-8890-589DDD685F84}" />
        <EventID>6106</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>3</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2020-02-13T02:13:31.567672700Z" />
        <EventRecordID>70224</EventRecordID>
        <Correlation />
        <Execution ProcessID="18328" ThreadID="18360" />
        <Channel>Microsoft-HPC-Management/Admin</Channel>
        <Computer>sv-cube.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="ExceptionMessage">Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine, inner exception: Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine</Data>
      </EventData>
    </Event>

    Log Name:      System
    Source:        Service Control Manager
    Date:          2/12/2020 7:20:53 PM
    Event ID:      7023
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sv-cube.domain.com
    Description:
    The HPC Reporting Service service terminated with the following error:
    %%2148734209
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
        <EventID Qualifiers="49152">7023</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8080000000000000</Keywords>
        <TimeCreated SystemTime="2020-02-13T02:20:53.480809800Z" />
        <EventRecordID>113399</EventRecordID>
        <Correlation />
        <Execution ProcessID="2136" ThreadID="3184" />
        <Channel>System</Channel>
        <Computer>sv-cube.domain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data Name="param1">HPC Reporting Service</Data>
        <Data Name="param2">%%2148734209</Data>
        <Binary>4800700063005200650070006F007200740069006E0067000000</Binary>
      </EventData>
    </Event>

    Log Name:      Windows HPC Server
    Source:        Microsoft-HPC-Reporting
    Date:          2/12/2020 7:20:53 PM
    Event ID:      7
    Task Category: None
    Level:         Warning
    Keywords:      
    User:          SYSTEM
    Computer:      sv-cube.domain.com
    Description:
    Cannot connect to HPC SDM Service. Will retry later.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-HPC-Reporting" Guid="{CF1DAF61-3B25-49EE-A103-2031A6DA446C}" />
        <EventID>7</EventID>
        <Version>0</Version>
        <Level>3</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x2000000000000000</Keywords>
        <TimeCreated SystemTime="2020-02-13T02:20:53.481704800Z" />
        <EventRecordID>2919</EventRecordID>
        <Correlation />
        <Execution ProcessID="6232" ThreadID="18196" />
        <Channel>Windows HPC Server</Channel>
        <Computer>sv-cube.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
      </EventData>
    </Event>

    I never changed this registry key and the server has been operating in this configuration for over a year without issues.

    We also have one compute node, and all services start correctly on it.

    Any help would be greatly appreciated.


    -Scott

    Thursday, February 13, 2020 2:41 AM

All replies

  • Hi Scott,

    Judging by the error in the event log, this looks like a certificate issue. Could you double-check whether the certificate required for HPC service communication still exists in the certificate store (My, LocalMachine)? If it does not (it has probably expired or been deleted), you can follow this online doc to renew the communication certificate.

    The HPC Management Service failed to initialize correctly: Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine,
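
    To make the check concrete, here is a minimal sketch (not an official HPC Pack tool) that pulls the thumbprint, store name, and store location out of the Event 6106 XML posted above. The function names and the trimmed EVENT_XML constant are made up for illustration, and the regular expression assumes the message keeps the wording shown in this thread.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML, as seen in the posted events.
EVENT_NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Event 6106 from this thread, trimmed to the element that matters here.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="ExceptionMessage">Can not find cert with thumbprint 63EBF6D3A126DE5286C90B291832E3F8752D27E6 in store My, LocalMachine</Data>
  </EventData>
</Event>"""


def normalize_thumbprint(text):
    """Keep only hex digits and upper-case them, so that copies containing
    spaces or invisible characters still compare equal."""
    return re.sub(r"[^0-9A-Fa-f]", "", text).upper()


def missing_cert_from_event(event_xml):
    """Return thumbprint, store, and location from the ExceptionMessage,
    or None if the message does not match the wording assumed here."""
    root = ET.fromstring(event_xml)
    data = root.find("e:EventData/e:Data[@Name='ExceptionMessage']", EVENT_NS)
    if data is None or not data.text:
        return None
    match = re.search(
        r"thumbprint\s+([0-9A-Fa-f]{40})\s+in store\s+(\w+),\s*(\w+)",
        data.text,
    )
    if match is None:
        return None
    return {
        "thumbprint": normalize_thumbprint(match.group(1)),
        "store": match.group(2),
        "location": match.group(3),
    }


if __name__ == "__main__":
    print(missing_cert_from_event(EVENT_XML))
```

    The normalized thumbprint can then be compared with the thumbprints listed in the Local Machine "Personal" store on the head node, for example in the output of Get-ChildItem Cert:\LocalMachine\My in PowerShell. If it is not in that list, the certificate the service is configured to use is no longer in the store.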

    Regards,

    Yutong Sun

    Thursday, February 20, 2020 4:02 AM