HPC Node Manager Service failed to start during re-install of Head node

  • Question

  • Hello,
    I recently started working with our 2012R2 Test Grid, only to find out that someone/something had uninstalled HPC from the Head Node. The cluster nodes and DB node are still there. When reinstalling HPC Pack 2012, everything goes fine (it sees and is able to connect to the databases) until it gets to "Microsoft HPC Pack 2012 R2 Server Components". It fails with the error:

    "The given service 'HPC Node Manager Service' (HpcNodeManager) failed to start. Please check event log for details"

    If I leave this error up, I can see the services listed in the Services snapin. HPC Node Manager is also there, but not running. If I try to manually start it, it errors with:

    "The HPC Node Manager Service service on Local Computer started and then stopped. Some services stop automatically if they are not in use by other services or programs"

    Neither the Event Log nor the installation logs show any error that indicates why this is failing.

    Anyone have any insight as to why this won't stay running, or if there is a way to bypass and continue the install?

    Friday, January 19, 2018 6:13 PM

All replies

  • Hi, 

    There should be a built-in tool to parse the logs: logparser.exe

    The logs should be located under %CCP_DATA%LogFiles; look for the files named after the relevant service. The log file with the highest index is an empty placeholder, so pick the file with the second-highest index to get the latest logs.
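The "second-highest index" rule above can be sketched as a small helper. This is only an illustration, not part of HPC Pack: the function name `pick_latest_log` is hypothetical, and the `<Service>_<index>.bin` naming is assumed from the `HpcSdm_000002.bin` example mentioned in this thread.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: <prefix>_<zero-padded index>.bin, e.g. HpcSdm_000002.bin
INDEX_RE = re.compile(r"^(?P<prefix>.+)_(?P<index>\d+)\.bin$", re.IGNORECASE)


def pick_latest_log(names: Iterable[str], prefix: str) -> Optional[str]:
    """Return the file with the second-highest index for the given prefix.

    The highest-index file is treated as the empty placeholder and skipped.
    Returns None if fewer than two matching files exist.
    """
    indexed = []
    for name in names:
        match = INDEX_RE.match(name)
        if match and match.group("prefix").lower() == prefix.lower():
            indexed.append((int(match.group("index")), name))
    indexed.sort()
    if len(indexed) < 2:
        return None
    return indexed[-2][1]


if __name__ == "__main__":
    # Hypothetical directory listing, for illustration only.
    listing = [
        "HpcSdm_000001.bin",
        "HpcSdm_000002.bin",
        "HpcSdm_000003.bin",
        "HpcNodeManager_000001.bin",
        "HpcNodeManager_000002.bin",
    ]
    print(pick_latest_log(listing, "HpcSdm"))
    print(pick_latest_log(listing, "HpcNodeManager"))
```

In practice the listing would come from the log folder (for example via `os.listdir`), and the chosen file would then be passed to the log parsing tool.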


    Qiufang Shi

    Monday, January 22, 2018 9:46 AM
  • Yup, I checked there, but no luck. Here's the command I ran

    And here's the subsequent file output. Note that HpcSdm_*3.txt is only 1 KB, despite the .bin file being 5 MB. Not sure if that's by design?

    Tuesday, January 30, 2018 10:43 PM
  • The .bin file with the highest index will always be empty, as it is a placeholder in case the disk fills up. You should check HpcSdm_000002.bin instead of 003.

    You should check the HpcNodeManager log in the hpcscheduler folder.


    Qiufang Shi

    Wednesday, January 31, 2018 6:08 AM
  • Got it. Here's what 2.bin gave me. Looks like it connects up to the DB ok. Not sure why the service won't stay running.

    FYI, this is happening during install. If I click "ok" on the error, it rolls back the install... thus removing the other installed services.

    s,01/30/2018 22:28:33.014, SrcFile="logging.cpp" SrcFunc="Logger::ReadRules" SrcLine="971" Pid="4840" Tid="1680" TS="0x01d39a19a95c8488" String1="Section name: 'LogRules_CSL' (customSectionName = 'CSL'), MaxCrashDumpUsage: 40000000000, SkipCrashDumpRamThresholdPercent: 0. File Configurations: [HpcSdm_CSL: MaxTextFileSize=4194304, MaxDiskUsage=524288000, MemoryMapped=1, EmitBinaryLogs=1 InlineCompressionLevel=6 WriteBufferSize=16384] " 
    i,01/30/2018 22:28:33.014, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a95c8488" String1="[HpcSdm  ] Store service initializing" 
    i,01/30/2018 22:28:33.014, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a19a95c8488" String1="TimeIntervalMs=673718,EntriesProcessed=0, BytesProcessed=0, MaxQueuedBytes=0" 
    i,01/30/2018 22:28:33.030, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a95ee6c2" String1="[SdmCore ] Creating server channel SdmChannelV4" 
    i,01/30/2018 22:28:33.046, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a961492d" String1="[SdmCore ] Creating server sink" 
    i,01/30/2018 22:28:33.046, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a961492d" String1="[SdmCore ] Creating server channel SdmChannelV6" 
    i,01/30/2018 22:28:33.046, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a961492d" String1="[SdmCore ] Creating server sink" 
    i,01/30/2018 22:28:33.046, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a961492d" String1="[HpcSdm  ] Store object created" 
    i,01/30/2018 22:28:33.139, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a96f97c0" String1="[HpcSdm  ] The store successfully connected to the SQL database." 
    i,01/30/2018 22:29:33.211, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a19cd3dce8f" String1="TimeIntervalMs=60181,EntriesProcessed=9, BytesProcessed=1463, MaxQueuedBytes=1024" 
    i,01/30/2018 22:30:33.317, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a19f1113528" String1="TimeIntervalMs=60100,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:31:33.480, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1a14ed70da" String1="TimeIntervalMs=60158,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:32:33.581, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1a38c01a05" String1="TimeIntervalMs=60095,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:33:33.700, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1a5c957e41" String1="TimeIntervalMs=60115,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:34:33.814, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1a806a210e" String1="TimeIntervalMs=60112,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:35:33.961, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1aa443dce8" String1="TimeIntervalMs=60145,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:36:34.099, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1ac81c2e35" String1="TimeIntervalMs=60136,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:37:34.238, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1aebf49c18" String1="TimeIntervalMs=60137,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:38:34.380, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1b0fcda3c9" String1="TimeIntervalMs=60145,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:39:34.523, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1b33a6b0fc" String1="TimeIntervalMs=60136,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:40:34.681, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1b57821658" String1="TimeIntervalMs=60157,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:41:34.786, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1b7b556b73" String1="TimeIntervalMs=60104,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:42:34.970, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1b9f34b86d" String1="TimeIntervalMs=60183,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:43:35.101, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1bc30bffe0" String1="TimeIntervalMs=60130,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:44:35.269, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1be6e8eea1" String1="TimeIntervalMs=60167,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:45:35.441, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a1c0ac66e82" String1="TimeIntervalMs=60171,EntriesProcessed=1, BytesProcessed=112, MaxQueuedBytes=128" 
    i,01/30/2018 22:46:14.461, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="3712" TS="0x01d39a1c220856d4" String1="[HpcSdm  ] Store Service shutting down." 
    i,01/30/2018 22:46:14.461, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="4840" Tid="1648" TS="0x01d39a1c220856d4" String1="Current Application Domain ProcessExit event invoked" 
    i,01/30/2018 22:46:14.461, SrcFile="HpcTrace" SrcFunc="" SrcLine="0" Pid="4840" Tid="1648" TS="0x01d39a1c220856d4" String1="Cosmos Logger is being closed" 
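Most of the output above is the once-a-minute `CheckDumpMemoryLogCounters` entry, which makes the handful of service messages hard to spot. A minimal sketch for filtering it, assuming every line follows the `level,timestamp, Key="value" ...` shape seen above; `parse_line` and `service_messages` are hypothetical helper names, not part of HPC Pack.

```python
import re
from datetime import datetime

# Line shape assumed from the output above: level,timestamp, Key="value" ...
LINE_RE = re.compile(
    r'^\s*(?P<level>[a-z]),'
    r'(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3}),\s*(?P<rest>.*)$'
)
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Split one parsed-log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = dict(FIELD_RE.findall(match.group("rest")))
    entry["level"] = match.group("level")
    entry["time"] = datetime.strptime(match.group("ts"), "%m/%d/%Y %H:%M:%S.%f")
    return entry


def service_messages(lines):
    """Yield parsed entries, skipping the periodic memory-log counter lines."""
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.get("SrcFunc") == "CheckDumpMemoryLogCounters":
            continue
        yield entry


if __name__ == "__main__":
    sample = [
        'i,01/30/2018 22:28:33.139, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="1680" TS="0x01d39a19a96f97c0" String1="[HpcSdm  ] The store successfully connected to the SQL database." ',
        'i,01/30/2018 22:29:33.211, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="4840" Tid="1596" TS="0x01d39a19cd3dce8f" String1="TimeIntervalMs=60181,EntriesProcessed=9, BytesProcessed=1463, MaxQueuedBytes=1024" ',
        'i,01/30/2018 22:46:14.461, SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="4840" Tid="3712" TS="0x01d39a1c220856d4" String1="[HpcSdm  ] Store Service shutting down." ',
    ]
    for entry in service_messages(sample):
        print(entry["time"].isoformat(), entry["String1"])
```

Run against the full text file, this leaves only the startup, SQL connection, and shutdown messages, which makes the gap between 22:28 and 22:46 easier to see.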
    

    Wednesday, January 31, 2018 2:01 PM
  • You should check the HpcNodeManager log instead of the SDM service log, since the reason for your installation failure is "The HPC Node Manager Service service on Local Computer started and then stopped".

    Qiufang Shi

    Thursday, February 1, 2018 1:57 AM
  • That is what the above output is from. I'm going to have the servers rebuilt.
    Monday, February 5, 2018 5:32 PM