I have found a problem where HPCManagement takes a long time to start up (over 10 minutes) and appear to have found the problem.
In EventLog I see:
The HPC Management Service service hung on starting.
The HPC Session Service service hung on starting.
The HPC Web Service service hung on starting.
I can also see in the logs (extracted with hpctrace)
SDM log:
03/29/2018 10:55:45.736 i HpcSdm.exe 1864 1892 [WcfHost] Start WCF endpoint on net.tcp://localhost:9893/Sdm
03/29/2018 10:55:45.799 i HpcSdm.exe 1864 1528 [WcfHost] End to setup WCF channel
03/29/2018 10:55:45.814 i HpcSdm.exe 1864 1528 [WcfHost] Begin to setup WCF channel on net.tcp://localhost:9893/SdmInternal, thumbprint is 76464D1DCD22C0B6E181F6335B57622B41F44A3F
03/29/2018
10:55:45.830 i HpcSdm.exe 1864 1520 [WcfHost] End to setup internal WCF channel
03/29/2018 11:07:15.009 i HpcSdm 1864 1528 [HpcSdm ] Query documents
HPCManagement log:
03/29/2018 10:55:45.674 i HpcManagement.exe 2836 2916 [WcfProxy] Begin connect to endpointStr net.tcp://SOMEHOSTNAME:9893/SdmInternal, dnsIdentityName 1.2.3.4
03/29/2018 10:55:45.674 i HpcManagement.exe 2836 2916 [WcfProxy] End to create internal wcf proxy
03/29/2018
10:55:45.674 i HpcManagement 2836 2916 [HpcManagement] Successfully connected to the store on the headnode
03/29/2018 11:07:12.963 w HpcManagement.exe 2836 2916 [WcfProxy] Channel to net.tcp://SOMEHOSTNAME:9893/SdmInternal faulted.
03/29/2018 11:07:12.963 i HpcManagement.exe 2836 2916 [RetryManager] System.ServiceModel.CommunicationException:The socket connection was aborted. This could be caused by an error
processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:09:59.4843742'. RetryCount: 1 . Caller: InvokeWithRetry in E:\bt\864787\repo\src\Management\Sdm\sdmcore\Store\SdmRetry.cs:line
31
03/29/2018 11:07:14.978 w HpcManagement 2836 2916 [Store ] WCF Connection need to be rebuilt.
03/29/2018 11:07:14.978 i HpcManagement 2836 2916 [Store ] Store is trying to connect to server SOMEHOSTNAME.
03/29/2018 11:07:14.978 i HpcManagement 2836 2916 [Store ] Connect to SDM through internal connect mode.
03/29/2018 11:07:14.978 i HpcManagement.exe 2836 2916 [WcfProxy] Begin connect to endpointStr net.tcp://SOMEHOSTNAME:9893/SdmInternal, dnsIdentityName 1.2.3.4
03/29/2018 11:07:14.978 i HpcManagement.exe 2836 2916 [WcfProxy] End to create internal wcf proxy
You can see that the HpcSdm WCF channel was listening at 10:55:45.830 BUT the HPCManagement attempted to connect 150ms earlier at 10:55:45.674
This fails, and then the retry waits another 10 minutes before failing, then the next attempt works.
HpcManagement service correctly has a dependency on HpcSdmStore, but it appears that HpcSdmStore returns with "started" too soon, creating the race condition.
I have worked around this by configuring HpcManagement windows service to "sc.exe config HpcManagement start= delayed-auto", which means HPCManagement is reachable after about 2 minutes instead of about 12.
Can you please take a look at the underlying problem?