none
Race condition between HPCManagement and HPCSdmStore causes slow startup of Windows for HPC Pack 2016 5.1.6086.0

    Question

  • I have found a problem where HPCManagement takes a long time to start up (over 10 minutes) and appear to have found the problem.

    In EventLog I see:

    The HPC Management Service service hung on starting.
    The HPC Session Service service hung on starting.
    The HPC Web Service service hung on starting.


    I can also see in the logs (extracted with hpctrace)

    SDM log:

    03/29/2018    10:55:45.736    i    HpcSdm.exe    1864    1892    [WcfHost] Start WCF endpoint on net.tcp://localhost:9893/Sdm  
    03/29/2018    10:55:45.799    i    HpcSdm.exe    1864    1528    [WcfHost] End to setup WCF channel  
    03/29/2018    10:55:45.814    i    HpcSdm.exe    1864    1528    [WcfHost] Begin to setup WCF channel on net.tcp://localhost:9893/SdmInternal, thumbprint is 76464D1DCD22C0B6E181F6335B57622B41F44A3F  
    03/29/2018    10:55:45.830    i    HpcSdm.exe    1864    1520    [WcfHost] End to setup internal WCF channel  
    03/29/2018    11:07:15.009    i    HpcSdm    1864    1528    [HpcSdm  ] Query documents    

    HPCManagement log:

    03/29/2018    10:55:45.674    i    HpcManagement.exe    2836    2916    [WcfProxy] Begin connect to endpointStr net.tcp://SOMEHOSTNAME:9893/SdmInternal, dnsIdentityName 1.2.3.4  
    03/29/2018    10:55:45.674    i    HpcManagement.exe    2836    2916    [WcfProxy] End to create internal wcf proxy  
    03/29/2018    10:55:45.674    i    HpcManagement    2836    2916    [HpcManagement] Successfully connected to the store on the headnode  
    03/29/2018    11:07:12.963    w    HpcManagement.exe    2836    2916    [WcfProxy] Channel to net.tcp://SOMEHOSTNAME:9893/SdmInternal faulted.  
    03/29/2018    11:07:12.963    i    HpcManagement.exe    2836    2916    [RetryManager] System.ServiceModel.CommunicationException:The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:09:59.4843742'. RetryCount: 1 . Caller: InvokeWithRetry in E:\bt\864787\repo\src\Management\Sdm\sdmcore\Store\SdmRetry.cs:line 31  
    03/29/2018    11:07:14.978    w    HpcManagement    2836    2916    [Store   ] WCF Connection need to be rebuilt.  
    03/29/2018    11:07:14.978    i    HpcManagement    2836    2916    [Store   ] Store is trying to connect to server SOMEHOSTNAME.  
    03/29/2018    11:07:14.978    i    HpcManagement    2836    2916    [Store   ] Connect to SDM through internal connect mode.  
    03/29/2018    11:07:14.978    i    HpcManagement.exe    2836    2916    [WcfProxy] Begin connect to endpointStr net.tcp://SOMEHOSTNAME:9893/SdmInternal, dnsIdentityName 1.2.3.4  
    03/29/2018    11:07:14.978    i    HpcManagement.exe    2836    2916    [WcfProxy] End to create internal wcf proxy  

    You can see that the HpcSdm WCF channel was listening at 10:55:45.830 BUT the HPCManagement attempted to connect 150ms earlier at 10:55:45.674
    This fails, and then the retry waits another 10 minutes before failing, then the next attempt works.
    HpcManagement service correctly has a dependency on HpcSdmStore, but it appears that HpcSdmStore returns with "started" too soon, creating the race condition.
    I have worked around this by configuring HpcManagement windows service to "sc.exe config HpcManagement start= delayed-auto", which means HPCManagement is reachable after about 2 minutes instead of about 12.

    Can you please take a look at the underlying problem?



    • Edited by TimJRoberts1 Thursday, March 29, 2018 12:56 PM formatting
    Thursday, March 29, 2018 12:53 PM

All replies