CCP_SERVICE_HANG_TIMEOUT -1

  • Question

  • Hi,

    I've noticed that one of my HpcServiceHost.exe processes was still running 15 days after it first started. From its environment variables I can see that CCP_SERVICE_HANG_TIMEOUT is set to -1. Can I change it to 12 hours, or some other value less than infinity (assuming -1 means infinity in this case)? Are there any other ways to automatically kill such zombie service hosts? The job finished successfully on the day it started, but this zombie service host is still running and doing only some basic cosmos logging.

    Thanks in advance.

    Tuesday, April 30, 2019 6:15 AM

All replies

  • Hi EjazAhmed,

    The CCP_SERVICE_HANG_TIMEOUT is triggered when the service host has requests in processing but no new requests or responses are being received or sent. That may suggest that the service code or the service host is hanging. In your case, since all requests were processed and the service job finished, CCP_SERVICE_HANG_TIMEOUT should not take effect. Normally there should be no runaway HpcServiceHost.exe on the compute node once the session job is finished. Could you help collect the following info and send it to hpcpack@microsoft.com so that we can further investigate this issue?

    1. Indicate the HPC Pack version, shown under HPC Cluster Manager -> Help -> About.

    2. Indicate the ID of the session job that leaked the runaway HpcServiceHost.exe. If the process is still there, you can use the Process Explorer tool to check, from its environment variables, which job/task IDs the service host belongs to.

    3. Collect the job scheduler service logs (HpcScheduler_*.bin files) under %CCP_DATA%LogFiles\Scheduler on the head node, and the node manager service logs (HpcNodeManager_*.bin files) under %CCP_DATA%LogFiles\Scheduler on the compute node.
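
To make the trigger condition described at the top of this reply concrete, here is a minimal sketch of that rule in Python. It is an illustration only, not HPC Pack's actual implementation: the function name, its parameters, and the time unit are hypothetical, and treating -1 as "never fire" is the assumption made in the question, not something confirmed here.

```python
INFINITE = -1  # assumption: -1 means "no hang timeout", as guessed in the question


def hang_timeout_expired(outstanding_requests: int,
                         idle_time: float,
                         timeout: float = INFINITE) -> bool:
    """Return True if the hang timeout, as described in the reply, would fire.

    outstanding_requests: requests accepted by the host but not yet answered.
    idle_time: time since the last request was received or response was sent.
    timeout: hypothetical CCP_SERVICE_HANG_TIMEOUT value, in the same unit
             as idle_time (the real unit is not stated in this thread).
    """
    if timeout == INFINITE:
        return False  # an infinite timeout never fires
    if outstanding_requests == 0:
        return False  # nothing in flight: the host is idle, not hung
    # Requests are in flight but there has been no traffic for too long.
    return idle_time >= timeout
```

Under this reading, a host whose requests have all completed (`outstanding_requests == 0`) never trips the timeout, whatever value is configured. That is why lowering the setting from -1 would not be expected to clean up the leftover process in this case.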

    Regards,

    Yutong Sun

     

    Sunday, May 5, 2019 5:49 AM
    Moderator