HPC Pack 2016 Update 2 will not start after server reboot

  • Question

  • I installed Windows Server 2016 on a Workstation 15.0 VM.

    Applied the Microsoft patches.

    Installed HPC Pack 2016 Update 2.

    All looks good. I can connect to the HPC server using HPC Cluster Manager.

    All HPC services are running.

    All installations were completed as Administrator, following the steps on the Microsoft HPC installation site.

    After rebooting the server, the HPC services do not start, and I cannot start them manually.

    When I try to connect using HPC Cluster Manager, I get the following alert:

    The connection to the scheduler service failed. detail error: System.AggregateException: One or more errors occurred. ---> Microsoft.Hpc.RetryCountExhaustException: Retry Count of RetryManager is exhausted. ---> System.Net.Http.HttpRequestException: Response status code does not indicate success: 404 (Not Found).

    I have googled this error but found nothing. Can anybody advise on this, please?

    I have tried recreating the VM many times with slightly different configurations. The result is always the same: after a reboot, the HPC services do not start.

    Tuesday, November 6, 2018 10:31 AM

All replies

  • Can you check the service logs to see why the services do not start?

    The logs are located under %CCP_DATA%LogFiles\ServiceName\*.bin

    You can use %CCP_HOME%BIN\LogParser.exe to parse the .bin logs into plain text.

    Qiufang Shi

    Tuesday, November 6, 2018 8:44 PM
  • Hi Qiufang Shi,

    Thank you for this suggestion. I have parsed the contents of all the log files, but nothing there indicates why the services have not started. The last entries are from just after installing HPC Pack 2016 Update 2, just before the server was restarted. It is almost as if no attempt was made to start the services.

    I have verified that all HPC services are set to start automatically. After a restart, Event Viewer shows errors for each HPC service similar to:

    "A timeout was reached (30000 milliseconds) while waiting for the HpcSdm service to connect."

    Followed shortly after with:

    "The HpcSdm service failed to start due to the following error: 
    The service did not respond to the start or control request in a timely fashion."

    There is nothing in the Security log to indicate that a service lacks permission to run.

    Has anybody tried installing Windows Server 2016 on a Workstation 15.0 VM and then adding HPC Pack 2016 Update 2?

    Wednesday, November 7, 2018 8:23 AM
  • Could you also check the HpcSdm service log?

    Qiufang Shi

    Wednesday, November 7, 2018 4:12 PM
  • After a system restart, the following line is in the HpcSdm log file:

    SrcFile="HpcSdm.exe" SrcFunc="" SrcLine="0" Pid="5072" Tid="2244" TS="0x01d478209ff63f36" String1="Unable to load file SqlConnectionStringProvider.dll. Exception System.IO.FileNotFoundException: Could not load file or assembly 'file:///C:\Program Files\Microsoft HPC Pack 2016\Bin\SqlConnectionStringProvider.dll' or one of its dependencies. The system cannot find the file specified...File name: 'file:///C:\Program Files\Microsoft HPC Pack 2016\Bin\SqlConnectionStringProvider.dll'..   at System.Reflection

    This is followed by multiple lines of:

    SrcFile="HpcSdm" SrcFunc="" SrcLine="0" Pid="5072" Tid="2244" TS="0x01d47820c3e228fd" String1="[Store   ] Exception:.System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections.

    Because this service cannot start, the other HPC services do not start either.

    I have verified that SqlConnectionStringProvider.dll is not in the location specified.

    How is this file created?

    Friday, November 9, 2018 12:10 PM
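When scanning many parsed log lines like the two quoted above, it can help to split each line into its Key="value" fields. The following is a minimal Python sketch, not part of HPC Pack. It assumes the TS field is a Windows FILETIME (100 ns ticks since 1601-01-01 UTC); the helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches Key="value" pairs as they appear in the parsed log lines above.
FIELD_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumption: TS is a Windows FILETIME (100 ns ticks since 1601-01-01 UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def parse_fields(line):
    """Return the Key="value" pairs of one parsed log line as a dict."""
    return dict(FIELD_RE.findall(line))


def filetime_to_datetime(hex_ts):
    """Convert a hex tick count such as 0x01d478209ff63f36 to a UTC datetime."""
    ticks = int(hex_ts, 16)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Header fields copied from the first log line quoted above.
sample = ('SrcFile="HpcSdm.exe" SrcFunc="" SrcLine="0" Pid="5072" '
          'Tid="2244" TS="0x01d478209ff63f36"')
fields = parse_fields(sample)
print(fields["SrcFile"], fields["Pid"], filetime_to_datetime(fields["TS"]))
```

Under the FILETIME assumption, the sample TS decodes to 9 November 2018 (UTC), which is consistent with the date of the post. A String1 value that is truncated before its closing quote is simply skipped by the pattern.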
  • Thanks Andy,

    You can ignore the first warning, but the second is the real error you are facing: the service cannot establish a connection to the management database. Could you check why?

    The database connection string should be loaded from HKLM\Software\Microsoft\HPC\Security\ManagementDbConnectionString.

    The service will then try to use the head node's machine account to connect to the SQL database.

    Qiufang Shi

    Friday, November 9, 2018 6:49 PM
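To see which SQL Server instance the service is being pointed at, the connection string can be read and split into its parts. This is a rough Python sketch under two assumptions: that ManagementDbConnectionString is a value under the HKLM\Software\Microsoft\HPC\Security key, and that it uses the usual `Key=Value;Key=Value` layout. The example string and the function names are hypothetical.

```python
def parse_connection_string(conn_str):
    """Split a 'Key=Value;Key=Value' SQL connection string into a dict."""
    parts = {}
    for item in conn_str.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parts[key.strip().lower()] = value.strip()
    return parts


def read_management_connection_string():
    """Read the registry value; returns None off Windows or if it is missing."""
    try:
        import winreg  # standard library, available on Windows only
    except ImportError:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                            r"Software\Microsoft\HPC\Security") as key:
            value, _value_type = winreg.QueryValueEx(
                key, "ManagementDbConnectionString")
            return value
    except OSError:
        return None


# Hypothetical example value, used only when the registry cannot be read.
EXAMPLE = (r"Data Source=.\COMPUTECLUSTER;Initial Catalog=ExampleDb;"
           "Integrated Security=True")

conn_str = read_management_connection_string() or EXAMPLE
info = parse_connection_string(conn_str)
print("SQL instance the service will try to reach:", info.get("data source"))
```

If the instance named in the output is not running or not reachable, that would match the SqlException in the HpcSdm log.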
  • I have seen this happen intermittently on VMs, especially ones with low CPU and RAM configurations. Because of the poor performance of the VM, SQL Server, which a couple of the HPC services require, starts slowly.

    Please check that your VM settings match the system requirements of HPC Pack 2016 Update 2. I think that is 8 cores and 16 GB of RAM.

    Monday, November 12, 2018 1:36 AM
  • Hi Chenling,

    Thank you for your time on this and for the details provided. I have verified that the SQL Server (COMPUTECLUSTER) service is running, the SQL Server Agent (COMPUTECLUSTER) service is stopped, and SQL Server Browser is running.

    We recreated the SqlConnectionStringProvider.dll file using details downloaded from here: download.microsoft.com/download/B/D/B/.../HpcSqlConnectionStringPlugin.pdf

    With the file placed in the location above and the server restarted, some of the HPC services started. However, the HPC Monitoring Client Service, HPC Monitoring Server Service, HPC Web Service and HPC Session Service did not start.

    We can now connect to the server using HPC Job Manager from a workstation on the network.

    I will check the VM settings and update these if required.



    Monday, November 12, 2018 11:05 AM
  • Hi Andy,

    Sorry for the late reply.

    You can ignore the warning about SqlConnectionStringProvider.dll; it relates to a separate feature for customizing the SQL connection string.

    As Richard mentioned, the problem may be due to insufficient resources in the VM. Could you try re-deploying the cluster with a head node of at least 8 cores and 16 GB of memory?


    Wednesday, November 28, 2018 6:37 AM
  • Hi Chenling,

    This turned out to be a resource issue. I increased the VM to 16 GB of RAM and 8 cores. The HPC system still failed to start after a reboot, but this was traced to the SQL Server service not starting at boot. Once SQL Server is started manually, the HPC services can be started and we can run jobs.

    Now it's just a case of configuring templates for different jobs.

    Thank you for your many updates on this. It was much appreciated that someone replied.

    Friday, November 30, 2018 2:04 PM
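The manual workaround described above (start SQL Server first, then the HPC services) could be scripted along these lines. This is only a sketch, assuming the standard `sc query` and `sc start` commands are used to check and start Windows services. The function names are hypothetical, and the service names in the final comment are assumptions: MSSQL$COMPUTECLUSTER follows the usual naming for a named SQL Server instance, and HpcSdm is the name shown in the event log.

```python
import subprocess
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll check() until it returns True or roughly timeout_s has elapsed."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


def is_running(service):
    """True if 'sc query' reports the Windows service as RUNNING."""
    result = subprocess.run(["sc", "query", service],
                            capture_output=True, text=True)
    return "RUNNING" in result.stdout


def start_service(service):
    """Ask the Service Control Manager to start the service."""
    subprocess.run(["sc", "start", service], check=False)


def start_in_order(sql_service, dependents, check=is_running,
                   start=start_service, **wait_kwargs):
    """Start SQL Server, wait until it is up, then start the dependents."""
    if not check(sql_service):
        start(sql_service)
    if not wait_until(lambda: check(sql_service), **wait_kwargs):
        return False  # SQL Server never came up; leave the dependents alone
    for name in dependents:
        if not check(name):
            start(name)
    return True


# Example call on the head node, from an elevated prompt (assumed names):
#   start_in_order("MSSQL$COMPUTECLUSTER", ["HpcSdm"])
```

The check and start functions are passed in as parameters so the ordering logic can be exercised without touching real services. Such a script would have to run after boot, for example from a scheduled task, which is itself an assumption about the setup.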
  • I changed the WCF service to run as the Local System account instead of specifying an account in the services panel.

    Sunday, November 10, 2019 6:15 PM