Single Node not reporting metrics

    Question

  • I have a single compute node that is not reporting metrics (CPU usage, memory usage, etc.) back to the head node.

    It's running the same image as 59 other nodes.  I've verified that the HPC services are running.

    Any ideas?

    Thanks!


    Ahren Simmons

    Monday, September 28, 2015 2:54 PM

Answers

  • This problem has been solved.  Thanks to your help, I realized that performance counters were not working, even locally through perfmon.  With some Google searching, I happened across this:

    https://support.microsoft.com/en-us/kb/300956

    Running "lodctr /R" (the switch is case sensitive) brought the counters back online. After a restart of the monitoring service, the metrics started coming in.

    Thanks again for pointing me in the right direction!


    Ahren Simmons

    • Marked as answer by Ahren Simmons Tuesday, September 29, 2015 12:54 PM
    Tuesday, September 29, 2015 12:54 PM
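
For anyone scripting the fix described above across several nodes, here is a minimal Python sketch of the two steps (rebuild the counters, then restart the monitoring service). The service name `HpcMonitoringClient` is an assumption taken from the source name in the logs, and the helper names are hypothetical; check the actual service name on your node before using it. The script only prints the commands unless `--apply` is passed on Windows, and it needs an elevated prompt to make changes.

```python
import subprocess
import sys


def build_repair_commands(service="HpcMonitoringClient"):
    """Return the commands for the fix: rebuild counters, restart the service.

    The default service name is an assumption; adjust it for your node.
    """
    return [
        ["lodctr", "/R"],          # rebuild performance counter registry settings
        ["net", "stop", service],  # stop the monitoring service
        ["net", "start", service], # start it again so it re-reads the counters
    ]


def repair(dry_run=True, service="HpcMonitoringClient"):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_repair_commands(service):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only execute for real on Windows and when --apply is given explicitly.
    apply_changes = sys.platform == "win32" and "--apply" in sys.argv
    repair(dry_run=not apply_changes)
```

Running it without arguments just lists the three commands, which is a safe way to review them first.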

All replies

  • What version of HPC Pack are you using? You can check whether the monitoring service on the compute node is running correctly.

    If it is v4 or later, you can also check the monitoring logs under %CCP_DATA%LogFiles\Monitoring (use "hpctrace parselog *.bin" to convert the logs to plain text).


    Qiufang Shi

    Monday, September 28, 2015 8:31 PM
  • We are using HPC Pack 2012 R2.  I verified the monitoring service is running.

    Thanks for the information on the logs!

    Inside the log files, the following repeats roughly every second:

    09/24/2015 06:40:34.513 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:34.513 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:35.466 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:35.466 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:36.419 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:36.419 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:37.372 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:37.372 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:38.325 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:38.325 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:39.279 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:39.279 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:40.248 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:40.248 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:41.201 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:41.201 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:42.154 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:42.154 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:43.107 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:43.107 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:44.060 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:44.060 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:45.014 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:45.014 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:45.967 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:45.967 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:46.920 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:46.920 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:47.873 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:47.873 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:48.827 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:48.827 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:49.780 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:49.780 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:50.733 v HpcMonitoringClient 2024 8792 Sampling counters  
    09/24/2015 06:40:50.733 w HpcMonitoringClient 2024 8792 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:51.686 v HpcMonitoringClient 2024 8552 Sampling counters  
    09/24/2015 06:40:51.686 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653  
    09/24/2015 06:40:52.639 v HpcMonitoringClient 2024 5468 Sampling counters  
    09/24/2015 06:40:52.639 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653  


    Ahren Simmons

    Tuesday, September 29, 2015 12:45 PM
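
The result value in the log above is easier to recognize in hexadecimal. Here is a small Python sketch that counts the failures in parsed log text and prints each result code in hex. It assumes the value is a PDH (Performance Data Helper) status code, which the thread does not confirm; under that assumption, 2147485653 is 0x800007D5, listed in the Windows SDK as PDH_NO_DATA. The lookup table is a small subset and the function names are hypothetical.

```python
import re
from collections import Counter

# Small subset of PDH status codes from the Windows SDK (pdhmsg.h).
# Treating the logged result as a PDH status is an assumption.
PDH_CODES = {
    0x800007D5: "PDH_NO_DATA",
    0xC0000BB8: "PDH_CSTATUS_NO_OBJECT",
    0xC0000BB9: "PDH_CSTATUS_NO_COUNTER",
}

# Matches the warning lines shown in the parsed monitoring log.
FAILURE_RE = re.compile(r"Failed to sample counters, result: (\d+)")


def describe(result):
    """Format a decimal result as hex plus a symbolic name when known."""
    name = PDH_CODES.get(result, "unknown")
    return f"0x{result:08X} ({name})"


def summarize(log_text):
    """Count how often each result code appears in the log text."""
    return Counter(int(m.group(1)) for m in FAILURE_RE.finditer(log_text))


sample = """\
09/24/2015 06:40:34.513 v HpcMonitoringClient 2024 8552 Sampling counters
09/24/2015 06:40:34.513 w HpcMonitoringClient 2024 8552 Failed to sample counters, result: 2147485653
09/24/2015 06:40:35.466 v HpcMonitoringClient 2024 5468 Sampling counters
09/24/2015 06:40:35.466 w HpcMonitoringClient 2024 5468 Failed to sample counters, result: 2147485653
"""

for code, count in summarize(sample).items():
    print(f"{count} x {describe(code)}")
```

A code that points at missing counter data is consistent with performance counters being broken on the node itself, rather than a network or head node problem.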
  • So did you just run the "lodctr /R" command on the node, or did you modify the registry as well, per the article?
    Friday, March 10, 2017 2:31 PM