Heat Map - Compute Nodes stop displaying info

  • Question

  • We are running HPC Pack 2012 R2, server version 4.5.5079.0 and client version 4.5.4095.0.  We recently moved the HPC DBs off SQL Express on the head node server to a dedicated SQL 2014 server.  The DB server sits in a different VLAN than the head node and the rest of the compute nodes.

    After all grid servers are rebooted, all 13 compute nodes display stats in Heat Map.  After a while (a couple of hours to a day) the compute nodes just stop displaying stats and we're left with just X's in the boxes.  Only 1 compute node keeps displaying stats.  Restarting the HPCManagement Service or the HPC Monitoring Client Service on the compute nodes does nothing.  I can't find anything in the Event Logs as to why.  In the HPCMonitoringServer log, I can see when it happens, as the compute node entries change from:

    652 Received packet from node SERVERNAMEXXX with version 1  
    652 Callback received UDP Packet  
    652 Listening for packets, buffer size is 1024  

    to

    18004 Entering Persist  
    18004 Persisting 15 nodes  
    18004 Persisting node: SERVERNAMEXXX  
    18004 Entering UpdateCalculatedMetrics  
    18004 Calculating cluster metrics time taken: 0 milliseconds, nodes processed: 15  
    18004 UMID: 1 | 1 : 60 instance values  
    18004 UMID: 2 | 0 : 60 instance values  
    18004 UMID: 3 | 1 : 60 instance values  
    18004 UMID: 4 | 0 : 60 instance values  
    18004 UMID: 5 | 0 : 60 instance values  
    18004 UMID: 6 | 0 : 60 instance values  
    18004 UMID: 7 | 1 : 60 instance values  
    18004 UMID: 11 | 26 : 60 instance values  
    18004 UMID: 11 | 27 : 60 instance values  
    18004 UMID: 11 | 28 : 60 instance values  
    18004 UMID: 11 | 3 : 60 instance values  
    18004 UMID: 11 | 29 : 60 instance values  
    18004 UMID: 11 | 1 : 60 instance values  
    18004 UMID: 11 | 2 : 60 instance values  
    18004 UMID: 11 | 30 : 60 instance values  
    18004 UMID: 11 | 31 : 60 instance values  
    18004 UMID: 12 | 32 : 60 instance values  

    The compute nodes continue to process jobs just fine.  We just lose their stats in Heat Map, which is an important piece for our business users.
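
For anyone comparing the two log patterns above, here is a minimal sketch of how to check which nodes are still sending packets. It assumes the HpcMonitoringServer log has already been exported to plain text in the format quoted above, and that the leading number on each line is a thread ID. The node names `NODE01` and `NODE02` and both function names are hypothetical, not part of HPC Pack.

```python
import re
from collections import Counter

# Matches the "Received packet" lines quoted in the post. The leading number
# is assumed to be a thread ID; the text-export format is an assumption.
PACKET_RE = re.compile(
    r"^\s*(\d+)\s+Received packet from node (\S+) with version (\d+)"
)


def count_packets_by_node(lines):
    """Return a Counter mapping node name -> number of 'Received packet' lines."""
    counts = Counter()
    for line in lines:
        match = PACKET_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def silent_nodes(expected_nodes, lines):
    """Return the expected nodes that never appear in a 'Received packet' line."""
    counts = count_packets_by_node(lines)
    return sorted(node for node in expected_nodes if counts[node] == 0)


if __name__ == "__main__":
    # Hypothetical sample: NODE02 is persisted but never sends a packet.
    sample = [
        "652 Received packet from node NODE01 with version 1",
        "652 Callback received UDP Packet",
        "652 Listening for packets, buffer size is 1024",
        "18004 Persisting node: NODE02",
    ]
    print(silent_nodes(["NODE01", "NODE02"], sample))
```

Running this over a window of log lines from before and after a node drops out of the Heat Map would show whether the head node stopped receiving packets from that node, or whether packets still arrive but are not displayed.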

    

    

    Wednesday, March 8, 2017 9:43 PM

All replies

  • Hi

    Is your client version really 4.5.4095?  I suppose it is 4.5.5094, since we don't have an official version 4.5.4095.

    Where do you run the client: on the head node or on a separate client machine?

    Is the head node's own heat map shown?

    Can you try the following options:

    1, Restart HpcClusterManager

    2, Restart the HpcMonitoringServer service on the head node

    If it still does not work, please email the following logs to us at hpcpack@microsoft.com:

    1, The HpcMonitoringServer log on the head node, under C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\Monitoring, with file names like HpcMonitoringServer_*.bin

    2, The HpcMonitoringClient log on one compute node that has the heat map issue, under C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\Monitoring, with file names like HpcMonitoringClient_*.bin

    3, The HpcClusterManager log on the machine where you run the client (UI), under %HOMEPATH%\AppData\Local\Microsoft\Hpc\LogFiles\ClusterManager
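
Gathering the logs listed above can be scripted. The following is a minimal sketch that zips every file matching one of the patterns into a single archive for emailing. The directory and file pattern come from the list above; the function name `collect_logs` and the archive name are hypothetical.

```python
import zipfile
from pathlib import Path


def collect_logs(log_dir, pattern, archive_path):
    """Zip every file in log_dir matching pattern; return the file names added."""
    log_dir = Path(log_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(log_dir.glob(pattern)):
            if path.is_file():
                # Store only the file name so the archive has a flat layout.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added


if __name__ == "__main__":
    # Directory and pattern as given in the reply; run this on the head node.
    monitoring_dir = (
        r"C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\Monitoring"
    )
    names = collect_logs(
        monitoring_dir,
        "HpcMonitoringServer_*.bin",
        "monitoring_server_logs.zip",  # hypothetical archive name
    )
    print(names)
```

The same function can be pointed at a compute node's Monitoring folder with the pattern `HpcMonitoringClient_*.bin`.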

    Thursday, March 9, 2017 2:11 AM
  • Sorry, yes, client is 4.5.5094 (typo).

    The client is installed on users' desktops.  But when we bring up Cluster Manager on the server, you can't see the compute nodes' stats there either.

    1.  Restart HPCClusterManager - have done that multiple times on server and client desktops

    2.  Restart HPCMonitoringServer Service on head node - Didn't change view.  All I see in Heat Map is the head node and a SQL DB server (not the HPC DB SQL Server) used for one of the applications that they run ETL jobs on.  The rest of the compute nodes (12) have no stats.

    Emailed log files at 10:42 a.m. CT today.

    Thanks!

    Thursday, March 9, 2017 4:43 PM
  • Hi,

    In the HpcMonitoringClient log, we can see the following warning:

    Order DateTime ThreadId ProcessId Level SrcFile Content
    3/4/2017 1:23:40 AM 11348 1828 HpcMonitoringClient

    Failed to sample counters, result: 2147485653

    This seems to be an issue related to the system performance counters.  You can refer to this link:

    https://social.microsoft.com/Forums/en-US/65051eb0-93f0-4b95-9511-06f1313f55f7/single-node-not-reporting-metrics?forum=windowshpcitpros
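
The decimal result in the warning above is easier to look up in hexadecimal. A minimal sketch follows, assuming the value is a Windows Performance Data Helper (PDH) status code, which the log itself does not confirm. The small name table is an assumption listed from the Windows SDK header `pdhmsg.h` and should be verified against that header; `describe_result` is a hypothetical helper.

```python
# Small subset of PDH status codes, assumed from the Windows SDK header
# pdhmsg.h. Verify against the header before relying on these names.
PDH_STATUS_NAMES = {
    0x800007D2: "PDH_MORE_DATA",
    0x800007D4: "PDH_RETRY",
    0x800007D5: "PDH_NO_DATA",
}


def describe_result(result):
    """Format a decimal result code as 32-bit hex plus a name, if known."""
    value = result & 0xFFFFFFFF  # normalize to an unsigned 32-bit value
    name = PDH_STATUS_NAMES.get(value, "unknown (not in this table)")
    return f"0x{value:08X} {name}"


# The result value from the HpcMonitoringClient warning.
print(describe_result(2147485653))  # prints "0x800007D5 PDH_NO_DATA"
```

Under that assumption, the warning would mean the counter query returned no data, which is consistent with a performance counter problem on the compute node.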

    Friday, March 10, 2017 2:07 AM
    So, we did a monthly reboot early this morning.  All compute nodes reported into Heat Map and we could see stats.  All good UNTIL about 8:53 a.m.  Then we lost a couple of nodes from view.  After a while we lost a few more, and we are now down to only 5 of 12 showing in Heat Map.  One thing we noticed when looking at the jobs running on the grid at the time we lost the node info: the CPU on those nodes was at 100%.  Once the CPU drops below 100%, the node still doesn't come back into Heat Map view.  We tried taking a node offline and online again to see if that would make it show up, but that did not work.  It seems only a reboot of the compute node brings it back into Heat Map view.

    Any ideas why the CPU hitting 100% might affect the collection of performance data at the head node?

    Wednesday, March 22, 2017 6:22 PM
  • Hi,

    When a compute node has lost its heat map, can you log in to that compute node, open "Perfmon.exe", and add counters to see whether it can show perf counters?

    BTW, as your server version is 4.5.5079.0, what is the version on the compute nodes?  Did you install any patch on the compute nodes?

    Thursday, March 23, 2017 1:51 AM
    Yes, I can log into the compute node, open Perfmon and see perf counters there.

    So, as a refresher, we rebooted the entire grid last Wednesday morning around 7:30 a.m.  All nodes reported in the heat map after reboot, but then started dropping out.  By 10:45 a.m. the same morning, we'd lost them all except a DB server compute node that is used for ETL processes.  In checking all compute nodes and the head node, all HPC servers were still running as expected.  We tried restarting some services on the compute nodes to get them back, with no success.  There are no event log entries recording the node going out of heat map view.

    The next morning, one of the compute nodes had returned.  Hmmmm???  In looking through the event logs, I noticed one of our Sys Admins had issued a "Bring online" command, even though the node already appeared online, as jobs were processing through just fine.  I tried the "Bring online" command with other nodes not reporting in heat map, but it failed to bring them back. :-)

    In looking through HPC database info, my co-worker found that an edit had been made to the compute node that came back in the evening.  In talking to the Sys Admin, he had done a right-click on the node, Edit Properties, and typed in a description for the node.  Doing that brought the node back into heat map view, and it has remained there since.  We've since done the same thing for all the nodes and they are now all reporting in the heat map.

    So, can you please tell me why something that simple worked, since the node already has a name?  Also, even though that seemed to be the fix, I noticed that the database server that has kept providing stats in the heat map since reboot did NOT have a description field completed for it.  So why did it remain while the other compute nodes dropped out?

    Monday, March 27, 2017 3:47 PM
  • "In looking through HPC database info, my co-worker found an edit had been changed to the compute node that came back in the evening.  In talking to the Sys Admin, he had done a right-click on the node, Edit Properties and typed in a description for the node."

    Which properties did your co-worker change on that node?  Was the operation done in HpcClusterManager?

    Actually, HpcMonitoringClient on the compute node is responsible for collecting perf counters and sending them to the HpcMonitoringServer service on the head node.

    Tuesday, March 28, 2017 7:10 AM