Windows HPC Server 2008 R2 Heat Map for nodes showing constant CPU activity

    Question

  • I have one head node and three compute nodes. All nodes are running Windows HPC Server 2008 R2 with the HPC Pack 2008 R2 installed.

    When the cluster is completely idle and not processing any jobs at all, I can start the HPC Cluster Manager on the head node, open Node Management, then look at the Heat Map for the compute nodes. Every compute node shows constant activity. They cycle continuously from 0% to some random number. It could go from 0 to 30% or 0 to 7.99% or 0 to 55%. It does this every second. I have logged in to the compute nodes and watched Task Manager and there is nothing running on them. I do not understand what this CPU usage is that the heat map is showing.
    Is there a way to get the heat map to be more "usable" or reliable in what it displays?
    Tuesday, June 7, 2011 6:34 PM

All replies

  • The HPC Heat Map should be showing periodic snapshots of the CPU Usage and/or Disk Throughput and/or whatever you have the Heat Map configured to show. In general, if you have it configured to show CPU Usage, then it should closely match whatever Windows Task Manager reports on the compute nodes. n.b. The Heat Map is not restricted to showing CPU Usage related to HPC jobs, but for everything which is running on the compute node.

    Now, the Windows Task Manager, by default, will only show you the applications which the logged-on user is running. The "Processes" tab in Windows Task Manager also provides a command button labelled "Show processes from all users". Clicking this button shows the processes from all users, including system processes and services which are always running. You can then add and sort by the "CPU" column, which shows you which processes are using the most CPU at any one moment in time.

    In addition, the "Performance" tab in Windows Task Manager shows you a real-time graph of CPU Usage. This should closely match what the HPC Heat Map is showing you, although the HPC Heat Map takes CPU Usage snapshots much less frequently than the Windows Task Manager so as not to use up too much network bandwidth.
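
To illustrate the sampling point above, here is a minimal sketch in Python of how a display that takes periodic snapshots can differ from a per-second graph. The function names, the sampling interval, and the CPU values are all made up for illustration; this is not a description of how HPC Pack itself collects its data.

```python
def snapshot_view(per_second_cpu, interval):
    """Values seen by a display that reads one sample every `interval` seconds."""
    if interval < 1:
        raise ValueError("interval must be at least 1 second")
    return per_second_cpu[::interval]


def averaged_view(per_second_cpu, interval):
    """Mean CPU % over each consecutive `interval`-second window."""
    if interval < 1:
        raise ValueError("interval must be at least 1 second")
    windows = [per_second_cpu[i:i + interval]
               for i in range(0, len(per_second_cpu), interval)]
    return [sum(w) / len(w) for w in windows]


# Hypothetical per-second CPU % on a mostly idle node (made-up numbers).
per_second = [0, 2, 0, 30, 0, 1, 0, 0, 55, 0, 0, 3]

snapshots = snapshot_view(per_second, 3)  # catches the 30% burst, misses the 55% one
averages = averaged_view(per_second, 3)   # smooths both bursts

print(snapshots)
print(averages)
```

A snapshot display can either land on a short burst or miss it entirely, so two tools watching the same node at different intervals will not necessarily show the same numbers.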

    Regards,

    Patrick

    Thursday, June 16, 2011 1:13 AM
  • This is the second "dead thread" on this subject with no answer posted.  Hopefully that means it was an easy fix.

    I am seeing the same thing.  If you look on the node with "All Users" enabled there is no actual load on the compute nodes.... at all!  So the heat map is seeing some Phantom Loads, not sure why.  We have 2 independent 128-node clusters doing this, and it never happened under V2 (2008 R1); it seems to be a feature of the R2 (v3) version.  We applied the new HPC Service Pack 2 but are still seeing phantoms.

     

    Cheers!

    Greg

    Wednesday, December 21, 2011 1:12 AM
  • Hi Greg,

    I'm not sure I understand you correctly. Do you mean that CPU usage in Cluster Manager has different values from what you see in Task Manager if you log in remotely to the nodes? Background tasks running on the node are also captured in Cluster Manager, although these loads are not scheduled by HPC Pack, and they can vary from time to time even if your cluster is idle.

     

    Michael

    Thursday, December 22, 2011 9:18 PM
  • I agree with what Greg said. The HPC heat map cpu usage does not even closely match what performance manager shows for the compute nodes on each server.

    If you watch the heat map, you will see wild cpu spikes for each compute node when they are otherwise sitting idle.

    Thursday, December 22, 2011 9:22 PM
  • Hi Brian and Greg,

    I'm trying to repro your problem, but so far I'm seeing a consistent CPU usage match between the node manager on the HN and performance manager on the CN. Could you please provide me with more information about your system?

    1. Which version of HPC Pack are you running? Are you running the latest version (v3sp3)?

    2. What is the OS version on your headnode/computenodes?

    3. Is it possible to ask you for a screenshot/video of the problem?

    Thanks.

     

    Michael

    Thursday, December 29, 2011 7:18 PM
  • Hi Michael,

    I am seeing cluster manager report CPU utilization spikes sometimes over 70% on nodes that are idle, and whose task manager shows no CPU utilization even with "Show processes from all users" enabled.

    The Phantom Loads jump between nodes at every refresh.  The same cluster before V3 would look very idle during idle times.

     

     

    Currently running:

    HPC Pack v3sp2, About shows Server and Client Version 3.2.3716.0

    Server = Windows Server HPC Edition (2008 R2) Service Pack 2

    Compute = Windows Server HPC Edition (2008 R2) Service Pack 1

     

    Here is a partial shot of the cluster while everything is idle on these nodes:

     

    I appreciate any insight.  Not critical for us, but it's annoying because I'm used to being able to trust the heat map when monitoring and troubleshooting.

     

    Cheers!

    Greg

     

    Friday, December 30, 2011 9:46 PM
  • I also notice that the Phantom loads only exist when there isn't a job on the node.  Nodes that have some/all cores in use don't seem to have this problem. 
    Saturday, December 31, 2011 3:45 AM
  • Hi Greg,

    Thanks for the screenshot. Could you also get a screenshot of Windows Task Manager on the node with the phantom load, taken at the same time?

    When you use Task Manager, please enable CPU kernel-mode usage by choosing "Show kernel times" from the View menu. You should see a red line in the diagram after that.


    Thanks again for the cooperation, everything you do will help us better understand the problem.

     

    Michael

    Tuesday, January 3, 2012 5:38 PM
  • Are you running any other applications on your nodes? This could be AV software or any other application monitoring tool.

     

    You may try a tool like xperf, which you can find in the Windows SDK inside the 'Windows Performance Toolkit', or something similar. You can install it on the head node and share the directory so it can be accessed from your compute nodes.

    Then run these commands against the nodes whose running processes and CPU usage you want to capture:

    clusrun \\headnode\perfshare\xperf -on HARD_FAULTS+PROC_THREAD

     

    clusrun \\headnode\perfshare\xperf -d c:\%%COMPUTERNAME%%.etl

     

    clusrun xcopy c:\*.etl \\headnode\perfshare\xperfresults\ /D /Y /I

    Use xperfview to load the ETL files and check which processes consume what.

    Wednesday, January 4, 2012 10:08 AM
  • No other apps are running.  It's a really dull configuration.  If they were real loads I would expect them to happen on nodes that have 50% of the cores in use, but they do not.  The phantom loads only happen on nodes that have no cores assigned.   I'll try and add some real detail later this week.

     

    Cheers!

    Greg

    Monday, January 9, 2012 3:51 AM
  • Hi Michael,

    The problem seems less pronounced when I am logged into a node, maxing out around 25% while neighboring nodes hit up to 80% loads.  All the nodes in these 2 shots are completely idle, and their CPU histograms with kernel time on are flatlined...

     

    In this shot I set all the CPUs to show in 1 histogram; again, absolutely no load in Task Manager yet 20% in the heat map...

    Monday, January 9, 2012 4:32 AM
  • And, for reference, here's a node with real load.  We undersubscribe to 4 cores to avoid hyperthreading performance problems, so 50% utilization is perfect utilization in this shot...  Notice there is no phantom loading at all on neighboring nodes as they are all in use by the current job.  This is how all 250 loaded nodes look when the job is running.  As soon as the last tasks exit a node it starts showing the phantom loads again until real work comes its way.


    Monday, January 9, 2012 4:41 AM
  • Thank you Greg for the details, I'm working on a repro right now and will get back to you once I find something, thanks:)

     

    Michael

    Monday, January 9, 2012 5:57 PM
  • Hi Greg,

    I have a repro for your problem and we are investigating the cause now.

    Thank you for bringing this up, it's users like you that will make HPC better and better, we really appreciate it.

     

    Michael

    Tuesday, January 10, 2012 5:45 PM
  • Hi Greg,

    After further investigation, we found that the loads are there; they are not phantom.

    HPC uses the perfmon API from Windows to report CPU loads, which is different from what Task Manager uses. Please see this article for details: http://support.microsoft.com/kb/810876

    As a simple experiment, you can start Performance Monitor and capture total CPU usage for 1 minute, sampling every second. Then compare the captured performance log with what you saw in the HPC node manager over the same period; the two should be consistent.
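
A minimal sketch of that comparison in Python, assuming the counter was logged to a CSV file, for example with `typeperf "\Processor(_Total)\% Processor Time" -si 1 -sc 60 -o cpu.csv`. The function name, the 20% spike threshold, and the sample rows are hypothetical; the log is assumed to hold a timestamp in the first column and the CPU percentage in the second.

```python
import csv
import io


def summarize_cpu_log(csv_text, spike_threshold=20.0):
    """Summarize a two-column (timestamp, CPU %) counter log given as CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    samples = []
    for row in reader:
        if len(row) < 2:
            continue
        try:
            samples.append(float(row[1]))
        except ValueError:
            continue  # skip blank or non-numeric samples
    if not samples:
        return {"count": 0, "mean": 0.0, "max": 0.0, "spikes": 0}
    return {
        "count": len(samples),
        "mean": sum(samples) / len(samples),
        "max": max(samples),
        "spikes": sum(1 for s in samples if s >= spike_threshold),
    }


# Made-up rows in the assumed layout; a real log would have about 60 of them.
SAMPLE_LOG = r'''"(PDH-CSV 4.0)","\\NODE\Processor(_Total)\% Processor Time"
"01/10/2012 22:04:01.000"," "
"01/10/2012 22:04:02.000","0.5"
"01/10/2012 22:04:03.000","36.5"
"01/10/2012 22:04:04.000","1.0"
'''

summary = summarize_cpu_log(SAMPLE_LOG)
print(summary)
```

The maximum and the spike count can then be set against what the Heat Map displayed for the same minute.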

    Thanks again for contacting us.

     

    Michael

     

    Tuesday, January 10, 2012 10:04 PM
  • Thanks for the update.  I will try to get a look at this next week.
    Friday, January 13, 2012 11:59 PM
  • I haven't had a chance to confirm this is the discrepancy by looking at perfmon, but even if it is the case, it still doesn't make sense to me.  If this was the real reason for the variation shown in the heat map I would expect to see those "Spikes" or phantom loads even when the node is processing a job. 

    As soon as 1 core is in use, the nodes' load becomes absolutely steady.  I will try to get the corresponding graphs as you have suggested but intellectually I am not sure it should matter.  Will try to run HPL or something to show the correlation.

    Cheers,

    Greg

     

     

    Wednesday, January 25, 2012 8:02 PM

  • Here's a capture to demonstrate that even perfmon is not showing the same phantom loads on Node249.  In this case HPC is showing 36.5% load, while perfmon is nowhere near that and hasn't been for over 30 seconds.  I confirmed it wasn't lag where perfmon showed the same spike seconds later.


    The thing that is making me crazy is, if I am not logged into the machine the spikes are much higher and more frequent; once I log in and start monitoring, they get much less frequent and lower.  Once a job starts they go away altogether.  Could be some ultra-low priority task that only runs when there's no other load... I just can't imagine anything that low of a priority running so frequently.

    Thanks for any insight on how to better capture this.

    Cheers!

    Greg

    Saturday, February 18, 2012 6:38 PM