HPC 2012 R2 nodes view not displaying correctly

  • Question

  • Hi,

    I'm new to HPC clusters, and we have a customer who is wondering why the Heat Map and CPU usage views do not show any information. Is there a service that may have failed and needs to be restarted? If we restart a service, will that have any impact on the jobs that are running?


    /Regards Andreas

    Monday, November 17, 2014 10:40 PM

Answers

  • It seems that rebooting the compute node can solve this issue. Can you try it on a few compute nodes first?
    • Marked as answer by Andreas2012 Tuesday, December 2, 2014 9:26 PM
    Tuesday, December 2, 2014 1:41 AM

All replies

  • Hi again,

    Just for information: we rebooted one node today, and when it came back up we could see CPU usage, cores in use, etc. So I guess some services need to be restarted, but will this impact any running jobs, and which services are responsible for the display?


    /Regards Andreas

    Monday, November 17, 2014 10:42 PM
  • Hi, Andreas,

    The performance counters for a compute node are collected by the "HPC Monitoring Client Service" on that node, which sends the data to the "HPC Monitoring Server Service" on the head node. You can check whether the client service is running on the compute node. Restarting that service should not impact the running jobs.

    Tuesday, November 18, 2014 1:53 AM
  • Hi,

    I checked several servers, and all the nodes are running the service. I tried restarting the service, but the issue remains. On one of the servers I could see the following error message, which I could not figure out, but I guess it has some impact.

    Please advise


    /Regards Andreas

    Tuesday, November 18, 2014 8:19 AM
  • So after you restart the "HPC Monitoring Client" service, is it OK?

    You can also check the logs under %CCP_HOME%\Data\LogFiles\Monitoring, which should be C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\Monitoring.

    Open a cmd console and run the following command to parse the logs:

    >hpctrace parselog HpcMonitoringClient*.bin

    Then take a look at the generated log and see whether it contains any errors. It should contain more detailed information; please share the detailed error info, as it should be helpful.

    Tuesday, November 18, 2014 8:42 AM
  • Hi

    I was not able to copy all the information due to VPN issues, but it generated a lot of files, and all of them contained the same information as in the image below. If you need more information, I can try to get the log files. Since I'm new to HPC, this does not help me much, hehe, so thanks for the support.


    /Regards Andreas

    Tuesday, November 18, 2014 9:05 AM
  • This error is not the same as the earlier one. Error code 258 is the system error "The wait operation timed out." It comes from a native call, so it seems the service cannot get the performance counter data from the system.

    Tuesday, November 18, 2014 9:32 AM
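
    For readers who want to check a code like 258 themselves, here is a minimal Python sketch. The function name describe_win32_error and the small fallback table are hypothetical and used only for illustration. On Windows it asks the operating system for the message text through ctypes.FormatError; elsewhere it falls back to the table, which only lists WAIT_TIMEOUT (258), the code discussed in this thread.

```python
import sys

# Fallback table for non-Windows machines. Only the code from this thread
# is listed; extend it yourself if you need more.
KNOWN_CODES = {
    258: "WAIT_TIMEOUT: The wait operation timed out.",
}


def describe_win32_error(code):
    """Return a human-readable description of a Win32 system error code."""
    if sys.platform == "win32":
        import ctypes
        # ctypes.FormatError is only available on Windows.
        return ctypes.FormatError(code)
    return KNOWN_CODES.get(code, "unknown Win32 error code %d" % code)


if __name__ == "__main__":
    print(describe_win32_error(258))
```

    A timeout here only says that the counter query did not return in time; it does not by itself say why the counters were slow to respond.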
  • Hi,

    I'm not sure what to make of it. I checked today's file, and here are the messages.

    Since the service cannot get the performance counter data from the system, do I need to restart every node? The one node I did restart is still working and showing performance data, but I guess it will fail after a couple of days...

    11/18/2014 08:07:15.425 w HpcMonitoringClient 1728 4568 Failed to sample counters, result: 258 
    11/18/2014 08:07:15.972 v HpcMonitoringClient 1728 4568 Sampling counters 
    11/18/2014 08:07:16.379 w HpcMonitoringClient 1728 4896 Failed to sample counters, result: 258 
    11/18/2014 08:07:16.930 v HpcMonitoringClient 1728 4896 Sampling counters 
    11/18/2014 08:07:17.336 w HpcMonitoringClient 1728 8376 Failed to sample counters, result: 258 
    11/18/2014 08:07:17.883 v HpcMonitoringClient 1728 8376 Sampling counters 
    11/18/2014 08:07:18.180 i HpcMonitoringClient 1728 7608 Check for metric update and configuration changes 
    11/18/2014 08:07:18.289 w HpcMonitoringClient 1728 5188 Failed to sample counters, result: 258 
    11/18/2014 08:07:18.836 v HpcMonitoringClient 1728 5188 Sampling counters 
    11/18/2014 08:07:19.242 w HpcMonitoringClient 1728 7364 Failed to sample counters, result: 258 
    11/18/2014 08:07:19.789 v HpcMonitoringClient 1728 7364 Sampling counters 
    11/18/2014 08:07:20.195 w HpcMonitoringClient 1728 8628 Failed to sample counters, result: 258 
    11/18/2014 08:07:20.742 v HpcMonitoringClient 1728 8628 Sampling counters 
    11/18/2014 08:07:21.149 w HpcMonitoringClient 1728 7904 Failed to sample counters, result: 258 
    11/18/2014 08:07:21.696 v HpcMonitoringClient 1728 7904 Sampling counters 
    11/18/2014 08:07:22.102 w HpcMonitoringClient 1728 4956 Failed to sample counters, result: 258 
    11/18/2014 08:07:22.649 v HpcMonitoringClient 1728 4956 Sampling counters 
    11/18/2014 08:07:23.055 w HpcMonitoringClient 1728 3640 Failed to sample counters, result: 258 
    11/18/2014 08:07:23.602 v HpcMonitoringClient 1728 7608 Sampling counters 
    11/18/2014 08:07:24.008 w HpcMonitoringClient 1728 3720 Failed to sample counters, result: 258 
    11/18/2014 08:07:24.555 v HpcMonitoringClient 1728 3720 Sampling counters 
    11/18/2014 08:07:24.961 w HpcMonitoringClient 1728 4688 Failed to sample counters, result: 258 
    11/18/2014 08:07:25.508 v HpcMonitoringClient 1728 4688 Sampling counters 
    11/18/2014 08:07:25.915 w HpcMonitoringClient 1728 7320 Failed to sample counters, result: 258 
    11/18/2014 08:07:26.461 v HpcMonitoringClient 1728 7320 Sampling counters 
    11/18/2014 08:07:26.868 w HpcMonitoringClient 1728 4464 Failed to sample counters, result: 258 
    11/18/2014 08:07:27.415 v HpcMonitoringClient 1728 4464 Sampling counters 
    11/18/2014 08:07:27.821 w HpcMonitoringClient 1728 7800 Failed to sample counters, result: 258 
    11/18/2014 08:07:28.368 v HpcMonitoringClient 1728 7800 Sampling counters 
    11/18/2014 08:07:28.774 w HpcMonitoringClient 1728 8236 Failed to sample counters, result: 258 
    11/18/2014 08:07:29.321 v HpcMonitoringClient 1728 3640 Sampling counters 
    11/18/2014 08:07:29.727 w HpcMonitoringClient 1728 944 Failed to sample counters, result: 258 
    11/18/2014 08:07:30.274 v HpcMonitoringClient 1728 944 Sampling counters 
    11/18/2014 08:07:30.681 w HpcMonitoringClient 1728 5760 Failed to sample counters, result: 258 
    11/18/2014 08:07:31.227 v HpcMonitoringClient 1728 5760 Sampling counters 
    11/18/2014 08:07:31.634 w HpcMonitoringClient 1728 8964 Failed to sample counters, result: 258 
    11/18/2014 08:07:32.181 v HpcMonitoringClient 1728 8964 Sampling counters 
    11/18/2014 08:07:32.587 w HpcMonitoringClient 1728 3468 Failed to sample counters, result: 258 
    11/18/2014 08:07:33.134 v HpcMonitoringClient 1728 3468 Sampling counters 
    11/18/2014 08:07:33.540 w HpcMonitoringClient 1728 4368 Failed to sample counters, result: 258 
    11/18/2014 08:07:34.087 v HpcMonitoringClient 1728 4368 Sampling counters 
    11/18/2014 08:07:34.493 w HpcMonitoringClient 1728 4488 Failed to sample counters, result: 258 
    11/18/2014 08:07:35.040 v HpcMonitoringClient 1728 4488 Sampling counters 
    11/18/2014 08:07:35.446 w HpcMonitoringClient 1728 8296 Failed to sample counters, result: 258 
    11/18/2014 08:07:35.993 v HpcMonitoringClient 1728 8296 Sampling counters 
    11/18/2014 08:07:36.400 w HpcMonitoringClient 1728 7940 Failed to sample counters, result: 258 
    11/18/2014 08:07:36.947 v HpcMonitoringClient 1728 7940 Sampling counters 
    11/18/2014 08:07:37.353 w HpcMonitoringClient 1728 9140 Failed to sample counters, result: 258 
    11/18/2014 08:07:37.900 v HpcMonitoringClient 1728 9140 Sampling counters 
    11/18/2014 08:07:38.306 w HpcMonitoringClient 1728 1232 Failed to sample counters, result: 258 
    11/18/2014 08:07:38.853 v HpcMonitoringClient 1728 1232 Sampling counters 
    11/18/2014 08:07:39.259 w HpcMonitoringClient 1728 8284 Failed to sample counters, result: 258 
    11/18/2014 08:07:39.806 v HpcMonitoringClient 1728 8284 Sampling counters 
    11/18/2014 08:07:40.212 w HpcMonitoringClient 1728 5524 Failed to sample counters, result: 258 
    11/18/2014 08:07:40.759 v HpcMonitoringClient 1728 5524 Sampling counters 
    11/18/2014 08:07:41.166 w HpcMonitoringClient 1728 2576 Failed to sample counters, result: 258 
    11/18/2014 08:07:41.713 v HpcMonitoringClient 1728 2576 Sampling counters 
    11/18/2014 08:07:42.119 w HpcMonitoringClient 1728 7440 Failed to sample counters, result: 258 
    11/18/2014 08:07:42.666 v HpcMonitoringClient 1728 7440 Sampling counters 
    11/18/2014 08:07:43.072 w HpcMonitoringClient 1728 5400 Failed to sample counters, result: 258 
    11/18/2014 08:07:43.619 v HpcMonitoringClient 1728 5400 Sampling counters 
    11/18/2014 08:07:44.025 w HpcMonitoringClient 1728 7116 Failed to sample counters, result: 258 
    11/18/2014 08:07:44.572 v HpcMonitoringClient 1728 7116 Sampling counters 
    11/18/2014 08:07:44.978 w HpcMonitoringClient 1728 1052 Failed to sample counters, result: 258 
    11/18/2014 08:07:45.525 v HpcMonitoringClient 1728 1052 Sampling counters 
    11/18/2014 08:07:45.932 w HpcMonitoringClient 1728 7520 Failed to sample counters, result: 258 
    11/18/2014 08:07:46.479 v HpcMonitoringClient 1728 7520 Sampling counters 
    11/18/2014 08:07:46.885 w HpcMonitoringClient 1728 6468 Failed to sample counters, result: 258 
    11/18/2014 08:07:47.432 v HpcMonitoringClient 1728 6468 Sampling counters 
    11/18/2014 08:07:47.839 w HpcMonitoringClient 1728 2584 Failed to sample counters, result: 258 
    11/18/2014 08:07:48.386 v HpcMonitoringClient 1728 2584 Sampling counters 
    11/18/2014 08:07:48.792 w HpcMonitoringClient 1728 4544 Failed to sample counters, result: 258 
    11/18/2014 08:07:49.339 v HpcMonitoringClient 1728 4544 Sampling counters 
    11/18/2014 08:07:49.745 w HpcMonitoringClient 1728 6744 Failed to sample counters, result: 258 
    11/18/2014 08:07:50.292 v HpcMonitoringClient 1728 6744 Sampling counters 
    11/18/2014 08:07:50.698 w HpcMonitoringClient 1728 6252 Failed to sample counters, result: 258 
    11/18/2014 08:07:51.245 v HpcMonitoringClient 1728 6252 Sampling counters 
    11/18/2014 08:07:51.651 w HpcMonitoringClient 1728 5924 Failed to sample counters, result: 258 
    11/18/2014 08:07:52.198 v HpcMonitoringClient 1728 5924 Sampling counters 
    11/18/2014 08:07:53.151 v HpcMonitoringClient 1728 8236 Sampling counters 
    11/18/2014 08:07:54.058 w HpcMonitoringClient 1728 7960 Failed to sample counters, result: 258 
    11/18/2014 08:07:54.105 v HpcMonitoringClient 1728 7960 Sampling counters 
    11/18/2014 08:07:54.793 e HpcMonitoringClient 1728 3148 Failed to initialize collector. Retrying in 60 seconds. System.ArgumentException: An item with the same key has already been added...   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)..   at Microsoft.Hpc.Monitoring.MetricCollector.AddCounter(String path, Int32 metricId, Int32 instanceId)..   at Microsoft.Hpc.Monitoring.MetricCollector.Initialize() 
    11/18/2014 08:07:55.011 w HpcMonitoringClient 1728 5232 Failed to sample counters, result: 258 
    11/18/2014 08:07:55.058 v HpcMonitoringClient 1728 8680 Sampling counters 
    11/18/2014 08:07:55.965 w HpcMonitoringClient 1728 6176 Failed to sample counters, result: 258 
    11/18/2014 08:07:56.011 v HpcMonitoringClient 1728 6176 Sampling counters 
    11/18/2014 08:07:56.918 w HpcMonitoringClient 1728 1168 Failed to sample counters, result: 258 
    11/18/2014 08:07:56.965 v HpcMonitoringClient 1728 5232 Sampling counters 
    11/18/2014 08:07:57.871 w HpcMonitoringClient 1728 4344 Failed to sample counters, result: 258 
    11/18/2014 08:07:57.918 v HpcMonitoringClient 1728 4344 Sampling counters 
    11/18/2014 08:07:58.824 w HpcMonitoringClient 1728 9068 Failed to sample counters, result: 258 
    11/18/2014 08:07:58.871 v HpcMonitoringClient 1728 1168 Sampling counters 
    11/18/2014 08:07:59.778 w HpcMonitoringClient 1728 8492 Failed to sample counters, result: 258 
    11/18/2014 08:07:59.824 v HpcMonitoringClient 1728 8492 Sampling counters 
    11/18/2014 08:08:00.731 w HpcMonitoringClient 1728 6956 Failed to sample counters, result: 258 
    11/18/2014 08:08:00.778 v HpcMonitoringClient 1728 9068 Sampling counters 
    11/18/2014 08:08:01.684 w HpcMonitoringClient 1728 6628 Failed to sample counters, result: 258 
    11/18/2014 08:08:01.731 v HpcMonitoringClient 1728 6628 Sampling counters 
    11/18/2014 08:08:02.638 w HpcMonitoringClient 1728 6520 Failed to sample counters, result: 258 
    11/18/2014 08:08:02.684 v HpcMonitoringClient 1728 6520 Sampling counters 
    11/18/2014 08:08:03.591 w HpcMonitoringClient 1728 8908 Failed to sample counters, result: 258 
    11/18/2014 08:08:03.638 v HpcMonitoringClient 1728 6956 Sampling counters 
    11/18/2014 08:08:04.544 w HpcMonitoringClient 1728 5952 Failed to sample counters, result: 258 
    11/18/2014 08:08:04.591 v HpcMonitoringClient 1728 5952 Sampling counters 
    11/18/2014 08:08:05.486 w HpcMonitoringClient 1728 4836 Failed to sample counters, result: 258 
    11/18/2014 08:08:05.548 v HpcMonitoringClient 1728 4836 Sampling counters 
    11/18/2014 08:08:06.439 w HpcMonitoringClient 1728 4224 Failed to sample counters, result: 258 
    11/18/2014 08:08:06.502 v HpcMonitoringClient 1728 8908 Sampling counters 
    11/18/2014 08:08:07.392 w HpcMonitoringClient 1728 2880 Failed to sample counters, result: 258 
    11/18/2014 08:08:07.455 v HpcMonitoringClient 1728 2880 Sampling counters 
    11/18/2014 08:08:08.345 w HpcMonitoringClient 1728 6700 Failed to sample counters, result: 258 
    11/18/2014 08:08:08.408 v HpcMonitoringClient 1728 6700 Sampling counters 
    11/18/2014 08:08:09.299 w HpcMonitoringClient 1728 5712 Failed to sample counters, result: 258 
    11/18/2014 08:08:09.361 v HpcMonitoringClient 1728 4224 Sampling counters 
    11/18/2014 08:08:10.252 w HpcMonitoringClient 1728 92 Failed to sample counters, result: 258 
    11/18/2014 08:08:10.314 v HpcMonitoringClient 1728 92 Sampling counters 
    11/18/2014 08:08:11.205 w HpcMonitoringClient 1728 8500 Failed to sample counters, result: 258 
    11/18/2014 08:08:11.268 v HpcMonitoringClient 1728 8500 Sampling counters 
    11/18/2014 08:08:12.174 w HpcMonitoringClient 1728 8120 Failed to sample counters, result: 258 
    11/18/2014 08:08:12.221 v HpcMonitoringClient 1728 5712 Sampling counters 
    11/18/2014 08:08:13.127 w HpcMonitoringClient 1728 4092 Failed to sample counters, result: 258 
    11/18/2014 08:08:13.174 v HpcMonitoringClient 1728 4092 Sampling counters 
    11/18/2014 08:08:14.080 w HpcMonitoringClient 1728 6036 Failed to sample counters, result: 258 
    11/18/2014 08:08:14.127 v HpcMonitoringClient 1728 8120 Sampling counters 
    11/18/2014 08:08:15.034 w HpcMonitoringClient 1728 1444 Failed to sample counters, result: 258 
    11/18/2014 08:08:15.081 v HpcMonitoringClient 1728 1444 Sampling counters 
    11/18/2014 08:08:15.987 w HpcMonitoringClient 1728 4568 Failed to sample counters, result: 258 
    11/18/2014 08:08:16.034 v HpcMonitoringClient 1728 6036 Sampling counters 
    11/18/2014 08:08:16.940 w HpcMonitoringClient 1728 4896 Failed to sample counters, result: 258 
    11/18/2014 08:08:16.987 v HpcMonitoringClient 1728 4896 Sampling counters 
    11/18/2014 08:08:17.893 w HpcMonitoringClient 1728 8376 Failed to sample counters, result: 258 
    11/18/2014 08:08:17.940 v HpcMonitoringClient 1728 4568 Sampling counters 
    11/18/2014 08:08:18.362 i HpcTrace 1728 1740 Current Application Domain ProcessExit event invoked 
    11/18/2014 08:08:18.362 i HpcTrace 1728 1740 Cosmos Logger is being closed 
    11/18/2014 08:37:52.576 i HpcMonitoringClient 4796 7244 Adding counter: 2|0 : \Memory\Pages/sec 
    11/18/2014 08:37:52.576 i HpcMonitoringClient 4796 7244 Adding counter: 3|0 : \Memory\Available MBytes 
    11/18/2014 08:37:52.576 i HpcMonitoringClient 4796 7244 Adding counter: 4|0 : \System\Context switches/sec 
    11/18/2014 08:37:52.591 i HpcMonitoringClient 4796 7244 Adding counter: 5|0 : \System\System Calls/sec 
    11/18/2014 08:37:52.591 i HpcMonitoringClient 4796 7244 Adding counter: 6|1 : \PhysicalDisk(_Total)\Disk Bytes/sec 
    11/18/2014 08:54:45.300 i HpcMonitoringClient 4796 7244 Adding counter: 7|1 : \LogicalDisk(_Total)\Avg. Disk Queue Length 


    /Regards Andreas

    Tuesday, November 18, 2014 11:59 AM
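
    A parsed log like the one above is mostly repetition, so it can help to count the distinct warnings and errors instead of reading every line. The following Python sketch does that for lines in the layout shown in the post (date, time, level letter, source, process id, thread id, message). The regular expression, the function name summarize, and the reading of the level letters as w = warning, e = error, i = info, v = verbose are assumptions based only on the excerpt, not on documented hpctrace behavior.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the excerpt in this thread:
#   MM/DD/YYYY HH:MM:SS.mmm <level> <source> <pid> <tid> <message>
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[a-z]) (?P<source>\S+) (?P<pid>\d+) (?P<tid>\d+) (?P<message>.*)$"
)


def summarize(lines):
    """Count lines per level and count distinct warning/error messages."""
    levels = Counter()
    problems = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and anything in another format
        levels[match["level"]] += 1
        if match["level"] in ("w", "e"):
            # Truncate so long stack traces group under a short key.
            problems[match["message"].strip()[:80]] += 1
    return levels, problems


# Sample lines copied from the log posted above.
sample = [
    "11/18/2014 08:07:15.425 w HpcMonitoringClient 1728 4568 Failed to sample counters, result: 258 ",
    "11/18/2014 08:07:15.972 v HpcMonitoringClient 1728 4568 Sampling counters ",
    "11/18/2014 08:07:18.180 i HpcMonitoringClient 1728 7608 Check for metric update and configuration changes ",
    "11/18/2014 08:07:54.793 e HpcMonitoringClient 1728 3148 Failed to initialize collector. Retrying in 60 seconds.",
]

levels, problems = summarize(sample)
for message, count in problems.most_common():
    print(count, message)
```

    To run it against a real file, pass an open file object instead of the sample list, for example summarize(open(path)) where path points at one of the text files produced by hpctrace parselog.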
  • Hi again,

    For information, I noticed the commands "clusrun net stop HpcMonitoringClient" and "clusrun net start HpcMonitoringClient", but when I run them on one of the nodes I get the error message below.


    /Regards Andreas

    Tuesday, November 18, 2014 6:19 PM
  • Where are you running clusrun? On the head node?

    You can also open HpcClusterManager, go to "Job Management", and select "Admin Jobs"; you may find more detailed info about the failed task there.

    Wednesday, November 19, 2014 2:45 AM
  • Hi,

    Thanks for the follow-up.

    I ran the commands on the nodes; I guess that's correct.

    Another thing I have noticed is that when a job is running, I'm not able to RDP to the nodes. The connection just times out, as if the nodes are too busy. I'm not sure that is how it should work, but could it affect the remote command here, so that it does not get an answer before it times out?

    Here are screenshots of the failed job. Is this a job that runs every day to get information about the nodes?


    /Regards Andreas


    • Edited by Andreas2012 Wednesday, November 19, 2014 4:00 PM pictures.
    Wednesday, November 19, 2014 3:53 PM
  • For RDP: if you are not running a clusrun job, can you RDP to the compute node?

    By the way, what type are your compute nodes? Are they on-premise machines or Azure nodes?

    And for the failed task, can you check whether PipeProxy.exe exists under %CCP_HOME%\bin? We cannot reproduce this issue in our environment.

    Thursday, November 20, 2014 1:51 AM
  • Hi,

    Sorry, I have not had time to follow up on your answer.

    I checked the system today, and I can't see any jobs under Admin Jobs; they are gone. Is that expected?

    These are on-premise machines.

    The customer ran Windows Update on node 5 and rebooted it, and now this node shows up in the "CPU usage" view. Should we run Windows Update on every node, then, hehe? Maybe some patches have something to do with the problem?


    /Regards Andreas

    Monday, December 1, 2014 7:32 PM
  • It seems that rebooting the compute node can solve this issue. Can you try it on a few compute nodes first?
    • Marked as answer by Andreas2012 Tuesday, December 2, 2014 9:26 PM
    Tuesday, December 2, 2014 1:41 AM
  • The reboot has solved the issue; I'm just wondering whether it will fail again. Closing this thread now...

    Thanks for support :)


    /Regards Andreas

    Tuesday, December 2, 2014 9:26 PM