Update 3 IdleDetector.notidle?

  • Question

  • The release notes for Update 3 at https://technet.microsoft.com/en-us/library/mt595796.aspx reference improved idle logic using a file called IdleDetector.notidle.  I may be missing it, but is the usage of this file documented anywhere?  We have had lots of issues with idle detection, and I was hoping to use this update to improve it.

    Otherwise, I am very open to suggestions.  We have several workstation nodes that are available all the time, but can be in use locally in addition to via HPC.  Even with CPU usage set at 10% or below, if a user is running something locally, HPC will send jobs to it, causing lots of resource contention. 

    Monday, March 7, 2016 9:35 PM

All replies

  • This change to the idle detection logic is opt-in: if you do not use it, behavior is the same as in previous versions.

    If you want a node to stay in the not-idle (occupied) state, place a file named IdleDetector.notidle in the %CCP_HOME%Bin folder.
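
A minimal Python sketch of how the marker file could be created and removed from a script. The function names are hypothetical and not part of HPC Pack; the only details taken from this thread are the file name and the %CCP_HOME%Bin location. It assumes the account running it has write permission on that folder and that the CCP_HOME environment variable is set on the node.

```python
import os
from pathlib import Path

# File name taken from the reply above; everything else here is illustrative.
MARKER_NAME = "IdleDetector.notidle"


def default_bin_dir():
    """Return %CCP_HOME%Bin (assumes CCP_HOME is set on this node)."""
    return Path(os.environ["CCP_HOME"]) / "Bin"


def mark_occupied(bin_dir):
    """Create the empty marker file and return its path."""
    path = Path(bin_dir) / MARKER_NAME
    path.touch(exist_ok=True)  # no error if the marker is already there
    return path


def mark_available(bin_dir):
    """Remove the marker file; return True if a file was actually removed."""
    path = Path(bin_dir) / MARKER_NAME
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False
```

For example, `mark_occupied(default_bin_dir())` before starting local work and `mark_available(default_bin_dir())` afterwards.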

    You could also set the "Period of time" option in the workstation node template to a larger value, such as 3600 seconds, so that the node does not switch to the idle state as often.

    You can monitor the idle state and CPU usage under the "Idle" and "CPU Usage(%)" labels of the "Resource Management" tab in Cluster Manager.

    Alternatively, run "node view <nodename> /detailed" from the command line and check the "Availability" entry.

    If the idle state stays "idle" or the availability stays "Available" after your workstation's CPU usage has been above the threshold set in the workstation node template for a while (about 5 s), or if the scheduler dispatches a job to a node in the "not idle" state, you may have hit an issue or bug.

    P.S. Changes to the workstation node template take some time to come into effect.

    Tuesday, March 8, 2016 11:21 AM
  • Thanks for the reply.  All members of the cluster have been updated to Update 3.  The templates have not been changed in some time, so they should all be effective.

    I must have misunderstood.  I thought the IdleDetector.notidle file was meant to contain some sort of custom rules.  Is it supposed to just be a blank file, and that in turn tells the cluster that the system is occupied and thus not eligible for jobs?

    The machines in question are always available.  They are set to 10% CPU for 600 seconds.  I did not use keyboard/mouse detection as they are often given jobs (such as a MATLAB job) and left to run for several hours.  We are seeing similar results with user workstations, which come online to the cluster after office hours; if someone leaves their desktop running a job, HPC still tries to use it and completely fails.

    In the end, my goal is for the cluster to accurately detect something is running and exempt the machine, and for the cluster in general to more seamlessly deal with this.

    Thanks!

    Tuesday, March 8, 2016 2:48 PM
  • Thanks for your reply and more details.

    The IdleDetector.notidle file is indeed just a blank file, which tells the headnode this workstation is occupied and thus not eligible for jobs. (You could also use this method to help report idleness accurately.)

    When you run a job on the workstation node, you can monitor its CPU usage (e.g. in Task Manager), then check whether the node reports the right result to the headnode, using the methods from my previous reply.

    If the CPU usage on your workstation node stays above the threshold you set (10%) while the headnode shows the node as "idle" or "available", please report it to us. That would help us locate the issue.
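
The check described above can be written down as a small Python sketch. It is only a way to decide whether an observation is worth reporting, not a description of how HPC Pack evaluates idleness internally. The function names are hypothetical, the CPU samples are assumed to be percentages you collected yourself (for instance by noting Task Manager readings), and the reported state is assumed to be the text you read from Cluster Manager or "node view".

```python
def stays_above_threshold(cpu_samples, threshold_pct):
    """True if there is at least one sample and every sample exceeds the threshold."""
    return len(cpu_samples) > 0 and all(s > threshold_pct for s in cpu_samples)


def worth_reporting(cpu_samples, threshold_pct, reported_state):
    """True if the node was busy throughout but was still shown as idle/available."""
    shown_as_free = reported_state.strip().lower() in {"idle", "available"}
    return stays_above_threshold(cpu_samples, threshold_pct) and shown_as_free
```

With a 10% threshold, samples of 75, 80 and 60 together with a reported state of "Available" would count as worth reporting; a single dip below 10% in the samples would not.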

    Wednesday, March 9, 2016 3:28 AM
  • Thanks for that.  I added the IdleDetector.notidle file and found that "node view <nodename> /detailed" does immediately show the node as unavailable, and it wouldn't accept a job.  I'll have to change permissions on that folder, but I believe our users can pretty easily add and remove that file when they're running something manually.
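
One way users could add and remove the file without doing it by hand is to wrap the local job in a small launcher. The following Python sketch is a hypothetical helper, not something shipped with HPC Pack. It assumes Python 3.8 or later, write access to the folder passed as `bin_dir`, and that no one else placed the marker there first, since the marker is always removed when the job ends.

```python
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def node_occupied(bin_dir):
    """Keep IdleDetector.notidle in bin_dir for the duration of the block."""
    marker = Path(bin_dir) / "IdleDetector.notidle"
    marker.touch(exist_ok=True)
    try:
        yield marker
    finally:
        # Remove the marker even if the local job fails or is interrupted.
        marker.unlink(missing_ok=True)


def run_local_job(command, bin_dir):
    """Run a local command (list of arguments) while the node is marked occupied."""
    with node_occupied(bin_dir):
        return subprocess.run(command).returncode
```

A user would call `run_local_job([...], bin_dir)` with their own command line and the node's Bin folder; the return value is the job's exit code.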

    I'll have to keep a closer eye on what's happening with the cluster seemingly ignoring usage.  The cluster manager appears to be correctly seeing the CPU usage, and did refuse to send a job to a node that was at 75% usage with non-HPC work.  Again though, the IdleDetector.notidle file may be the clean and simple way of handling it.

    Wednesday, March 9, 2016 4:22 PM
  • Glad to hear the investigation has helped. If the CPU usage is not stable enough for reliable detection, you can simply use the IdleDetector.notidle file to mark the workstation as occupied.
    Thursday, March 10, 2016 8:13 AM