Multiple OMP jobs all on same node

  • Question

  • We have a mixture of 4, 8 and 24-core nodes on our cluster. Users generally submit OMP jobs with something like

            job submit /scheduler:headnode /numcores:4 ...

    This has the problem of sometimes allocating cores across different nodes - a problem solved in another question in this forum as follows:-

            job submit /scheduler:headnode /numnodes:1 /corespernode:4 ...

    While this does make sure all the cores allocated are on the same node, if this job is scheduled onto one of the 24-core nodes, then the entire node is allocated to the job, and all other 4-core jobs will have to queue for it, which is a big waste of resources.

    So how do you ensure that you have all the cores for an OMP job allocated on the same node - but still allow for example 6x 4-core jobs to run simultaneously on a 24-core node?

    (Running MS HPC 2008 SP2, Version 2.2.1841.0)

    Thanks,

    Wes


    Monday, March 7, 2011 1:33 PM

Answers

  • Right, I think this is how you do it.

    [little rant here] A solution this complicated shouldn't be necessary for what is really fundamental behaviour for any cluster. The scheduler shouldn't just split an OpenMP program up and run it with 'x' cores here and 'y' cores there on different computers, and expect it to somehow work - as if the user had pre-emptively programmed MPI routines into the code in case that happens. You should just be able to say on job submission, "I want 4 cores, all on the same node please", without preventing anyone else from using that node if it's got another 20 idle cores on it. (A scenario we hit almost continually.) [end of rant]

    As it is, if you submit a job that requests 'n' cores, and you want to ensure HPC will never split those 'n' across different nodes, then the following is C# code for an ActivationFilter.

    I'd appreciate it if anyone has time to cast an eye over the code and point out any problems you spot.

    Cheers,

    Wes

    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;
    using System;
    using System.Xml;
    using System.Diagnostics;

    namespace HPCActFilter {

      class Program {
        private static string xmlNameSpace = @"http://schemas.microsoft.com/HPCS2008R2/scheduler/";

        private enum ReturnValue {
          StartJob = 0,
          BlockQueue = 1,
          DoNotStartJob = 2,
          HoldJobUntil = 3,
          FailJob = 4,
          FilterFailure = -1
        }

        static int Main(string[] args) {

          // For what it's worth, the example code in the SDK fails if the number of arguments is != 1. It seems
          // the activation filter is called with 5 arguments these days, so the example code needs updating, otherwise it always fails.

          // For that reason (and others) I'm going to assume the system always calls this with a valid job argument first,
          // and not do any checking on args.Length etc.

          String fileName = args[0];
          XmlDocument doc = new XmlDocument();
          doc.Load(fileName);
          XmlNamespaceManager nsMgr = new XmlNamespaceManager(doc.NameTable);
          nsMgr.AddNamespace("hpc", xmlNameSpace);
          XmlNode jobXML = doc.SelectSingleNode("/hpc:Job", nsMgr);
          XmlAttributeCollection attrCol = jobXML.Attributes;
          int id = Convert.ToInt32(attrCol["Id"].Value);
          IScheduler scheduler = new Scheduler();
          scheduler.Connect("localhost");
          int result = chooseAppropriateNode(scheduler, id, attrCol);
          return result;
        }

        private static Boolean acceptableGroup(IStringCollection jobGroups, IStringCollection nodeGroups) {
          // Note that jobGroups, according to HPC conventions, is "inclusive" - the node must be in ALL groups that the job [template]
          // specifies before we consider running the job on that node.
          Boolean nodeOK = true;                          // Assume OK until proven otherwise.
          for (int i = 0; i < jobGroups.Count; i++) {
            Boolean foundMatch = false;
            for (int j = 0; j < nodeGroups.Count; j++) {
              if (nodeGroups[j].Equals(jobGroups[i])) {   // If the node has the job group [i] we asked for...
                foundMatch = true;                        // ...we've found the match;
                break;                                    // stop looking through this node's groups.
              }
            }
            if (!foundMatch) {                            // No match for jobGroups[i], so fail, and don't try
              nodeOK = false;                             // any more of jobGroups, because we need ALL of them
              break;                                      // to succeed.
            }
          }
          return nodeOK;
        }

        private static int chooseAppropriateNode(IScheduler scheduler, int job_id, XmlAttributeCollection attrs) {
          int result=(int) ReturnValue.StartJob;
          ISchedulerJob ij = scheduler.OpenJob(job_id);
          if  ((ij.MinimumNumberOfCores > 1) &&
              (ij.State==JobState.Queued) &&
              (ij.RequestedNodes.Count==0)) {

            // Try and find an appropriate node for running the job.
            ISchedulerNode chosen_node = null;
            IFilterCollection onlineNodeFilter = scheduler.CreateFilterCollection();
            onlineNodeFilter.Add(FilterOperator.Equal, PropId.Node_State,NodeState.Online);                           // I only want online nodes
            onlineNodeFilter.Add(FilterOperator.Equal, PropId.Node_Reachable,true);
            onlineNodeFilter.Add(FilterOperator.GreaterThanOrEqual, PropId.Node_NumCores,ij.MinimumNumberOfCores);    // where cores >= Minimum

            ISchedulerCollection nodes = scheduler.GetNodeList(onlineNodeFilter, null);     // Get all the online, reachable nodes, with enough cores.
            int noNodes = nodes.Count;                                          // how many...
            int node_no = 0;
            int smallestSoFar = -1;
            while (node_no < noNodes) {                                         // For each node
              ISchedulerNode isn = (ISchedulerNode)nodes[node_no];              // Get the node.
              if (acceptableGroup(ij.NodeGroups,isn.NodeGroups)) {              // Node must be in suitable group for job
                ISchedulerCollection isnCores = isn.GetCores();                 //   How many cores total?
                int freeCores = 0;                                              //   Finding number of free cores needs more detail.
                for (int core_no=0; core_no<isnCores.Count; core_no++)  {
                  if (((ISchedulerCore)isnCores[core_no]).JobId == 0) freeCores++;  // core.JobId==0 means nothing running on that core.
                }
                if (freeCores>=ij.MinimumNumberOfCores) {                     // Now check we haven't already requested it in a previous activation
                  IFilterCollection queuedJobFilter = scheduler.CreateFilterCollection();
                  queuedJobFilter.Add(FilterOperator.Equal,PropId.Job_State,JobState.Queued);    // Look up all the queued jobs
                  ISchedulerCollection queuedJobs = scheduler.GetJobList(queuedJobFilter, null);
                  for (int i=0; i<queuedJobs.Count; i++) {                               // Examine them one by one...
                    ISchedulerJob competingQueuedJob = (ISchedulerJob) queuedJobs[i];
                    if (competingQueuedJob.RequestedNodes.Count>0) {                     // If any of them have requested nodes set
                      if (competingQueuedJob.RequestedNodes.Contains(isn.Name)) {        // And the list of nodes contains the one we're considering
                        freeCores-=competingQueuedJob.MinimumNumberOfCores;              // then subtract the cores that job is requesting from our free number.
                      }
                    }
                  }

                  if (freeCores>=ij.MinimumNumberOfCores) {    // After all that, it still has enough free cores to consider.
                    if ((smallestSoFar==-1) || (freeCores<smallestSoFar)) {   // Then I want to choose the "tightest fit".
                      smallestSoFar=freeCores;                                // (i.e. effectively forcing a best-fit node ordering).
                      chosen_node=isn;                                        // Aim for best resource usage.
                    }
                  }
                }
              }
              node_no++;
            }
            if (chosen_node!=null) {   // If we actually chose a suitable node to run on

              // I want to set "requestedNodes" for the job. I'd prefer to do it with Microsoft SDK, but I can't make it work.
              // I think you're meant to call configureJob, set the property, then commit. Doing this makes the property change
              // that you can see in Cluster Manager, but the job then runs on the node(s) it first thought of anyway, and
              // fails with a "the node became unreachable" error.
              //
              // HOWEVER, job modify, from the commandline, seems to do the trick. Not such a pretty solution, but anyway.

              Process p = new Process();
              p.StartInfo.UseShellExecute = false;
              p.StartInfo.RedirectStandardOutput = true;
              p.StartInfo.FileName = "job.exe";
              p.StartInfo.Arguments = "modify " + job_id + " /scheduler:localhost /requestedNodes:"+chosen_node.Name;
              p.Start();
              p.WaitForExit();

              // If you allow the job to start, it will happen with the current proposed resources - then get cancelled, then
              // restart with the newly modified resources. Better to hold it here - activation filter will be called again the next time
              // the queue is evaluated. HOWEVER - later jobs in the queue might want the resources we've just allocated
              // by the job modify - which is why we need all the code above to work out what resources are available,
              // subtracting "requestednode" information for any other jobs that we've manually requested nodes with.
            }
            result = (int)ReturnValue.DoNotStartJob;  // If chosen_node was null, then still don't start the job. Just keep trying.
                                                      // Eventually, requestedNodes will be >0, and none of this section will get called.
          }
          return result;
        }
      }
    }
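
    For completeness: the compiled .exe then needs registering as the cluster-wide activation filter. From memory that's done with cluscfg, something along these lines (the path is just an example):

            cluscfg setparams ActivationFilterProgram="C:\Filters\HPCActFilter.exe"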

    • Marked as answer by WesHinsley Thursday, July 12, 2012 1:34 PM
    Thursday, July 12, 2012 1:26 PM

All replies

  • Hi,

    For various reasons, we have upgraded to MS HPC 2008 R2, but the functionality seems the same as above. Is it really the case that no-one has ever wanted to run more than a single OpenMP job on one node? It seems quite a basic requirement to us, if you have public nodes with plenty of cores, and plenty of users with different job types.

    Many thanks in advance if anyone can help us.

    Wes

    Tuesday, August 2, 2011 9:58 AM
  • Hi,

    I'm trying to use an ActivationFilter to solve this. The basic algorithm is as follows:-

    If the job has "RequestedNodes" set to something, then let it go ahead, return StartJob.
    Else, look up min number of cores requested for job
    If it's set to >1, then continue the algorithm, otherwise just return StartJob
    Query head node to find any node with enough idle cores. (Query core-by-core - works fine).
    If there is no candidate node, then don't do anything else, just return DontStartJob
    If there is a candidate node, then...
    Put the job in Configuring state
    Add candidate node to requestednodes (which previously was empty)
    Commit the change to the job, and return DontStartJob.
    (Next time the activation filter is called, RequestedNodes will be set, and the first line of the algorithm will let it run)
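
    In code, the intended handshake is roughly this (a sketch only - findCandidateNode is a placeholder for the core-by-core query of step 4):

    ISchedulerJob ij = scheduler.OpenJob(jobId);
    if (ij.RequestedNodes.Count > 0 || ij.MinimumNumberOfCores <= 1)
      return (int) ReturnValue.StartJob;                     // steps 1-3

    string candidate = findCandidateNode(scheduler, ij);     // step 4 (placeholder)
    if (candidate == null)
      return (int) ReturnValue.DoNotStartJob;                // step 5

    scheduler.ConfigureJob(jobId);                           // step 6: pin the job to the node...
    ij.RequestedNodes.Add(candidate);
    ij.Commit();
    return (int) ReturnValue.DoNotStartJob;                  // ...and keep it queued for the next pass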

    My progress: all the bits of the algorithm seem to work OK, but having run the filter, the job fails with "The job has encountered an error. Job failed to start on some nodes or some nodes became unreachable". (And there's nothing wrong with the node - if I just job submit with that requested node specified directly, all works fine.) In the resource selection page in the Job Manager GUI, the requested node I set in the algorithm is ticked, so I can see the change is being applied.

    Is there anything I'm missing? The C# code I'm using is below - the jobId (int) has been correctly obtained, and my IScheduler object is set up fine.

    scheduler.ConfigureJob(jobId);
    ISchedulerJob ij = scheduler.OpenJob(jobId); // ij is the job being configured
    ij.RequestedNodes.Add("fi--didemrc15");   // Or any node I know is free at a given time. Can see in Job GUI this setting is applied.
    ij.Commit();
    return (int) ReturnValue.DoNotStartJob;  // As in the sample code for an activation filter. Value 2.

    Any ideas greatly appreciated. All clients, nodes and the headnode are currently on 2008 R2 3.2.3716.0.

    Wes

    Friday, July 6, 2012 2:19 PM
  • Just to clarify some more: when I look at the properties of the failed task in the Job GUI, although RequestedNodes is set to fi--didemrc15, the activity log shows the job was actually started (and immediately failed) on fi--didemrc13 - the first available node on the list at that moment, so I guess the node the scheduler would originally have chosen.
    Friday, July 6, 2012 2:21 PM
  • Attempting to change a Job via API calls while it is in Activation (or Submission) is unsupported, and the scheduling behavior is undefined.

    There are a few ways to work around this but in all cases Activation (or Submission) must be allowed to complete before any API-based changes to the job are attempted.

    One approach using an in-proc Activation filter:

    1: If the job is not yet running, the return code can instruct the scheduler to block it. Running jobs have more limited options.

    2: After the Activation is completed, have a thread wake up and make the desired API calls.

    This is not ideal for your use case but perhaps the above can give you ideas on better workarounds.
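
    A minimal sketch of idea 2, reusing the SDK calls seen elsewhere in this thread (the helper name and the fixed delay are illustrative only, and this relies on the filter being in-proc - an .exe filter would exit before the thread ran):

    using System.Threading;
    using Microsoft.Hpc.Scheduler;

    static void ModifyAfterActivation(int jobId, string nodeName) {
      ThreadPool.QueueUserWorkItem(_ => {
        Thread.Sleep(5000);                   // crude: give the current scheduling pass time to finish
        IScheduler s = new Scheduler();
        s.Connect("localhost");
        ISchedulerJob job = s.OpenJob(jobId);
        job.RequestedNodes.Add(nodeName);     // now an ordinary, out-of-band API call
        job.Commit();
      });
      // ...then have the filter itself return DoNotStartJob.
    }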

    d

    Monday, July 23, 2012 8:09 PM
  • Hi Daryl,

    Yes - I gathered anecdotally that API-based changes to some of the job properties during the activation filter do not behave as I expected. It would be helpful if the documentation and example code for activation filters could be updated to state this clearly. The examples go as far as creating the IScheduler object, looking up the ISchedulerJob, and modifying the "set hold until" property, so there is no obvious reason in the documentation why you shouldn't also call other ISchedulerJob methods, or modify other members that MSDN states are read/write for the appropriate job state.

    Hence, the code above doesn't make any API calls from within C# to modify the job - it just calls "job modify" from the command line, exactly as if the original user had done it themselves (which they're surely allowed to do while the job is queuing, since they don't know whether the activation filter is running...), and then returns a "don't start" as the filter result. The modifications made by "job modify" seem always to be applied before the next time the activation filter is called on that job; but even if not, the script will just modify them again and hold the job. Once the "job modify" has been processed, the job passes through the filter without being modified, and the filter returns the "start job" value.

    I suspect this is roughly similar to the solution you outlined above - just that I'm letting the scheduler respond to "job modify", rather than having a process woken up to do the modify via API calls at a "legal" moment.

    My workaround for adding "/numcores:min-max" support (since above code assumes min==max) is even more elaborate, but that's probably for another thread!

    Thanks,

    Wes

    Tuesday, July 24, 2012 3:01 PM
  • One other idea for anyone reading this! There are lots of examples out there that use omp_get_max_threads() to decide how many threads to start, but that returns the number of cores on the machine, not the number of cores HPC may have allocated you. (This comes up especially often with the code above.) If you want exactly one OpenMP thread per allocated core, then you want some code along these lines, looking up the environment variable CCP_NUMCPUS:-

    #include <stdlib.h>  /* getenv, strtol */
    #include <omp.h>     /* omp_get_max_threads */

    int get_NCores() {
      char *val = getenv("CCP_NUMCPUS");   /* Set by HPC to the number of cores allocated to the job */
      char *endchar;
      int cores = omp_get_max_threads();   /* Fallback: every core on the machine */
      if (val != NULL) cores = (int) strtol(val, &endchar, 10);
      return cores;
    }
    ...and somewhere near the top of your main code:-

    omp_set_num_threads(get_NCores());

    Tuesday, July 24, 2012 3:21 PM
  • The CLI job.exe and consumers of IScheduler end up calling the same Scheduler API. During a filtration event, shelling out to "job modify" and calling IScheduler methods do not differ in fundamentals. Timings will differ, of course.

    Thus, results that are "undefined" for IScheduler calls are still "undefined" for calls made via "job.exe". "Undefined" does not always mean "will fail", and I will keep your approach in mind for when other customers have similar needs.

    The Samples do include examples of such reentrant calls... and your point is well taken.

    d

    Tuesday, July 24, 2012 7:12 PM
  • Thanks Daryl,

    I'll keep an eye on the behaviour of this workaround. I've had no failures with the "job modify" call yet, whereas I couldn't get a single success by calling the methods from C# - certainly something different was happening internally. Perhaps I was just using the API incorrectly, or perhaps the timings were different enough that the effects of the job modify fell outside the activation filter.

    So, ignoring my script for a moment: if a user/client somewhere calls "job modify" on their queuing job, is that behaviour still, technically, undefined? If so, then I believe what we're saying is that "undefined behaviour" means there is a chance the job might have just moved out of the queuing state into the running (or another) state, so the modification might get ignored.

    If that's the limit of how "undefined" the behaviour is, then we can easily live with that, especially as the script keeps the job in the queuing state after modifying it. 

    W.

    Wednesday, July 25, 2012 10:51 AM
  • The filtration events (which are supposed to be very short) occur within a scheduling pass (which is generally very quick).

    Scheduling passes assume full authority over job state, and there are protections around critical data.

    That having been said, if an API call (again, no real difference between job.exe and IScheduler, since they both call the same API) attempts to modify data that is actually used for scheduling (i.e. not progress text, etc.), the current scheduling pass is likely not to see the change. Subsequent passes would, in theory.

    d

    Wednesday, July 25, 2012 11:33 PM
  • Is there a recommended way of accomplishing this yet? Or does the 2012 HPC version add support? If not, does your filter continue to work, Wes?

    Thanks

    Monday, December 17, 2012 6:43 PM
  • It appears from the docs that you can do job submit /singlenode:true in HPC 2012 - which (finally) seems to address this problem. I haven't tried it myself, but it sounds like a much better solution than my filter - although the filter is still running nicely on our two clusters.
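
    So presumably the original submission would become something like this (untested - just combining the new switch with the job submit line from the top of the thread):

            job submit /scheduler:headnode /numcores:4 /singlenode:true ...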

    Cheers,

    W.



    • Edited by WesHinsley Sunday, December 23, 2012 8:22 PM
    Sunday, December 23, 2012 7:59 PM