ActivationFilter Question

  • Question

  • Hi,

    I have a custom activation filter that is behaving unexpectedly, and I want to verify my understanding of how it should work.

    I have a head node that also acts as a compute node with 4 processors; there are no other nodes on this test machine.

    I submit a 1-CPU exclusive job that the activation filter does not permit. The job queue shows the job status as "Queued" and Pending, with the message "job xx activation is being held by the activation filter", which is correct.

    Now I submit another 1-CPU exclusive job that the activation filter will permit, and the job queue shows the message "not enough available processors". Note that the first job is still queued and no jobs are running, so this job should be passed to the activation filter for consideration.

    However, the CCS scheduler does not even query the activation filter for this second job.
    If I kill the first job that is in the "Queued" state because the activation filter will not let it start, then the second job is passed to activation filter, and then it starts.

    I assumed that when the first job was held by the activation filter, it would stay in the queue, and the scheduler would go through the queue until it found a job that it could run, and run it if the activation filter permitted it.

    What appears to be happening is that the scheduler is holding the nodes for the job that the activation filter would not start, so it does not consider other jobs for the node; it just sits waiting and keeps asking the activation filter whether that job can start.

    The behavior I am seeing doesn't seem right, because one job could then block many other jobs that could start. Is there a reason the scheduler would do this?

    backfilllookahead is at its default of -1. The Compute Cluster Manager shows that the node is available and 4 CPUs are ready; no other jobs are running during this testing. I increased the logging level with cluscfg, but nothing interesting showed up.


    Thanks for any insight,

    -Cham


    UPDATE: If I uncheck the "exclusive" checkbox for all the jobs in the Queued state, then it seems to work right: other jobs in the queue are then passed to the activation filter, and they run through as expected. This seems to confirm that the CCS scheduler is holding onto the compute resources for a queued exclusive job that is being held by the activation filter.

    I added some additional nodes to the cluster and saw the same thing: if I have 2 nodes and two jobs held by the activation filter, all other jobs are blocked and are not given to the activation filter for consideration.

    I made a very simple filter to test this: if the job name is "hold" it holds the job; otherwise the job runs.


    using System;
    using System.Xml;

    public class Simple
    {
        // Exit code 0 lets the job start; exit code 1 holds it.
        public static int Main(string[] args)
        {
            // args[0] is the job XML that is loaded and inspected.
            XmlDocument jobXml = new XmlDocument();
            jobXml.Load(args[0]);
            string jobName = ExtractJobName(jobXml);

            // Hold any job named "hold"; let every other job run.
            return jobName.Equals("hold") ? 1 : 0;
        }

        // Returns the value of the Name attribute on the Job element.
        public static string ExtractJobName(XmlDocument jobXml)
        {
            XmlNamespaceManager mgr = new XmlNamespaceManager(jobXml.NameTable);
            mgr.AddNamespace("ab", "http://www.microsoft.com/ComputeCluster/");

            // Note: the "ab" prefix is registered above but not used in this query.
            XmlNodeList xmlNodeList = jobXml.SelectNodes("//Job/@Name", mgr);
            XmlAttribute nameAttr = (XmlAttribute)xmlNodeList.Item(0);

            return nameAttr.Value;
        }
    }
    Thursday, February 7, 2008 11:56 PM

Answers

  • "However, the CCS scheduler does not even query the activation filter for this second job.
    If I kill the first job that is in the "Queued" state because the activation filter will not let it start, then the second job is passed to activation filter, and then it starts."

    This is actually expected behavior: the queue should be blocked by the job that has not activated. Activation filters were designed for license-scheduling types of behavior, and the assumption is that a job stopped at activation will start again soon. It is still at the head of the queue, and from the scheduler's point of view it could be restarted at any time (as soon as a license becomes available to let it pass the activation filter).

    "If I uncheck the "exclusive" checkbox for all the jobs in the Queued state, then it seems to work right: other jobs in the queue are then passed to the activation filter, and they run through as expected."

    While you say this is "right", from our perspective this actually looks like a bug in the backfill algorithm. :)
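
A minimal sketch of the two behaviors discussed here, as a toy model only. It is not the CCS scheduler; `QueueModel`, `RunPass`, and `releaseOnHold` are hypothetical names, and the model assumes each exclusive job takes one whole node and that resources are assigned before the filter is consulted, as the thread describes.

```csharp
using System;
using System.Collections.Generic;

// Toy model of one scheduling pass; not the real CCS scheduler.
public static class QueueModel
{
    // Returns the names of the jobs that start in one pass over the queue.
    // nodes:         number of free nodes; each exclusive job takes one node.
    // filterAllows:  stands in for the activation filter (true = may start).
    // releaseOnHold: false = a held job keeps its node (behavior reported above);
    //                true  = a held job gives its node back (behavior the poster expected).
    public static List<string> RunPass(IEnumerable<string> queue, int nodes,
                                       Func<string, bool> filterAllows, bool releaseOnHold)
    {
        var started = new List<string>();
        int free = nodes;
        foreach (string job in queue)
        {
            if (free == 0) break;   // "not enough available processors"
            free--;                 // resources are assigned before the filter runs
            if (filterAllows(job))
            {
                started.Add(job);
            }
            else if (releaseOnHold)
            {
                free++;             // hand the node back and keep scanning the queue
            }
            // otherwise the held job keeps its node reserved
        }
        return started;
    }

    public static void Main()
    {
        var queue = new[] { "hold", "run1" };
        Func<string, bool> filter = name => name != "hold";
        Console.WriteLine(RunPass(queue, 1, filter, false).Count); // prints 0
        Console.WriteLine(RunPass(queue, 1, filter, true).Count);  // prints 1
    }
}
```

With one node, the held job at the head of the queue prevents "run1" from ever reaching the filter in the first call, which matches what the original post reports.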

    Friday, February 15, 2008 11:44 PM
    Moderator

All replies

  • What is your Activation Filter doing that you require other jobs to move past jobs blocked by activation? I'd like to understand your goal! :)

    Friday, February 15, 2008 11:45 PM
    Moderator
  • I think the major issue is that CCS checks whether a job can start only after all resources are assigned. If the reason the job can't start is long-lived (which in the CAE space it often is), you end up with nodes idled for a long time, tied up by jobs in the Q state because of the ActivationFilter. Other jobs in the queue that could run never get a chance, so the cluster sits idle.

    It would seem better to do this check before the resources are actually assigned.

    It appears that when this feature was developed, it was assumed that a missing license would be a very short-lived condition and that only one application would be running on the cluster.

    An alternative solution: if the job sits in the Q state for X scheduler iterations because the ActivationFilter keeps denying it, release the resources that CCS has assigned to the job and put it back on the queue, so that another job can be considered and things are not tied up indefinitely.

    Finally, if the status quo is kept, I would recommend using a state other than Q (pending, or something similar) when the activation filter is holding a job and tying up allocated resources, to make it obvious to the user why all nodes may be idle with all jobs in the Q state and no jobs running.
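
The "release after X denials" idea above could be sketched roughly as follows. This is only an illustration of the bookkeeping: `HoldTracker`, `RecordDenial`, and `RecordStart` are hypothetical names, not part of CCS, and the actual release of resources is left to whatever calls this class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts consecutive activation-filter denials per job.
public sealed class HoldTracker
{
    private readonly int maxDenials;
    private readonly Dictionary<int, int> denials = new Dictionary<int, int>();

    public HoldTracker(int maxDenials)
    {
        if (maxDenials < 1) throw new ArgumentOutOfRangeException("maxDenials");
        this.maxDenials = maxDenials;
    }

    // Records one denial for jobId. Returns true once the job has been denied
    // maxDenials times in a row, meaning its resources should be released and
    // the job put back on the queue. The counter is reset at that point.
    public bool RecordDenial(int jobId)
    {
        int count;
        denials.TryGetValue(jobId, out count); // count stays 0 if the job is not tracked yet
        count++;
        if (count >= maxDenials)
        {
            denials.Remove(jobId);
            return true;
        }
        denials[jobId] = count;
        return false;
    }

    // Call when the job finally starts, so stale counts do not accumulate.
    public void RecordStart(int jobId)
    {
        denials.Remove(jobId);
    }
}
```

For example, with `new HoldTracker(3)`, the first two calls to `RecordDenial(42)` return false and the third returns true.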

    Here is the use case for many CAE deployments in case you are interested in what I am doing:

    I have four different codes, for example ABAQUS, RADIOSS, NASTRAN, and LS-DYNA. Each application may run for 1-2 days, and may take 1 day to obtain a license (quite normal; the cost of a license means the number available is very limited). If there are 4 jobs of each application in the queue with four nodes, and the first four are ABAQUS jobs that CAN'T obtain a license, then in the current scenario the other 12 jobs in the queue, which may be able to run, may have to wait 1 or 2 days while the cluster sits idle waiting to run the first four jobs that don't have ABAQUS licenses.

    I would like the CCS scheduler to say, "I can't run the ABAQUS jobs right now because of license problems, so let me put those back on the queue and try to run some other jobs, like RADIOSS, that MAY (or may not) have licenses." This is the way PBS and Platform handle this scenario under FIFO, by the way.

    This is a common scenario in the CAE space; hopefully you can tweak CCS or HPC Server 2008 to support this use case.

    Saturday, February 16, 2008 4:11 PM
  • Cham,

    Thanks for the feedback; what you say makes a lot of sense. I've filed this as a feature request for HPC 2008; we will take it into consideration for inclusion in a future release.

    Thanks!
    Josh

    Friday, March 28, 2008 11:11 PM
    Moderator
  • Cham,

    This may be a little old for you, but in case you are still dealing with this:

    I had the same problem with my own activation filter (for license usage). I realized that the filter blocks the queue even when other jobs could run.

    So my workaround was to have the activation filter not cancel the job but instead move it to the "configure" state; this unblocks the queue for the rest of the jobs. I then have another program running in the background that monitors all the jobs that are in the "configure" state because of licenses and tries to resubmit each one after a period of time that you can set. If the activation filter again detects that there are not enough licenses, it puts the job back in the "configure" state, and the cycle repeats.

    As other users say, you run the risk that this job never runs because the resources are taken by other jobs, but so far this implementation works for me.

    I can "emulate" the behavior of the next version by deciding whether to move the job to the configure state for a period of time, cancel it, or just block the queue.

    The next version of HPC will fully support this capability, but for now my activation filter is working just fine. I have another problem when trying to resubmit the job (see the thread "Activation filter: how to get rid of Remember this password? (Y/N)").
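
The retry timing in a background monitor like the one described above might look roughly like this. It is a sketch under assumptions: `ParkedJob` and `ResubmitPlanner` are hypothetical names, and the calls that actually move a job to the "configure" state or resubmit it are deliberately omitted because the thread does not show them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical record of a job parked in the "configure" state for lack of a license.
public sealed class ParkedJob
{
    public int Id;
    public DateTime ParkedAt;
}

public static class ResubmitPlanner
{
    // Returns the ids of parked jobs whose retry interval has elapsed.
    // The caller is expected to resubmit those jobs; if the activation filter
    // parks one again, the caller records a fresh ParkedAt and the cycle repeats.
    public static List<int> DueForResubmit(IEnumerable<ParkedJob> parked,
                                           DateTime now, TimeSpan retryInterval)
    {
        var due = new List<int>();
        foreach (ParkedJob job in parked)
        {
            if (now - job.ParkedAt >= retryInterval)
            {
                due.Add(job.Id);
            }
        }
        return due;
    }
}
```

Keeping the timing decision separate from the scheduler calls makes the retry period easy to adjust; it does not address the starvation risk that comes with this workaround.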

    Regards

     Larry

    Thursday, May 6, 2010 1:15 PM