Job vs Task resource requirement specification

  • Question

  • I am having trouble understanding Job vs Task resource requirement specification (minimum number of cores, etc.).  A Job, in and of itself, seems to use no resources (except for management overhead).  It is only its Tasks that use resources.  So I don't understand why one can specify the minimum number of cores for a Job, for example.  That seems meaningless, unless it is a default for the tasks within the Job (which seems like a poor design for an API).  Is resource allocation only done per Job, not per Task?
    Sunday, May 23, 2010 10:20 PM

All replies

  • Hi Ryan,

    In order to operate in an environment where multiple jobs are contending for the same set of resources, the Job Scheduler needs to know the total amount of resources that a job would require during its lifetime.  Resources are allocated to a job as a whole, and then are sub-divided between the individual tasks in the job.  Knowing each job's resource requirement allows the scheduler to start multiple jobs on the cluster at once, assuming that all their requirements can be satisfied.  Each job's resource requirement also determines how many of its tasks can be started in parallel while the job is running.

    By default, you do not need to specify the job's resource requirement, since it is automatically calculated from the resource requirements of individual tasks in the job.  For example, if your job has 3 basic tasks, each requiring a minimum of 2 cores and a maximum of 3, the scheduler will determine that the total number of cores required by your job is between 2 and 9.  This is because at least two cores are needed to run one of the job's tasks at its minimum requirement, but no more than 3x3=9 cores are needed to run all the job's tasks simultaneously at their maximum.  Before starting this job, the scheduler will check how many cores are currently idle on the cluster.  If only 2 cores are idle, the scheduler will allocate them to the job and run the tasks one at a time, giving the 2 cores to each task in turn.  However, if 50 cores are idle, the scheduler will allocate only 9 of them to the job, and then start all tasks in parallel, giving each task its maximum of 3 cores.  The remaining 41 cores will remain idle and can be allocated to subsequent jobs.
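
    For illustration, here is a rough sketch of that example using the C# scheduler API (the head node name and the command line are placeholders, and auto-calculation is simply left at its defaults):

        using Microsoft.Hpc.Scheduler;

        class AutoCalcExample
        {
            static void Main()
            {
                // Connect to the cluster (head node name is a placeholder).
                IScheduler scheduler = new Scheduler();
                scheduler.Connect("MyHeadNode");

                ISchedulerJob job = scheduler.CreateJob();
                // AutoCalculateMin/AutoCalculateMax stay at their defaults (true), so the
                // job's requirement is derived from its tasks: min 2, max 3 x 3 = 9 cores.

                for (int i = 0; i < 3; i++)
                {
                    ISchedulerTask task = job.CreateTask();
                    task.CommandLine = "MyWorker.exe";   // placeholder command
                    task.MinimumNumberOfCores = 2;
                    task.MaximumNumberOfCores = 3;
                    job.AddTask(task);
                }

                // Submit with the caller's credentials (you are prompted if null).
                scheduler.SubmitJob(job, null, null);
            }
        }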

    In most cases, this auto-calculation works well and you don't need to explicitly specify the job's resource requirements.  In certain situations, however, you may wish or be required to do so.  For example, when you mark a job as RunUntilCanceled, the scheduler cannot auto-calculate the total resource requirements for a job, and you need to specify them manually.  Alternatively, if your job has many tasks (or a large parametric task), you may wish to voluntarily limit the maximum amount of resources allocated to your job, in order to allow other jobs to run on the cluster.

    In summary, resources are allocated to a job as a whole, and then are divided between the job's tasks.  In most cases, you only need to specify each task's resource requirement, and the scheduler will use this information to auto-calculate the job's overall resource requirement.  However, if you would like to specify the job's minimum and maximum yourself, you are free to do so.
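
    If you do want to set the job-level requirement yourself (for example, for a RunUntilCanceled job, where auto-calculation is not available), the sketch below shows one way to do it; treat it as an outline rather than production code:

        using Microsoft.Hpc.Scheduler;

        class ExplicitJobRequirementExample
        {
            static void Main()
            {
                IScheduler scheduler = new Scheduler();
                scheduler.Connect("MyHeadNode");          // placeholder head node name

                ISchedulerJob job = scheduler.CreateJob();
                job.RunUntilCanceled = true;              // job holds its resources until canceled

                // Turn off auto-calculation and state the job's requirement directly.
                job.AutoCalculateMin = false;
                job.AutoCalculateMax = false;
                job.MinimumNumberOfCores = 4;
                job.MaximumNumberOfCores = 16;

                ISchedulerTask task = job.CreateTask();
                task.CommandLine = "MyService.exe";       // placeholder command
                job.AddTask(task);

                scheduler.SubmitJob(job, null, null);
            }
        }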

    Best regards,
    Leonid.

     

    Monday, May 24, 2010 6:55 PM
  • Thanks for the help.  But I'm not sure I fully understand. 

    Say I have a job that has two tasks: Task1 uses 1 core and takes an hour, and Task2 requires Task1 be done first, and it uses 8 cores, and also takes an hour.  If I understand the automatic job-level resource calculation, it will compute 1 to 8 cores for the job.  If there are 8 cores available on the cluster when the job is scheduled, during the first hour when Task 1 is running, how many cores are allocated to the job?

    Tuesday, May 25, 2010 12:33 AM
  • This is not quite correct.  If Task1 requires 1 core (that is, it has both a minimum and a maximum requirement of 1 core), and Task2 similarly requires 8 cores, and Task2 is dependent on Task1, then the scheduler will compute both the minimum and the maximum for the job as 8 cores.  That is because the job needs 8 cores to ensure that Task2 can run.

    Until 8 cores are available on the cluster, the job will remain queued.  Once 8 cores become available, the scheduler will start Task1 on one core, keeping the other 7 idle.  Then, when Task2 runs, it will use all 8 cores.  While this leads to sub-optimal utilization of resources, it ensures that a job can continue to run uninterrupted once it starts.
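
    Here is a quick sketch of that scenario with the C# API (the commands and head node name are placeholders); with auto-calculation on, the job's minimum and maximum both come out to 8 cores:

        using System.Collections.Generic;
        using Microsoft.Hpc.Scheduler;

        class DependencyExample
        {
            static void Main()
            {
                IScheduler scheduler = new Scheduler();
                scheduler.Connect("MyHeadNode");          // placeholder head node name

                ISchedulerJob job = scheduler.CreateJob();

                ISchedulerTask task1 = job.CreateTask();
                task1.Name = "Task1";
                task1.CommandLine = "Stage1.exe";         // placeholder: 1-core, 1-hour task
                task1.MinimumNumberOfCores = 1;
                task1.MaximumNumberOfCores = 1;
                job.AddTask(task1);

                ISchedulerTask task2 = job.CreateTask();
                task2.Name = "Task2";
                task2.CommandLine = "Stage2.exe";         // placeholder: 8-core, 1-hour task
                task2.MinimumNumberOfCores = 8;
                task2.MaximumNumberOfCores = 8;
                task2.DependsOn = new List<string> { "Task1" };  // Task2 starts only after Task1 finishes
                job.AddTask(task2);

                // The job stays queued until 8 cores are free; Task1 then runs on one of
                // them while the other 7 sit idle, and Task2 uses all 8 afterwards.
                scheduler.SubmitJob(job, null, null);
            }
        }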

    Tuesday, May 25, 2010 2:44 AM
  • Thanks for the continued help. I'm still having trouble with scheduling stuff.  I'm creating and submitting jobs using the C# API, and those docs don't explain this stuff well enough for me.  Is there a better source of information about resources and scheduling?

    If not, here are some more questions.  I'm creating a video transcoding service.  Each job will transcode a video from one format to another.  In a simplified model, the job consists of CopyInSource (copies source file into local file system), EncodeAudio, EncodeVideo, Mux (multiplex), and CopyOutResult.  The Copy* tasks are just robocopy commands.  EncodeAudio takes maybe 20 minutes and only uses 1 core.  EncodeVideo takes 3 hours and can use all the cores on a node.  The Mux task must be run on one particular node (not the head node) due to software licensing.  I prefer that the Copy* tasks run on the head node because the head node has the cache directory physically attached and I assume the copies will be more efficient that way.

    With that setup, I have these questions:

    1) If all jobs require the Mux node, and the Mux node can only run one mux at a time, and scheduling works as you described above, do all jobs become entirely serial?  Because by the philosophy you described above, the scheduler must reserve all resources for the Job. 

    1a) If the mux node can run more than one mux at a time, but only has 4 cores, can I tell HPC to allow more than 4 mux tasks to run there at once?

    2) I'd like to only have a couple Copy commands going at once on the head node (because I care more about the duration/latency of jobs than total throughput).  Is there a way to tell HPC to only run 2 of the Copy* commands on the head node at once?

    3) Is there a way to say that a Task (not Job) requires an entire node, regardless of number of sockets/cores?  I see that IsExclusive is for tasks in the same job, per the API docs.

    Thanks,

    Ryan

     

    Wednesday, May 26, 2010 12:16 AM
  • Hi Ryan,

    Aside from the API docs, the help files that come with Windows HPC Server 2008 provide some explanation of job scheduling.  Beyond that, you are always welcome to ask questions on this forum.

    To answer your specific questions above:

    1) Yes, all jobs will be entirely serial.  No job will start until all its required resources are allocated to it - which means that the resources may not be used by any other currently running job.  If all jobs require the Mux node exclusively, the jobs will run serially on the cluster.

    1a) Not directly.  No two scheduler tasks may share the same core.  However, you can start a single task on all four cores, and have that task start up as many child processes as you would like.  Each child process will run on one of the four cores allocated to your scheduler task.
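
    As a rough illustration, the task's own executable could fan out the child processes itself.  The sketch below assumes the CCP_NUMCPUS environment variable (set by the scheduler to the number of cores allocated to the task) and a placeholder MuxWorker.exe; you can start more children than cores if you wish, and they will simply share the allocated cores:

        using System;
        using System.Diagnostics;

        class MuxFanOut
        {
            static void Main()
            {
                // Number of cores the scheduler allocated to this task (assumed variable name).
                int cores;
                if (!int.TryParse(Environment.GetEnvironmentVariable("CCP_NUMCPUS"), out cores))
                    cores = 1;

                var children = new Process[cores];
                for (int i = 0; i < cores; i++)
                    children[i] = Process.Start("MuxWorker.exe", i.ToString());  // placeholder worker

                foreach (var child in children)
                    child.WaitForExit();
            }
        }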

    2) Again, not directly, especially if the Copy commands belong to two different jobs.   The Job Scheduler currently does not provide any mechanism for cross-coordinating between tasks belonging to different jobs.  However, if your head node contains, say, 8 cores, you may want to set the resource requirement for each Copy task to 3 cores.  This will ensure that the scheduler will never be able to run more than 2 copy tasks on the head node.  Keep in mind, however, that artificially raising the minimum resource requirement for a task will correspondingly raise the minimum resource requirement for a job and will reduce cluster utilization.
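
    A sketch of that workaround is below (the head node name and robocopy arguments are placeholders); RequiredNodes pins the task to a specific node, and the 3-core minimum means an 8-core head node can hold at most two such tasks at once:

        using System.Collections.Generic;
        using Microsoft.Hpc.Scheduler;

        class CopyTaskExample
        {
            static void Main()
            {
                IScheduler scheduler = new Scheduler();
                scheduler.Connect("HEADNODE");            // placeholder head node name

                ISchedulerJob job = scheduler.CreateJob();

                ISchedulerTask copyIn = job.CreateTask();
                copyIn.CommandLine = @"robocopy \\share\source D:\cache sample.mp4";  // placeholder paths
                copyIn.MinimumNumberOfCores = 3;          // 3 of 8 cores => at most 2 copies at a time
                copyIn.MaximumNumberOfCores = 3;
                copyIn.RequiredNodes = new List<string> { "HEADNODE" };  // run only on the head node
                job.AddTask(copyIn);

                scheduler.SubmitJob(job, null, null);
            }
        }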

    3) Generally, no.  A node-based task (a task with the UnitType 'Node') must belong to a node-based job.  Likewise, an Exclusive task must belong to an Exclusive job (although tasks in an Exclusive job do not have to be exclusive with respect to one another). 

    The one exception to this rule is the pair of Node Preparation and Node Release tasks, introduced in HPC Server 2008 R2 Beta.  These tasks run exclusively on a node with respect to other tasks in the job, even if the job itself is not exclusive.  Multiple Node Preparation and Node Release tasks from different jobs can run on the same node concurrently.  A Node Preparation task will run immediately after the node is first allocated to the job, and can be used for work such as copying files onto the node.  Conversely, a Node Release task will run right before the job is ready to give up all resources on a node (typically, this happens when the job ends, but there are other conditions, such as preemption, that may trigger this).  If you are running R2 Beta 2, you may consider using Node Preparation and Node Release tasks to perform file copying.
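
    For example, a rough sketch of a job that copies files in and out with Node Preparation and Node Release tasks might look like the following (this assumes R2's TaskType.NodePrep and TaskType.NodeRelease task types; the head node name, paths, and commands are placeholders):

        using Microsoft.Hpc.Scheduler;
        using Microsoft.Hpc.Scheduler.Properties;

        class NodePrepExample
        {
            static void Main()
            {
                IScheduler scheduler = new Scheduler();
                scheduler.Connect("MyHeadNode");          // placeholder head node name

                ISchedulerJob job = scheduler.CreateJob();

                // Node Preparation task: copies the source onto each node as soon as
                // that node is allocated to the job.
                ISchedulerTask prep = job.CreateTask();
                prep.Type = TaskType.NodePrep;
                prep.CommandLine = @"robocopy \\share\source D:\work sample.mp4";      // placeholder paths
                job.AddTask(prep);

                // Node Release task: copies results back just before the job gives up the node.
                ISchedulerTask release = job.CreateTask();
                release.Type = TaskType.NodeRelease;
                release.CommandLine = @"robocopy D:\work \\share\results sample.out";  // placeholder paths
                job.AddTask(release);

                ISchedulerTask work = job.CreateTask();
                work.CommandLine = "EncodeVideo.exe";     // placeholder command
                job.AddTask(work);

                scheduler.SubmitJob(job, null, null);
            }
        }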

    Best regards,
    Leonid.

    Wednesday, May 26, 2010 3:59 AM
  • Thanks again for the help.

    I conclude that HPC simply doesn't handle heterogeneous, serial tasks.  The scheduler is designed for homogeneous or parallel tasks, meaning tasks that use the same amount of resources and aren't node-specific, or tasks that can be run in parallel.  The scheduling of heterogeneous, serial tasks is poor enough that I would simplify a bit and say it simply doesn't work; HPC is not designed for it.

    So I (and anyone with heterogeneous, serial tasks) now have to wrap HPC with a higher-level "job" structure to contain a set of single-task HPC jobs, say a JobGraph, and persist it, keep track of what's going on, queue up the next HPC job, etc.  Basically all the same things that HPC does with its Job-Task relationship.  Note that despite your saying the scheduler needs to reserve all resources for the Job, my JobGraph runner will not do that; each of the sub-Jobs will be submitted when the previous one is done.  So I'm going to duplicate a bunch of functionality you already have, and the net result will be the same as if you had added a task-granularity resource reservation option, where Jobs reserve no resources and each task gets its resources reserved just in time.  So of course I disagree that a Job must reserve all its sub-Tasks' resources at the start; my implementation will not do that, and in fact the concept is foreign to me, which I guess is why it took me so long to accept that that's what you've done.
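
    Roughly, the wrapper I have in mind looks like the sketch below: each stage becomes its own single-task HPC job, and the next stage is only submitted when the previous one finishes, so nothing is reserved ahead of time (the command lines and head node name are placeholders, and I'm just polling for completion to keep it short):

        using System;
        using System.Threading;
        using Microsoft.Hpc.Scheduler;
        using Microsoft.Hpc.Scheduler.Properties;

        class JobGraphRunner
        {
            static void Main()
            {
                string[] stages =
                {
                    "CopyInSource.cmd", "EncodeAudio.exe", "EncodeVideo.exe", "Mux.exe", "CopyOutResult.cmd"
                };

                IScheduler scheduler = new Scheduler();
                scheduler.Connect("MyHeadNode");

                foreach (string commandLine in stages)
                {
                    ISchedulerJob job = scheduler.CreateJob();
                    ISchedulerTask task = job.CreateTask();
                    task.CommandLine = commandLine;
                    job.AddTask(task);
                    scheduler.SubmitJob(job, null, null);

                    // Wait for this stage to finish before submitting the next one,
                    // so resources are only ever requested just-in-time.
                    while (true)
                    {
                        job.Refresh();
                        if (job.State == JobState.Finished)
                            break;
                        if (job.State == JobState.Failed || job.State == JobState.Canceled)
                            throw new Exception("Stage failed: " + commandLine);
                        Thread.Sleep(5000);
                    }
                }
            }
        }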

    I'm quite surprised to learn about this issue in what appears to be a mature product.  I feel somewhat burned.  I was well into coding before figuring this out, and it took some effort to sell my team on HPC.  Your documentation should probably highlight this issue somewhere (or more prominently; I can't say I've read it all).

    Ryan

    Wednesday, May 26, 2010 2:11 PM