HPC Pack: Determining Cores to be allocated in Activation Filter

  • Question

  • Hi,

I'm running a cluster with Windows HPC Pack 2016 and am working on a custom Activation Filter for an application. Unfortunately, the application needs the number of cores specified on its command line. The cluster has compute nodes with various core counts, and I want to properly prepare a task during Activation Filter 'time' with a command line that specifies the correct number of cores being allocated.

    Unfortunately, during the Activation Filter, the parameter passed to the filter that specifies 'Resource Count' seems to count nodes, not cores. I'm also unable to determine which specific nodes are being considered, so I cannot retrieve their core counts for the purpose described. I've tried probing all node cores during this step, but at this time they are not yet allocated to the job.

    The question is: in the Activation Filter, how can I determine the number of cores that will be allocated to the job once the filter terminates successfully?

    Thanks,

    -Michael
    Wednesday, January 16, 2019 5:23 PM

Answers

  • Hi Michael,

    Could it be that your jobs are set for exclusive usage of nodes? You can check by running 'job view <jobid> /detailed | findstr IsExclusive'.

    If a job is set as Exclusive, then the resource count passed to the activation filter is in number of nodes. You may need the workaround to calculate the cores.

    Regards,

    Yutong Sun

    • Marked as answer by MichaelEnders Tuesday, January 29, 2019 1:02 PM
    Tuesday, January 29, 2019 9:34 AM

All replies

  • Hi Michael,

    What are the min/max and resource type settings of your job? Basically, the number of resources (core/socket/node) allocated to a running job cannot be decided while the job is in the queued state, unless in a specific setting, e.g. min equals max for a certain number. Perhaps you could estimate the resource count from the job's max resources?

    Regards,

    Yutong Sun

    Friday, January 18, 2019 8:44 AM
  • Hi Yutong,

    The job's resource type is 'Core'. The min/max core settings of the job are set to cover the range of available cores on the nodes of our cluster; currently min is 8 and max is 32. My understanding is that when resources for a job become available, the scheduler starts the activation filter. At this point I expect that the scheduler already knows how many cores it would allocate to the job if the filter returns StartJob, and it passes the resource count to the filter as a parameter. Unfortunately, this resource count is not in the same resource type as the job: instead of the number of cores that would be allocated, it provides the number of nodes that will be allocated. If I knew which nodes were being considered for allocation, I could probe them to see how many cores they had available, but I don't know which nodes they are.

    Currently I have found a workaround, but I would rather use a more direct approach. My current approach uses an activation filter that adds a 'dummy' task to keep the job running (CommandLine = "sleep 10") and starts the job. It also launches an auxiliary program I wrote that connects to the scheduler, opens the job, discovers the assigned nodes, and probes each core of those nodes to see which are assigned to this job. After counting the cores assigned, I add a new task to the job with the real command line, in which I can now specify the number of cores allocated.
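    As a rough sketch of the counting step, here is what the auxiliary program's last stage amounts to. The probe-result format and helper names below are my own illustration (not the actual program), and the '-n <cores>' switch stands in for whatever flag the application really expects:

    ```python
    # Illustrative sketch of the core-counting step of the workaround above.
    # `allocation` is assumed to map each assigned node name to the list of
    # core IDs the scheduler gave this job on that node; the real auxiliary
    # program obtains this by probing through the HPC Pack scheduler API.

    def count_allocated_cores(allocation):
        """Sum the cores assigned to the job across all allocated nodes."""
        return sum(len(cores) for cores in allocation.values())

    def build_task_command(app_path, allocation):
        """Build the real task command line once the core count is known.
        The '-n <cores>' flag is a placeholder for the application's
        actual core-count switch."""
        return f"{app_path} -n {count_allocated_cores(allocation)}"

    # Example: two nodes, 12 cores each assigned to the job.
    allocation = {"NODE01": list(range(12)), "NODE02": list(range(12))}
    print(build_task_command("app.exe", allocation))  # → app.exe -n 24
    ```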

    If there's a better way of doing this please let me know.

    Thanks,

    -Michael

    Friday, January 18, 2019 4:23 PM
  • Hi Michael,

    You are right, the resource count is a parameter provided by the scheduler. However, it should be in the same resource type as the job; in your case, the resource count should be in cores. Could you confirm that the resource count is not in cores?

    Regards,

    Yutong Sun

    Monday, January 21, 2019 3:40 AM
  • Hi Yutong,

    I can confirm that the value passed as 'resource count' appears to be the number of nodes, even though the job's resource type is set to Core. I request either 12 cores (which requires one node) or 24 cores (which requires two nodes). In the first case, the value of the resource count passed to my activation filter is 1, and in the second case the value passed is 2.

    Again, this is implemented as an executable called as the global activation filter.

    Regards,

    -Michael

    Tuesday, January 22, 2019 3:40 PM
  • Hi Michael,

    I cannot reproduce this 'resource count' issue with either a cluster-wide global activation filter or a job template filter. The resource count should be the number of resources (core/socket/node) allocated to the job; e.g., if the job is set to 12-12 min/max cores, the resourceCount would be 12. Please double-check the job's min/max settings and the arguments passed to your executable activation filter.

    Regards,

    Yutong Sun

    Wednesday, January 23, 2019 7:39 AM
  • Hello Yutong,

    Unfortunately, I did double-check, and I'm still experiencing the behavior mentioned above.

    The Job has type of resource set to 'Core' and Min and Max settings set to 12.

    When the job runs, this is the information passed to the activation filter (I've noted each parameter's position next to its value):

    Starting new activation filter pass (1/28/2019 8:54:38 AM)...
    	Job ID = 179
    	Parameters:
    		Scheduler Pass=10360   ; 1st parameter passed
    		JobIndex=1             ; 2nd parameter passed
    		Backfill=True          ; 3rd parameter passed
    		ResourceCount=1        ; 4th parameter passed


    When I change the min/max setting to 24 cores, which should use two compute nodes, the 4th parameter changes to 2, which I believe indicates the number of nodes.
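    For reference, a minimal sketch of how an executable filter might read these four parameters, in the order shown in the dump above (in a real filter they would come from sys.argv[1:], and exiting with code 0 tells the scheduler to start the job — an assumption based on this thread's setup):

    ```python
    # Sketch of parsing the activation filter parameters from the dump above:
    # SchedulerPass, JobIndex, Backfill, ResourceCount (in that order).

    def parse_filter_args(argv):
        """Convert the raw argument strings into typed values."""
        scheduler_pass, job_index, backfill, resource_count = argv[:4]
        return {
            "SchedulerPass": int(scheduler_pass),
            "JobIndex": int(job_index),
            "Backfill": backfill.lower() == "true",
            "ResourceCount": int(resource_count),
        }

    # Values from the log above: ResourceCount comes back as 1 (nodes, not cores).
    params = parse_filter_args(["10360", "1", "True", "1"])
    print(params["ResourceCount"])  # → 1
    ```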

    Is there anything else I can provide to help confirm this issue? As mentioned, I have a workaround, but I would like to use the filter as intended.

    Thanks,

    -Michael


    Monday, January 28, 2019 2:03 PM
  • Hi Michael,

    Could it be that your jobs are set for exclusive usage of nodes? You can check by running 'job view <jobid> /detailed | findstr IsExclusive'.

    If a job is set as Exclusive, then the resource count passed to the activation filter is in number of nodes. You may need the workaround to calculate the cores.
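    For illustration only: since an exclusive job's resource count arrives in nodes, the exact core total is unknown until nodes are assigned, but if you know the per-node core counts of your cluster you can at least bound it. The core-count list below is an assumed example, not from your cluster:

    ```python
    # Sketch: bound the total cores an exclusive job can receive, given
    # ResourceCount (in nodes) and the cluster's per-node core counts.
    # The cores_per_node values here are illustrative assumptions.

    def core_bounds(node_count, cores_per_node):
        """Return (minimum, maximum) total cores across any `node_count`
        exclusive nodes drawn from a cluster with the given core counts."""
        sorted_cores = sorted(cores_per_node)
        lo = sum(sorted_cores[:node_count])    # smallest nodes chosen
        hi = sum(sorted_cores[-node_count:])   # largest nodes chosen
        return lo, hi

    # Two exclusive nodes on a cluster with 8-, 12-, 16-, and 32-core nodes.
    print(core_bounds(2, [8, 12, 16, 32]))  # → (20, 48)
    ```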

    Regards,

    Yutong Sun

    • Marked as answer by MichaelEnders Tuesday, January 29, 2019 1:02 PM
    Tuesday, January 29, 2019 9:34 AM
  • Bingo!

    Yes, all jobs are submitted with the IsExclusive flag on. I think this is what we were missing. Unfortunately, I do need a way to guarantee that no other job is running on the node (it is also a requirement of the application), so I will keep using the workaround.

    Thanks!

    -Michael

    Tuesday, January 29, 2019 1:02 PM