Job does not consume all available nodes

  • Question

  • Hello - We have a cluster in Azure in which the nodes are deallocated until a job is submitted. Once we submit a job, the cluster starts to spin up, but our job is never allocated all of the available nodes. The allocation starts to grow but then stops at some point. There is no consistency in the number of nodes assigned: sometimes 30, sometimes 129, etc. Our scheduler is set up in queued mode with immediate pre-emption and task-level pre-emption. How do I go about troubleshooting this? This cluster is running HPC Pack 2012, Update 3.

    thanks


    • Edited by str8ace Thursday, January 17, 2019 9:51 PM
    Thursday, January 17, 2019 9:49 PM

All replies

  • Hi str8ace,

    I assume you are using the auto grow shrink service to grow nodes for the jobs. Could you post the auto grow shrink configuration returned by the PowerShell command 'Get-HpcClusterProperty -AutoGrowShrink'? Remember to run 'Add-PSSnapin Microsoft.HPC' beforehand. When you say 'our job is never allocated the full available nodes', do you mean that the nodes do not grow for the jobs, or that the nodes do grow but the scheduler does not allocate them to the jobs? If the former, we need to check the management service logs (HpcManagement_*.bin files) under the %CCP_DATA%LogFiles\Management folder on the head node to see why auto grow does not work for the jobs. If the latter, we need to check the scheduler service logs (HpcScheduler_*.bin files) under the %CCP_DATA%LogFiles\Scheduler folder on the head node to understand why the scheduler does not allocate resources to the jobs. You can email the logs, or a download link for them, to hpcpack@microsoft.com so that we can take a look.
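    As a rough sketch, the checks above could be run from a PowerShell session on the head node along these lines. The two HPC commands and the log locations are the ones named above; the `$recent` count and the sorting by last write time are only illustrative assumptions to help find the logs covering the time the job was submitted.

```powershell
# Load the HPC Pack cmdlets (run this on the head node).
Add-PSSnapin Microsoft.HPC

# Print the auto grow shrink configuration so it can be posted in the thread.
Get-HpcClusterProperty -AutoGrowShrink

# Assumption: only the most recently written log files are of interest.
# Adjust this number so the files cover the time the job was submitted.
$recent = 10

# Join-Path copes whether or not %CCP_DATA% ends with a backslash.
$logRoot = Join-Path $env:CCP_DATA 'LogFiles'

# Management service logs: relevant if the nodes do not grow for the jobs.
Get-ChildItem -Path (Join-Path $logRoot 'Management') -Filter 'HpcManagement_*.bin' |
    Sort-Object -Property LastWriteTime -Descending |
    Select-Object -Property Name, LastWriteTime, Length -First $recent

# Scheduler service logs: relevant if the nodes grow but are not allocated to the jobs.
Get-ChildItem -Path (Join-Path $logRoot 'Scheduler') -Filter 'HpcScheduler_*.bin' |
    Sort-Object -Property LastWriteTime -Descending |
    Select-Object -Property Name, LastWriteTime, Length -First $recent
```

    The listing only shows which files exist and when they were last written; the .bin files themselves still need to be sent for analysis.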

    Regards,

    Yutong Sun

    Friday, January 18, 2019 7:32 AM