HPC Pack 2016 U2 First Job only gets 1 core

  • Question

  • My organization just stood up a new HPC Pack 2016 U2 cluster (with the 1/4/19 QFE), using balanced scheduling and immediate preemption, with several thousand cores and a 3-way clustered headnode. When we submit a parametric job (it doesn't matter what the actual tasks are), the job gets only one core and stays at one allocated core. If we submit a second job, the first job then gets half the cores in the cluster, but the second gets only one. If we add a third job, two of the jobs get 1/3 of the cores each, but one job always remains at one core.

    We have used a 2012R2U3 cluster for years (without a clustered headnode), and it has worked fine for us. This behavior is new and baffling. Does anyone have an idea of what is going wrong?

    Thanks


    Thursday, January 10, 2019 10:08 PM
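
To make the symptom concrete, here is a small sketch comparing the even split one would expect from balanced scheduling with the pattern described above. The cluster size of 4000 cores is a hypothetical figure (the post only says "several thousand"), and both functions are illustrative models, not anything taken from HPC Pack itself.

```python
def expected_shares(total_cores, n_jobs):
    """Even split of cores across jobs (what one would expect from balancing)."""
    base, extra = divmod(total_cores, n_jobs)
    # Hand the leftover cores to the first few jobs so the shares sum to the total.
    return [base + (1 if i < extra else 0) for i in range(n_jobs)]


def observed_shares(total_cores, n_jobs):
    """Pattern reported in the post: one job stuck at 1 core, the rest at 1/n each."""
    if n_jobs == 1:
        return [1]
    # The position of the stuck job in the list is arbitrary here.
    return [total_cores // n_jobs] * (n_jobs - 1) + [1]


TOTAL_CORES = 4000  # hypothetical; the post does not give an exact count

for n in (1, 2, 3):
    print(n, "expected:", expected_shares(TOTAL_CORES, n),
          "observed:", observed_shares(TOTAL_CORES, n))
```

In every case roughly one job's worth of cores sits idle, which is what makes the behavior look like a scheduling fault rather than contention.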

All replies

  • Hi AdamKobulnicky,

    Are you running HPC Pack version 5.2.6291.0 on all your nodes? Could you run 'cluscfg listparams' and post the output so we can check the scheduler settings? Could you also submit a simple parametric sweep job with 'job submit /parametric:1-100 ping -n * localhost' to see whether it reproduces the issue of only one core being allocated to the job?

    Regards,

    Yutong Sun

    Monday, January 14, 2019 8:34 AM
  • We found the issue by taking different compute nodes offline: once one particular node was removed, the scheduler behaved as expected. Reinstalling all the software on that compute node and re-adding it to the cluster resolved the issue. The nodes were all at 5.2.6291. It's still a mystery why one bad compute node would affect scheduling for the rest of the cluster.

    Thanks,

    Adam

    Thursday, January 17, 2019 4:00 PM
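
For anyone hitting the same symptom, the isolation approach described above can be sketched roughly as follows. This is a minimal illustration, not an HPC Pack API: `find_bad_node` and the `scheduler_ok` callback are hypothetical names, and removing nodes one at a time is an assumption (the post does not say how the nodes were grouped). In practice the callback would take the remaining nodes' counterpart offline, resubmit a test job, and report whether it received a sensible allocation.

```python
from typing import Callable, Iterable, List, Optional


def find_bad_node(
    nodes: Iterable[str],
    scheduler_ok: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the first node whose removal makes the scheduler behave, else None.

    scheduler_ok (hypothetical) receives the nodes left online and returns
    True if a test job gets the expected allocation with only those nodes.
    """
    all_nodes = list(nodes)
    for candidate in all_nodes:
        # Simulate taking exactly one node offline and retesting.
        remaining = [n for n in all_nodes if n != candidate]
        if scheduler_ok(remaining):
            return candidate
    return None


# Simulated cluster: scheduling misbehaves whenever "CN017" is online.
def fake_check(online: List[str]) -> bool:
    return "CN017" not in online


cluster = [f"CN{i:03d}" for i in range(1, 33)]
print(find_bad_node(cluster, fake_check))
```

This only works when a single node is at fault, as it was here; if several nodes were bad, removing any one of them would not restore normal scheduling and the function would return `None`.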