Hello,
This past weekend, a user submitted more than 4000 jobs to run on a group of 8 nodes. The same evening, a user's daily job was submitted to a different group of 160 nodes (the two groups share no nodes). The next day, yet another user submitted a weekly job that doesn't request a specific group of nodes and will run on any that are available.
When I arrived at the office this morning, the weekly job and the daily job were stuck in the queued state for no apparent reason. Additionally, about half of the nodes in the 160-node group currently show 8 cores in use according to the cluster manager, even though only a dozen single-node jobs are in the running state. The rest of the jobs are queued even though the other half of the group has all cores available.
We checked the event logs and parsed the binary HPC logs, but nothing stood out to us.
What might have caused this? How can we further troubleshoot? What is the best way to prevent these jobs from queuing when there are resources available?
Thank you