set-hpcnodestate -force not working as advertised?

  • Question

  • According to the help, set-hpcnodestate -force -state offline -name <nodename> should force a node into the offline state without going through Draining; however, every time I try this, or use the Take Offline item in the UI, the node gets stuck in the Draining state.

    I had to rebuild a compute node due to a hardware failure and need to get the newly rebuilt node, reusing the old name, back into the HPC cluster.

    Is there a way to do this or am I stuck needing to rename the rebuilt node and having a "ghost" node in the online list forever?

    matt
    Tuesday, February 24, 2009 2:00 PM
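
    For reference, a minimal sketch of the force-offline path the question describes, run from an HPC PowerShell prompt on the head node; NODE001 is a placeholder for the real node name:

        # Per the help, -Force should skip Draining and take the node
        # straight to Offline.
        Set-HpcNodeState -Name NODE001 -State offline -Force

        # Check which state the node actually ended up in.
        Get-HpcNode -Name NODE001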

Answers

  • Apparently the HPC Job Scheduler was in a very strange state that required multiple restarts to resolve. 

    After users complained that jobs weren't queueing and would only show as "Submitting" (clusrun wasn't working either), I performed one more restart of the job scheduler, and suddenly the force option worked to take the node offline.
    • Marked as answer by msmoritz Tuesday, February 24, 2009 7:01 PM
    Tuesday, February 24, 2009 7:01 PM
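
    A minimal sketch of that restart step, run from an elevated PowerShell prompt on the head node; the service name HpcScheduler is an assumption, so list the HPC services first if it differs on your install:

        # List the HPC-related services to confirm the scheduler's
        # actual service name before restarting anything.
        Get-Service Hpc*

        # Restart the job scheduler service (name assumed here).
        Restart-Service -Name HpcScheduler

        # Then retry forcing the node offline; NODE001 is a placeholder.
        Set-HpcNodeState -Name NODE001 -State offline -Force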

All replies

  • Hello,

    Right now I'm stuck with the same problem described above. While the HPC 2008 cluster was executing some jobs, I requested that the nodes go into the offline state.
    So the nodes went into the Draining state, waiting until all jobs were finished. After all jobs were done, the nodes stayed in the Draining state.

    Even the force command from above does not do anything.

    I restarted the head node several times, as well as the compute nodes.
    I even did multiple (> 20) service restarts of all compute-cluster-related services. Still the nodes are in the Draining state.

    The deployment of another node went well.
    The only difference I can see is in the "Operations" pane, where the new node has the correct name, e.g. hpc2k8node003. All the other nodes (stuck in Draining) show up as, e.g., hpc2k8node001$. I have no idea where the $ sign comes from.

    Any suggestions?

    Thanks,

    Johannes
    JH
    • Edited by Johannes_de Thursday, April 16, 2009 6:34 AM Started new thread
    Tuesday, April 14, 2009 7:04 AM
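
    A quick way to compare the names and states the cluster is actually reporting, including the trailing-$ entries; this is a sketch, and the NetBiosName/NodeState property names are assumptions (pipe one node through Get-Member if they differ):

        # List every node with its reported name and state to spot
        # which entries carry the trailing $ and which are stuck
        # in Draining.
        Get-HpcNode | Format-Table NetBiosName, NodeState, NodeHealth
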
  • Cancel the operations on the nodes in the Draining state, then reboot the head node and retry. If they still won't get past Draining, try another service restart.

    As I recall, I had to reboot the head node and then restart the services to get the nodes to clear out, and they finally did it without the force option.

    I've found that in many cases restarting the services doesn't sufficiently clear out the Node Manager/Job Manager state, and only a full reboot restores the managers' ability to handle things.

    matt
    Friday, April 17, 2009 12:40 PM
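
    A minimal sketch of that cancel/reboot/retry sequence, using only the cmdlets already named in this thread; canceling the pending operations themselves is done in the Operations pane of the management console, and the rest can be scripted:

        # After canceling the stuck operations in the UI, reboot the
        # head node. shutdown.exe avoids relying on Restart-Computer,
        # which requires PowerShell 2.0.
        shutdown /r /t 0

        # Once the head node is back up, retry taking the node offline,
        # first normally, then with -Force if it sticks in Draining
        # again. NODE001 is a placeholder for the real node name.
        Set-HpcNodeState -Name NODE001 -State offline
        Set-HpcNodeState -Name NODE001 -State offline -Force
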
  • Unfortunately I've done all this several times. Still no change.

    Johannes
    JH
    Sunday, April 19, 2009 5:15 AM