Answered: set-hpcnodestate -force not working as advertised?

  • Tuesday, February 24, 2009 14:00
     
     
    According to the help, set-hpcnodestate -force -state offline -name <nodename> should force a node to the offline state without going through draining. However, every time I try this, or use the Take Offline item in the UI, the node gets stuck in the Draining state.

    I had to rebuild a compute node due to a hardware failure and need to get the newly rebuilt node, reusing the old name, back into the HPC cluster.

    Is there a way to do this or am I stuck needing to rename the rebuilt node and having a "ghost" node in the online list forever?

    matt

All replies

  • Tuesday, February 24, 2009 19:01
     
     Answered
    Apparently the HPC Job Scheduler was in a very strange state that required multiple restarts to resolve. 

    After users complained that jobs weren't queueing and would only show as "submitting", and that clusrun wasn't working either, I performed one more restart of the job scheduler, and suddenly the force option worked to take the node offline.
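
    The sequence described above could be sketched in HPC PowerShell roughly as follows. This is only a sketch under assumptions: the node name is a placeholder, and the service name HpcScheduler for the HPC Job Scheduler is assumed rather than taken from this thread. Run it on the head node with administrative rights.

    ```powershell
    # Load the HPC cmdlets (not needed inside the HPC PowerShell console).
    Add-PSSnapin Microsoft.HPC

    # Placeholder name; substitute the node that is stuck in Draining.
    $node = "<nodename>"

    # Restart the HPC Job Scheduler service (service name is an assumption).
    Restart-Service -Name HpcScheduler

    # Show the node so its current state can be checked before retrying.
    Get-HpcNode -Name $node

    # Retry the forced transition to Offline, as in the original question.
    Set-HpcNodeState -Name $node -State Offline -Force
    ```

    As reported in this thread, a single restart may not be enough, so the restart and the forced offline step may need to be repeated.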
    • Marked as answer by msmoritz Tuesday, February 24, 2009 19:01
  • Tuesday, April 14, 2009 7:04
     
     
    Hello,

    Right now I'm stuck with the same problem described above. While the hpc2008 cluster was executing some jobs, I requested that the nodes go into the offline state.
    So the nodes went to the draining state, waiting until all jobs were finished. After all jobs were done, the nodes stayed in the draining state.

    Even the force command from above does not do anything.

    I restarted the head node several times, as well as the compute nodes.
    I even did multiple (> 20) service restarts of all compute cluster related services. Still the nodes are in the draining state.

    The deployment of another node went well.
    The only difference I can see is in the "operations" pane, where the new node has the correct name, e.g. hpc2k8node003. All other nodes (stuck in draining) have a trailing $, e.g. hpc2k8node001$. I don't have any idea where the $ sign comes from.

    Any suggestions?

    Thanks,

    Johannes
    JH
    • Edited by Johannes_de Thursday, April 16, 2009 6:34 Started new thread
  • Friday, April 17, 2009 12:40
     
     
    Cancel the operations on the nodes in the draining state, then reboot the head node and retry.  If they still won't pass draining, try another service restart.

    As I recall, I had to reboot the head node and then restart the services to get the nodes to clear out, and they finally did it without the force option.

    I've found that in many cases restarting the services doesn't sufficiently clear out the Node Manager/Job Manager states, and only a full reboot restores the managers' ability to handle things.

    matt
  • Sunday, April 19, 2009 5:15
     
     
    Unfortunately I've done all this several times. Still no change.

    Johannes
    JH