Nodes remain in Draining state. All jobs finished; forced operations not working.

  • Question

  • Hello,

    Right now I'm stuck with the same problem described here:

    While the HPC 2008 cluster was executing some jobs, I requested that the compute nodes go into the offline state.
    The nodes went into the Draining state, waiting until all jobs were finished. But after all jobs were done, the nodes stayed in the Draining state.

    Even the force command from above does not do anything:

    Set-HpcNodeState -Force -State Offline ...

    I restarted the head node several times, as well as the compute nodes.
    I even did multiple (> 20) restarts of all compute-cluster-related services (HPC**** + SQL Server). The nodes are still in the Draining state.
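
    For reference, the restart sequence described above can be scripted roughly like this (the Hpc* wildcard and the MSSQLSERVER default instance name are assumptions; adjust them to your installation):

    ```powershell
    # Restart every HPC-related service, then the default SQL Server instance.
    # Service names are assumptions -- list them first with: Get-Service Hpc*
    Get-Service Hpc* | Restart-Service -Force
    Restart-Service -Name MSSQLSERVER -Force
    ```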

    The deployment of another node went well.
    The only difference I can see is in the "Operations" pane, where the new node has the correct name, e.g. hpc2k8node003. All the other nodes (stuck in Draining) have a trailing $, e.g. hpc2k8node001$. I have no idea where the $ sign comes from.

    Removing a compute node from the entire domain does not change anything. So my best guess at the moment is that some update brought confusion to the node names.

    Any suggestions?

    Thursday, April 16, 2009 6:34 AM


All replies

  • Look at the application log and the HPC management log (in the data directory under Program Files) to see if you find anything suspicious.
    I would have deleted the compute node entries, but seeing that they are stuck in the Draining state, I don't know whether that would succeed.
    Unfortunately, at this point, if this is not a production cluster and you don't mind losing the job history and other information you might have in the head node database, uninstall the HPC Pack on the head node and reinstall it. Since the compute nodes are already configured, they will come up in the admin console as 'Unknown', and you should be able to assign a non-imaging template to add them to the cluster.
    Let me know how this goes.


    Thursday, April 16, 2009 10:09 PM
  • Hi Parmita,

    thanks for your reply.

    I have already tried all console/PowerShell-based commands (forced as well) to take the nodes offline or to remove them.
    None of them was successful.
    Indeed, this cluster is non-production; however, I wonder what I would do if that were not the case.

    I have already strolled through the logs, but I didn't find anything pointing me directly in the right direction.
    Currently I am investigating some access errors involving the NT AUTHORITY\ANONYMOUS LOGON account, even though the correct administrator credentials are provided.

    The funny thing remains that, all possible domain and Group Policy settings and screw-ups aside, I integrated a new node without a problem and can change its state to offline and back to online as often as I please.



    Friday, April 17, 2009 6:50 AM
  • Hi,

    Just in case it could be related: I had some trouble with the Draining state (again) some time ago.

    The set-hpcnodestate -force output said something about an incorrect job ID. Since (luckily) this was a non-production system too, I dared to look at the database itself and found that in the resource table two cores were reserved by a job ID that no longer existed in the job table. I have no clue what happened or why. I changed some values to match those of the other cores of the same node, and afterwards I was able to force the node offline again. I guess this is a hazardous and surely unsupported approach, but since it was a test machine, I just wanted to do some investigation. Maybe it is helpful.

    The patch for the original Draining issue was installed on that machine. It had already saved me some weeks ago. :-)
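
    For anyone attempting the same (unsupported) inspection: a query along these lines would surface such orphaned reservations. The table and column names (Resource, Job, Id, JobId) are guesses based on the description above; verify them against the actual scheduler database schema, and back up the database before changing anything.

    ```sql
    -- Unsupported sketch: find resource rows still reserved by a job
    -- that no longer exists in the job table. Names are assumptions.
    SELECT r.*
    FROM Resource AS r
    LEFT JOIN Job AS j ON j.Id = r.JobId
    WHERE r.JobId IS NOT NULL
      AND j.Id IS NULL;
    ```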


    Friday, April 17, 2009 7:49 PM
  • Hi,

    If you type NODE LIST at a command line, what state is shown for the node? Also, if you type NODE VIEW <nodename>, can you share the result?

    The above will query the scheduler to determine what the scheduler thinks is going on.

    Friday, April 17, 2009 9:47 PM
  • Hi,

    NODE LIST returns that the nodes are offline.
    Still, the HPC Cluster Manager shows Draining and an ongoing operation in the overview,
    but "No ongoing operations to be canceled" in the Properties pane of the current node.

    PS > node list
    Node Name           State       Max Run Idle
    ------------------- ----------- --- --- ----
    HPC2K8MASTER        Offline     2   0   0
    HPC2K8NODE001       Offline     2   0   0
    HPC2K8NODE002       Offline     2   0   0
    HPC2K8NODE003       Offline     2   0   0
    HPC2K8NODE004       Offline     2   0   0
    HPC2K8NODE005       Offline     2   0   0
    HPC2K8NODE006       Offline     2   0   0
    HPC2K8NODE007       Offline     2   0   0
    HPC2K8NODE008       Offline     2   0   0
    HPC2K8NODE009       Unreachable 2   0   0
    PS > node view hpc2k8node001
    System Id                       : 3
    System GUID                     : f1f7b618-2ea9-4e79-a194-1a105aaa879c
    Job Types                       : Batch, Admin, Service
    State                           : Offline
    Number Of Cores                 : 2
    Number Of Sockets               : 2
    Offline Time                    : 02.04.2009 15:59:16
    Online Time                     : 09.02.2009 10:05:38
    Security Descriptor             : S-1-5-21-1263207901-2180335630-1282098717-2193
    Memory Size                     : 2047
    CPU Speed                       : 3200
    Node Groups                     : ComputeNodes



    Monday, April 20, 2009 6:05 AM
  • That means the scheduler has already marked those nodes offline, so you cannot schedule any jobs against them, while the admin console still presents them in the Draining state.

    Please try to restart the SDM service on the head node to see whether it solves the problem. We will look deeper into it on our side as well.
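
    A rough way to do that from PowerShell (the service name HpcSdm is an assumption; confirm it against the listing first):

    ```powershell
    # List the HPC services, then restart the SDM store service.
    # HpcSdm is an assumed name; check the listing for your installation.
    Get-Service Hpc* | Format-Table Name, DisplayName, Status
    Restart-Service -Name HpcSdm -Force
    ```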

    Monday, April 20, 2009 6:40 AM

  • Please try to restart the SDM service on the head node to see whether it solves the problem. We will look deeper into it on our side as well.


    I've done that several times and still no change.

    In contrast to node list and node view, the command:

    PS > set-hpcnodestate -force -state offline -name hpc2k8node008

    says the node is draining:

    NetBiosName               NodeState       NodeHealth      Groups
    -----------               ---------       ----------      ------
    HPC2K8NODE008             Draining        OngoingOpera... ComputeNodes


    I searched some more event logs and found the following in the "Windows HPC Server" log:
    Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'.
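
    Those entries can be pulled out with something like this (the log name is taken from the post above; the filter string is just an assumption):

    ```powershell
    # List recent access-denied events from the HPC event log.
    Get-EventLog -LogName "Windows HPC Server" -Newest 200 |
        Where-Object { $_.Message -match "ANONYMOUS LOGON" } |
        Select-Object TimeGenerated, Source, EntryType
    ```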



    Monday, April 20, 2009 6:49 AM
  • Hi everyone,

    I don't know if this will be helpful in your investigation, but I've noticed in our environment that the nodes that get stuck in the draining-to-eternity state are the ones that were added as pre-configured nodes. Any nodes I add from bare metal have no problems going offline from the GUI (Admin Console).

    Note regarding my pre-configured nodes: these are just nodes that I previously added through bare metal. I then reinstalled the HPC Pack on the head node and re-added them to the cluster as pre-configured, so they should be no different from those I add through bare metal afterwards, unless I'm missing something here.

    Also: I noticed that I cannot use "Run command" on the nodes that were added as pre-configured. I don't know whether this is related to their not being able to go offline either.


    Richard P.

    Monday, April 20, 2009 2:40 PM
  •  the nodes that get stuck in the draining-to-eternity state are the ones that were added as pre-configured nodes.


    Thanks for your suggestions. However, in my case all nodes were added from bare metal. They got stuck in the Draining state after online operations.


    Tuesday, April 21, 2009 5:40 AM
  • Try this patch:

    Be mindful, however, that if there is a reboot pending, it might reboot your head node.

    Let me know how that went.
    • Marked as answer by Johannes_de Friday, April 24, 2009 6:50 AM
    Thursday, April 23, 2009 10:19 PM
  • Hi Parmita Mehta,

    Thank you very much. With the patch applied, forcing the offline state worked.

    I'm just wondering why Google and MS Search didn't turn up this patch... at least not in any language other than Japanese.


    Friday, April 24, 2009 6:50 AM
  • Hi, I've hit the same problem now, but the link doesn't work. Can you tell me what its content is? Thank you very much! :)
    Tuesday, July 23, 2013 5:41 AM
  • Finally I found it. Is it this one? http://support.microsoft.com/kb/967222/en-us
    Tuesday, July 23, 2013 6:50 AM