Server failed during deployment, cannot reimage.

    Question

  • I had started a deployment on 5 nodes when the headnode failed and restarted. When it came back up, the five nodes were still in the Provisioning state, but nothing is happening. In the Provisioning log the last entry is "Reverted", and Properties shows "no ongoing operations can be cancelled"; State: Provisioning. In Operations the last entry is "saving node information to file - state, committed", and it has not changed for a long time now.

    The nodes themselves were in their waiting loop for authorization from the headnode. I have now switched them off and will boot them up again once I can clear them out of the Provisioning state, so that I can start all over again.

    Any clues how to do that?

    Regards
    Ivan
    Tuesday, August 11, 2009 4:32 PM

All replies

  • Hi Ivan
    You could try this PowerShell command:
    Get-HpcOperation -NodeName <yournode> | Stop-HpcOperation
    Also, have you tried right-clicking the last entry in the Operations log and choosing to cancel the operation?
    Cheers
    Dan
    Wednesday, August 12, 2009 8:08 AM
  • Dan,

    Glad to hear from you. I have tried the above with no effect. It seems the headnode is really confused as it shows that the nodes are online and provisioning, whilst they are actually powered down. Also, if I go to the Operations or Provisioning log, I have no option to cancel any operations.

    I have installed HPC 2008 SP1 hoping that the new service pack would resolve this, but it has had no effect.

    Regards
    Ivan
    Wednesday, August 12, 2009 10:57 AM
  • Hi again Ivan
    Things seem to be a bit confused here!
    Try

    Set-HpcNodeState -Force -State "Offline" -Name "<nodename>"

    to force the state to offline. This may also fail if the cluster considers that jobs are still running on the node. In that case a more heavy-handed solution is to remove the nodes from the cluster using

    Remove-HpcNode -Name "<nodename>"

    This will kick the node out entirely, but should at least allow you to redeploy from scratch.

    Let me know how you get on.
    Dan



    • Marked as answer by iivuch Wednesday, August 12, 2009 11:45 AM
    Wednesday, August 12, 2009 11:17 AM
  • Dan,

    Trying this, I could not get the node to change its state to offline, but the Remove command worked! This is really superb help, as it now allows me to carry on with re-deployment!

    Many many thanks!

    Ivan
    Wednesday, August 12, 2009 11:46 AM
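
For anyone with several nodes stuck the same way, the three steps suggested in this thread could be chained roughly as below. This is only a sketch, not a tested procedure: the node names are hypothetical placeholders, it assumes the commands are run from the HPC PowerShell console on the headnode, and it assumes the object returned by Get-HpcNode exposes a NodeState property. Remove-HpcNode may also ask for confirmation.

```powershell
# Sketch only. Run from the HPC PowerShell console on the headnode.
# The node names are hypothetical placeholders; substitute your own.
$stuckNodes = "NODE01", "NODE02"

foreach ($node in $stuckNodes) {
    # Step 1: try to cancel any operation still recorded for the node.
    Get-HpcOperation -NodeName $node | Stop-HpcOperation

    # Step 2: try to force the node into the Offline state.
    Set-HpcNodeState -Name $node -State Offline -Force

    # Step 3: if the node still is not Offline, remove it from the
    # cluster so it can be redeployed from scratch.
    # Assumption: the returned node object has a NodeState property.
    $state = (Get-HpcNode -Name $node).NodeState
    if ($state -ne "Offline") {
        Remove-HpcNode -Name $node
    }
}
```

In the case reported above, steps 1 and 2 had no effect and only the removal worked, so expect errors from the first two commands if the headnode is in the same confused state.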