Deployment Services not working configuring Highly Available Headnodes

  • Question

  • Hopefully someone can point me in the right direction on this as I have gotten nowhere at all.

    I have built a pair of HA HPC headnodes and the build went fine. SQL installed with no problems, as did HPC Pack 2008.
    Things only seem to fall apart when you start to configure HPC.

    First off, it does not seem to configure any of the WDS bits as it does in a standalone installation. I had to manually add the registry key for RootFolder in WDSTFTP to get nodes to PXE boot.
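    For reference, this is roughly the fix I applied by hand (the key path is the standard WDS TFTP provider location; D:\RemoteInstall is a placeholder for wherever your REMINST folder actually lives):

```shell
:: Restore the TFTP root that a standalone install normally configures.
:: "D:\RemoteInstall" is a placeholder - point it at your actual REMINST folder.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\WDSServer\Providers\WDSTFTP" ^
    /v RootFolder /t REG_SZ /d "D:\RemoteInstall" /f

:: Restart WDS so the TFTP provider picks up the new root folder
net stop wdsserver
net start wdsserver
```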

    Next I found that all the various PXE support files were missing so I grabbed them off a standalone headnode and dropped them in place.

    Then I found that the config had not set up the boot.wim correctly, so I had to manually change STARTNET.cmd to point at the correct headnode once WinPE was running.

    Now I am finally at the point where I can progress no further. I have not been able to get much information to figure out what is going on, but the ExecutionClient log file gives the following (skipping the first bits where it loads drivers and inits the network):
    Connecting to host xxxx
    DNS Resolution
    Using IP address xxx.xxx.xxx.xxx
    Build Socket
    Connect
    Connection Success
    Initialization Complete!
    Sending initial startflag
    COMMAND:
    Command execution finished, sending result
    Result sent to server
    Terminating gracefully
    Shutting down connection

    Now the thing is I can find nothing on the server and the console itself terminates with the following:
    The system cannot find the file X:\Windows\nodename.hpc
    The network name cannot be found
    Invalid drive specification
    0 File(s) copied.

    As far as I can tell it should only try for nodename.hpc if there is some sort of problem connecting to the server. What am I missing, and why is the documentation for configuring HA headnodes so sketchy?

    Andy
    Tuesday, June 23, 2009 12:20 PM

Answers

  • Okay I figured out what the above effectively meant.

    It looks like the node was booting from the wrong NIC. PXE was taking place on the public LAN instead of the private LAN.

    Once I changed the BIOS to PXE boot only off the private LAN, it seems to be booting correctly. Now to see what fails next.

    Cheers for all the suggestions.

    Andy
    • Marked as answer by CrispyLurker Friday, July 3, 2009 3:19 PM
    Friday, July 3, 2009 3:19 PM

All replies

  • Hi Andy
    Coincidentally, I've been rolling out a cluster with high availability headnodes over the past week, but have not seen the issues you mention. From your email it seems like you are quite familiar with WDS configuration etc., so I apologise if these questions are a bit basic, but did you run the HPC Pack install on both headnodes one after the other? Have you tried failing the cluster resources over to the other headnode and trying a deployment from there?
    When you say that the documentation is sketchy, were you following the step-by-step guide here? http://technet.microsoft.com/en-us/library/cc719006(WS.10).aspx
    Cheers
    Dan
    Wednesday, June 24, 2009 8:47 AM
  • Hi Dan,

    That's how I did the install. The only difference is I am running SQL 2008. It is possible, I suppose, that something is not quite right there and is messing things up, but I would have expected to see some other kind of error generated.

    Will try again using 2k5 I think.

    Cheers

    Andy
    Wednesday, June 24, 2009 12:02 PM
  • Sounds like a good plan Andy.
    I have noticed some issues later in the deployment process, post installation of .NET 3.0, which appear to be transient. I've not investigated too far as yet, but I'd be interested in hearing how things go for you at that stage when you come to node deployment again.
    Regards
    Dan
    Wednesday, June 24, 2009 1:22 PM
  • Well, that doesn't work either.

    Everything was set up as per the instructions, but I think something is still failing without generating an error.
    The installation runs through fine and does not generate any errors; however, once in HPC, running a test against the headnodes to confirm all services are running fails:
    Service Name Status
    HPC Management Service Running
    HPC Node Manager Service Running
    HPC Basic Profile Web Service Stopped
    HPC Job Scheduler Service Stopped
    HPC SDM Store Service Stopped
    SQL Server (COMPUTECLUSTER) Stopped
    Windows Deployment Services Server Stopped
    DHCP Server Running

    In Failover Cluster Management everything is online. Aside from the disk and file services, the other resources are:
    HPC Scheduler Services
    HPC SDM Service
    SQL Server
    SQL Server Agent
    SQL Server Fulltext

    Now I suspect that at the very least I should also have WDS in there. Incidentally, it still does not install all the WDS support files in ~\Data\Boot\, though adding drivers does update the boot.wim as you would expect. As well as this, the registry key RootFolder is missing in TFTP, so PXE boot does not work.
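    For reference, this is roughly how I check for and repair the missing key (D:\RemoteInstall is a placeholder for your own REMINST folder):

```shell
:: Check whether the TFTP RootFolder value exists at all
reg query "HKLM\SYSTEM\CurrentControlSet\Services\WDSServer\Providers\WDSTFTP" /v RootFolder

:: If it is missing, add it by hand and start WDS
reg add "HKLM\SYSTEM\CurrentControlSet\Services\WDSServer\Providers\WDSTFTP" ^
    /v RootFolder /t REG_SZ /d "D:\RemoteInstall" /f
net start wdsserver
```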

    Still at a complete loss as to what the issue may be. I did wonder for a while whether deployment was simply not supported on HA headnodes, in the same way that using them as compute or broker nodes is not.

    Andy
    Thursday, June 25, 2009 7:24 AM
  • Hi Andy, that's interesting.
    The 'All services running' diagnostic test output you quote is correct for the inactive headnode, other than the WDS service, which should be running. You should also see diagnostic output for the active headnode.
    Only the scheduler and SDM HPC services are clustered in an HA configuration. These are the services which require consistent availability; the HPC management services listed are not required to fail over for continuation of service, and should be running on both nodes. I'm seeing the following service status here:

    inactive head node:
    Service Name Status
    HPC Management Service Running
    HPC Node Manager Service Running
    HPC Basic Profile Web Service Stopped
    HPC Job Scheduler Service Stopped
    HPC SDM Store Service Stopped
    SQL Server (COMPUTECLUSTER) Stopped
    Windows Deployment Services Server Running
    DHCP Server Running

    active head node:
    Service Name Status
    HPC Management Service Running
    HPC Node Manager Service Running
    HPC Basic Profile Web Service Stopped
    HPC Job Scheduler Service Running
    HPC SDM Store Service Running
    SQL Server (COMPUTECLUSTER) Running
    Windows Deployment Services Server Running
    DHCP Server Running

    The WDS and DHCP services should be running on both headnodes, and nodes should be able to use either (images are held on a clustered filestore resource).
    This video http://resourcekit.windowshpc.net/IT%20PRO/Videos1/HPCHighAvailability.wmv has some interesting info on how HA works in this scenario.
    Still looks like you're having WDS issues, though. Out of interest, which network model are you using, and are you using the headnodes for DHCP on your private (management) network, or do you have a discrete DHCP server?
    Cheers
    Dan
    PS WDS is supported on HA headnodes :)

    Thursday, June 25, 2009 8:28 AM
  • Andy, 
    You were able to TFTP down the boot image and execution client, so I suspect the issue you are seeing may not be related to WDS as such.
    Look at the management txt log (in the \Program Files\Microsoft HPC Pack\Data\LogFiles directory).
    On the head node, start the cluster manager, and in the node management pane take a look at the 'operations' -- look through the failed operations and tell me if you see anything that seems suspect.
    Another place to look would be the 'provisioning log' of the node you tried to provision; it might have better information (or it might just show a timeout...).
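    A quick way to sweep those logs for anything suspect (the path below is the default install location; adjust it if yours differs):

```shell
:: Recursively search all HPC Pack log files for errors and failures
findstr /s /i /n /c:"error" /c:"fail" "C:\Program Files\Microsoft HPC Pack\Data\LogFiles\*.*"
```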

    thanks
    -parmita
    pm
    Thursday, June 25, 2009 5:47 PM
    Moderator
  • Hi folks,

    I may be a little slow in responding for a while as I have a number of other projects running that are now pulling me away from this HPC deployment (or not, as the case may be).

    The situation now is that I have the HA headnodes rebuilt, running SQL 2k5 this time, and it did seem to set up WDS correctly... on the active headnode at least. It created all the WDS support files, modified boot.wim and configured WDS correctly. On the failover node the files replicated correctly, but it did not create the RootFolder reg key for the WDS TFTP provider or automatically start the WDS service. These were trivial to fix so I am not bothered.

    The issue now is that the compute node being built boots and loads WinPE, then sends the request to the headnode to get started (I am at home and not looking, so this is from memory).
    The XML request goes through, then it does a 5 second wait and retries... a few times. At this point, if I am quick enough, I can catch it on the headnode and occasionally get the machine to start building. If I can get a build started then all is well.

    The thing is, this is not the usual behaviour. On the single standalone headnode clusters I have, it boots up, establishes a name for the new node and then pretty much waits until I tell it what to do from the headnode. From the HA nodes it doesn't do this.

    Additionally, the logs from the compute nodes are not getting written back to the headnode; it looks like the node is not getting the build account details. If I run "net use" with the build account pointed at the headnodes and then rerun startnet, it dumps the logs, but again they don't really tell me anything other than what I originally posted.
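    For the record, the manual workaround I use to get the logs copied back looks roughly like this (the headnode name, domain and account are placeholders; REMINST is the deployment share the build normally uses):

```shell
:: Map the REMINST share by hand with the build (deployment) account;
:: the trailing * prompts for the password interactively
net use \\MYHEADNODE\REMINST /user:MYDOMAIN\buildaccount *

:: Then rerun startnet so the log copy has somewhere to go
startnet
```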

    This to me seems to be very odd behaviour really.

    A
    Friday, June 26, 2009 10:07 PM
  • The DHCP/WDS configurations should be identical in terms of the registry configuration. Adding entries might actually be working against you in that regard. If your network and NICs are all behaving as they should and are properly configured, you don't have to touch anything in the registry or file system. You should review your operations log to determine whether there were any non-fatal warnings that could be contributing to the passive node's behavior.

    I'm at home right now so I will limit this response. It helps greatly when triaging PXE deployment to be able to view the target compute node's video output. If you don't have an IPMI solution, you should have a crash cart with a KVM handy so you can monitor CN state.

    One way to sort out errors is to shut down the passive HN. This forces all PXE and boot traffic to one server. If everything works as expected, shut the active node down, bring up the passive node (making it active) and retry your tests. Normally, boot servicing is distributed across the two DHCP servers (not through any formal mechanism like round robin, however). If one system is working but the other is not, then you might see unusual behavior as a result.
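    If you would rather move the clustered resources than power a node off, you can drive the failover from the command line as well (the group names match the ones listed earlier in this thread; HEADNODE2 is a placeholder, and the cluster.exe syntax here is from memory, so verify it on your system):

```shell
:: Move the clustered HPC resource groups to the other head node
cluster group "HPC Scheduler Services" /moveto:HEADNODE2
cluster group "HPC SDM Service" /moveto:HEADNODE2
```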

    1) WDS on the passive node. It should have started, and should not require any registry manipulation. You should troubleshoot why it didn't start. A way to get started is to view the Win32 error code when the service fails; from the command line, type:

    sc query wdsserver

    This will display the Win32 error code which should give some indication of what the failure is.
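    To turn that code into something readable, you can feed it to net helpmsg (the 1722 below is just an example code; use whatever WIN32_EXIT_CODE sc reports):

```shell
:: Query the service and note the WIN32_EXIT_CODE field in the output
sc query wdsserver

:: Translate a Win32 error code into its message text
net helpmsg 1722
```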

    2) Waiting for authorization: if I interpret what you said correctly, the "waiting for authorization" phase is what you're describing next. This isn't WinPE but a very small boot program that spins. It should spin for a while (I don't have the number handy, but it should be many minutes). If this is timing out in a few seconds, then I'm unsure as to what could be wrong. If you've set up your default boot order to put private network PXE boot first, then the looping behavior will end, the system will reboot, but then the looping program will get loaded again. This should go on for hours until you authorize or reject the node.

    3) The logs you mention. This is the ExecutionClient log and from your description, you are referring to the WinPE configuration phase. Under normal circumstances, no log is written back to the HN unless there is an error. If you want the log to be copied all the time, you can mount the WinPE boot image and modify startnet.cmd to do that. You are correct in that the deployment account (build account as you have called it) credentials are sent to ExecutionClient so that it may log in to the REMINST share in order to continue deployment.

    I would urge you to peel back the onion so to speak and triage the earliest failures. Test your HA HNs one at a time so we can help you resolve your issues.

    C
    Monday, June 29, 2009 4:41 PM
  • I have tried shutting down the passive node and still no joy. It seems to run through this GetNextTask a few times then just quits.

    Details below. Any clues?

    **** Retrying GetNextTask in 5 seconds ****
    **** Remoting Data missing in head-node transmission ****
    Raw Data from server:
    <SOAP-ENV:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:x
    sd="http://www.w3.org/2001/XMLSchema" xmlns:SOAP-ENC="http://schemas.xmlsoap.org
    /soap/encoding/" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmln
    s:clr="http://schemas.microsoft.com/soap/encoding/clr/1.0" SOAP-ENV:encodingStyl
    e="http://schemas.xmlsoap.org/soap/encoding/">
    <SOAP-ENV:Body>
    <i2:GetNextTaskResponse id="ref-1" xmlns:i2="http://schemas.microsoft.com/clr/ns
    assem/Microsoft.ComputeCluster.Management.ICommandServer/Microsoft.Ccp.ClusterMa
    nagementInterfaces">
    <return href="#ref-4"/>
    </i2:GetNextTaskResponse>
    <a1:TaskDescription id="ref-4" xmlns:a1="http://schemas.microsoft.com/clr/nsasse
    m/Microsoft.ComputeCluster.Management.TemplateModel/Microsoft.Ccp.TemplateModel%
    2C%20Version%3D2.0.0.0%2C%20Culture%3Dneutral%2C%20PublicKeyToken%3Dnull">
    <sleep>10000</sleep>
    <command id="ref-5"></command>
    <operationId href="#ref-5"/>
    <reboot>false</reboot>
    <terminate>false</terminate>
    <sendProgress>false</sendProgress>
    <env_vars xsi:null="1"/>
    </a1:TaskDescription>
    </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

    **** Retrying GetNextTask in 5 seconds ****
    COMMAND:
    **** Command execution finished, sending result ****
    **** Result sent to server ****
    **** Terminating gracefully ****
    **** Shutting down connection ****
    The system cannot find the file X:\Windows\nodename.hpc.
    Access is denied.
    Invalid drive specification
    0 File(s) copied

    Friday, July 3, 2009 2:24 PM
  • This indicates to me that there is a mismatch between the version of ExecutionClient that is part of the WinPE download and the headnode OS. ExecutionClient expects the SOAP emitted by the headnode to be of a certain form and the "Remoting Data missing" message is an indication that the SOAP message is different.

    By any chance, are you running a beta version of Server 2008 R2 on the headnode? If you are running WS08, I'm wondering if there is a patch that might be causing some issues. I need to do some research but confirming the version of the OS you're using would help.

    Monday, July 6, 2009 6:59 PM
  • Am running Server 2008 Enterprise x64 SP1
    Ver > 6.0.6001

    As I mentioned before, swapping the compute node so that it only PXE booted off the private LAN seemed to fix the problem I was seeing. Now I am just seeing errors later in the deployment, including nodes getting stuck in provisioning.

    A

    Tuesday, July 7, 2009 7:05 AM
  • Hi Andy
    Glad that you managed to work around your original issue. My compute nodes are isolated on the private network only, so that problem did not rear its head.
    As I mentioned earlier in this thread, however, I did encounter issues later on in the deployment process. Do you have more information (maybe the node provisioning log) to hand?
    Cheers
    Dan 
    Tuesday, July 7, 2009 9:35 AM