Issue deploying Bare Metal from a CN template on a failover head node cluster

    Question

  • I'm running a 2008 R2 HPC Cluster, and I'm getting the following output while trying to deploy a WIM through a compute node template (bare metal).
    The head node, US001SH0001, is a member of a failover cluster.  The head node application (file server) is US001SH0000.
    (It's currently a single node failover cluster, as I'll be adding the 2nd head node to the failover cluster once it's available.)
    I'll just show the output first:

    Contacting CommandServer on HeadNode US001SH0001
    (GUID, MAC, Network Adapters found)
    ****Initializing Server Proxy
    ****Connecting to host US001SH0001
    ****DNS Resolution
    ****Using IP address 192.168.1.1
    ****Build Socket
    ****Connect
    ****Connection failure: A socket operation was attempted to an unreachable network (10051)
    ****IP address list has been exhausted for host name US001SH0001
    ****Waiting to retry
    ****DNS resolution
    ****Using IP address 192.168.1.1
    ****Build socket
    ****Connect
    ****Connection Success
    ****Initialization complete!
    ****Sending initial start flag
    COMMAND: net use /delete z: & net use z: \\US001SH0000\REMINST "*******" /user:"*******"
    The network connection could not be found.

    More help is available by typing NET HELPMSG 2250.

    System error 53 has occurred.

    The network path was not found.

    ****Command execution finished, sending result 2
    ****Result sent to server
    COMMAND: net use /delete z: & net use z: \\US001SH0000\REMINST "*******" /user:"*******"
    The network connection could not be found.

    looping, etc etc etc

     

     

    The compute node I'm trying to image can't resolve US001SH0000, the virtual instance of the head node.  I've tried to find a way to edit the template to use the IP instead, or to use US001SH0001 (the physical head node, since the CN CAN resolve to this name), but this command isn't a configurable step of any template (seems to be issued between steps 1 and 2). 
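
    (For reference, a quick way to confirm that it really is just a name-resolution problem, from any machine on the private network, is something like the following; the drive letter and the IP are simply what my setup uses.)

    rem the clustered head node name does not resolve from the compute node side
    nslookup US001SH0000
    rem the physical head node name resolves fine
    nslookup US001SH0001
    rem mounting the share by IP should succeed, which is why I wanted to swap the name for the IP in the template
    net use z: \\192.168.1.1\REMINST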

    I see that in step 11 there's a task to mount a share in the same fashion, with the path \\%CLUSTERNAME%\REMINST, but there's simply no way to modify the initial mount.

    Since I created this template while the HPC cluster manager was connected to US001SH0000, I even deleted the template and created a new one while the cluster manager was connected to US001SH0001, but this didn't change anything.

    I created a Node XML file but haven't used it, since I realized the issue lies in the template itself.

     

    Monday, February 21, 2011 9:42 PM

All replies

  • UPDATE:

    To work around this issue, I had to set the "015 DNS Domain Name" option for the private network in the DHCP scope on the head node US001SH0001.

    I then had to install the DNS Server role, add the private-network domain name as a forward lookup zone in DNS, and add a host record for the clustered head node instance, US001SH0000, pointing to 192.168.1.1.
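
    Roughly, the command-line equivalent looks like the sketch below. The zone name hpc.local and the scope address 192.168.1.0 are placeholders for whatever your private network actually uses, and the DNS Server role itself can be added through Server Manager.

    rem set DHCP option 015 (DNS Domain Name) on the private network scope
    netsh dhcp server scope 192.168.1.0 set optionvalue 015 STRING "hpc.local"
    rem create the forward lookup zone and add a host (A) record for the clustered head node name
    dnscmd /zoneadd hpc.local /primary /file hpc.local.dns
    dnscmd /recordadd hpc.local US001SH0000 A 192.168.1.1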

    If there is no other fix for this, it should at least be included in the procedure for failover head node HPC setups:
    http://technet.microsoft.com/en-us/library/hpc_cluster_head_node_failover_cluster(WS.10).aspx

    Hopefully this helps someone out there!

    Tuesday, February 22, 2011 3:51 PM
  • UPDATE 2:

    This got the Windows image onto the node, but the rest of the template would fail when attempting to join the domain.
    I had to add forwarders to a couple of domain controllers on the DNS server that I added to my head node.  This kept DNS requests for the domain name from stopping at the head node itself.
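
    If you prefer the command line, something along these lines should do it (the DC addresses here are placeholders):

    rem forward anything the head node's DNS server cannot answer to the corporate domain controllers
    dnscmd /resetforwarders 10.113.38.10 10.113.38.11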

    Also, I was having lots of issues with the step where "Checking the configuration of node XYZ" would error out: 
    "Cound not contact node XYZ to perform change.  Authentication failed.  A call to SSPI failed, see inner exception.
    "Cound not contact node XYZ to perform change.  The management service was unable to connect to the node using
    any of the IP addresses resolved for the node."

    I BELIEVE that adding a DHCP reservation on the head node for each compute node, complete with the MAC of its private-network NIC and the name XYZ.local (.local being the domain I had to create in the forward lookup zones), fixed this.  Even after I added these reservations and kicked off the whole imaging job again, the authentication error appeared seven times before it started to work and continued applying the template.
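
    For reference, the reservations can also be created from the command line; a rough sketch, where the scope, IP, MAC and node name are all placeholders:

    rem reserve a private-network address for one compute node (repeat per node)
    netsh dhcp server scope 192.168.1.0 add reservedip 192.168.1.50 001122aabbcc XYZ "HPC compute node"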

    Wednesday, February 23, 2011 2:10 PM
  • Hi Phil-D,

    Sorry to hear about your troubles, but I'm not sure I understand what you mean by "a single node failover cluster".  If you check the URL you provided above, you will find a link to the 'Requirements' section, where the first hardware requirement says "You need two servers for the failover cluster that runs the head node..."

    Hopefully your deployment will be smoother when your second head node is available.  If not, please let us know.

    Thanks,
    --Brian

    Thursday, February 24, 2011 9:17 AM
  • Thanks Brian,

    It simply seemed like it would work... not sure why it wouldn't mount the shares correctly.

    I have in fact gotten the 2nd node, created a failover cluster that validates perfectly, and yet HPC Pack will not install on either.  The install gets to "Configuring high availability resources" and, after a short time, begins rolling back, citing only a fatal error.

    Another oddity I noticed before, and have noticed again now that I've had to create a single head node HPC cluster, is that during bare metal deployment the provisioning process stops at "Installing Windows (Expected time:  30 minutes)".  The compute nodes themselves have Windows installed, are logged in as the local admin, and are sitting there staring at the Server Manager window.  After 4 hours, the provisioning process times out and disassociates the template from the nodes.

    To get around THIS issue, I discovered that manually joining the nodes to the domain, rebooting, installing HPC Pack, and applying the default compute node template without an OS actually works for getting them added to the HPC cluster.
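
    Roughly, the manual steps per node look something like this (the domain and credentials are placeholders; the template is then assigned from HPC Cluster Manager):

    rem run on the compute node itself: join the domain and reboot
    netdom join %COMPUTERNAME% /domain:YOURDOMAIN /userd:YOURDOMAIN\hpcadmin /passwordd:*
    shutdown /r /t 0
    rem after the reboot, install HPC Pack on the node, then apply the default compute node template (without an OS image) from HPC Cluster Manager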

    This just seems like way too many headaches.

    Friday, March 4, 2011 7:36 PM
  • Brian,

    I've had a few people looking at this latest issue with me.  I've got a two-node, validated failover cluster running.  When installing HPC Pack, specifically the HPC Pack server components, it rolls back with a fatal error.  Here's more info:

     

    I've come up with a correlation.  During the install of the HPC Pack server components, while it sits at "Configuring High Availability Resources", the log shows:

    CAQuietExec: 16:27:56.905: F3- 369: Bringing group 'US001SH0000' online

    CAQuietExec: 16:27:56.937: F3- 402: Group 'US001SH0000' state is Pending; sleeping

    CAQuietExec: 16:27:57.951: F3- 402: Group 'US001SH0000' state is Pending; sleeping

    CAQuietExec: 16:27:58.965: F3- 402: Group 'US001SH0000' state is Pending; sleeping

    .........................

    CAQuietExec: 16:28:58.253: F2- 502: Setting install state to 128 for this node

    CAQuietExec: 16:28:58.300: F1- 271: Installed Failed: Exception @ line 410 in file 3 - error 997: File Server Group failed to go online

    CAQuietExec:

    CAQuietExec: Error 0x800703e5: Command line returned an error.

    CAQuietExec: Error 0x800703e5: CAQuietExec Failed

    ..........................

    ....we found that the SQL logs simultaneously pop up with this error:

    login failed for user 'tp\us001sh0000$'. reason: token-based server access validation failed with an infrastructure error. client 10.113.38.151

    (note that US001SH0000 is the clustered file server - IP 10.113.38.150, and US001SH0001 is the physical "first head node" - IP 10.113.38.151)

    The account tp\us001sh0000$ was not created by my SQL admin, but by the HPC Pack install at some point. We could not figure out anything past this, but it seems the most relevant clue.

    Friday, March 4, 2011 10:52 PM
  • Hi Phil:

    Regarding your new problem installing the head node: you're right to scope the problem down to a DB connectivity issue, though it is expected that 'tp\us001sh0000$' connects to the DB.

    Can you check in SQL Server whether:

    • the login tp\us001sh0000$ is present and enabled at the server level.
    • the user ccpheadnode is present at the database level in all 4 HPC databases (one way to check is sketched below).
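
    A rough way to do both checks with sqlcmd (the SQL server name is a placeholder, and the database names HPCManagement, HPCScheduler, HPCReporting and HPCDiagnostics are assumptions based on a default HPC Pack 2008 R2 install):

    rem server-level login
    sqlcmd -S SQLSERVER01 -E -Q "SELECT name, is_disabled FROM sys.server_principals WHERE name = 'tp\us001sh0000$'"
    rem database-level user, repeated for each of the 4 HPC databases
    sqlcmd -S SQLSERVER01 -E -d HPCManagement -Q "SELECT name FROM sys.database_principals WHERE name = 'ccpheadnode'"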
    Saturday, March 12, 2011 3:32 AM
  • Hi Zhen,

    I checked the remote SQL 2008 R2 Server DB, and yes, the computer account login tp\us001sh0000$ is present and enabled at the server level. 

    This account has a user mapping for each HPC DB; each user is named ccpheadnode and each is dbo.

    Monday, March 28, 2011 4:15 PM
  • Phil,

    Can you let me know your email so that we can have a shorter turnaround on solving this issue? I'm involving a SQL Server support engineer to help sort this out. It would be helpful if you could:

    • Verify that Windows Authentication works in that environment, by trying things like creating a login for another domain account, granting it some access, and connecting from another client with that credential (a rough sketch follows this list).
    • Send me (zhenwei@microsoft.com) the SQL Server log file, which contains the login failure you mentioned earlier.
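
    One rough way to do the first check (the server name, the test account, and the HPCManagement database name are all placeholders/assumptions):

    rem on the SQL server: create a login for a test domain account and grant it read access to one of the HPC databases
    sqlcmd -S SQLSERVER01 -E -Q "CREATE LOGIN [tp\testuser] FROM WINDOWS"
    sqlcmd -S SQLSERVER01 -E -d HPCManagement -Q "CREATE USER [tp\testuser] FOR LOGIN [tp\testuser]; EXEC sp_addrolemember 'db_datareader', 'tp\testuser'"
    rem then, logged on as that account on another machine, confirm Windows Authentication works end to end
    sqlcmd -S SQLSERVER01 -E -d HPCManagement -Q "SELECT SUSER_SNAME()"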

    Thanks.

    Wednesday, March 30, 2011 8:36 AM
  • Hi,

    Did you eventually fix this issue, and could you share the fix? I encountered a problem that looks very similar: http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/a1e5c36f-c141-4b44-aaeb-e551f84b8392/#a1e5c36f-c141-4b44-aaeb-e551f84b8392

    Thanks in advance,
    Christoph

    Monday, January 23, 2012 8:55 AM
  • Christoph,

    It does look similar.  Very frustrating, but never resolved.  We had a vendor tell us to install Enterprise SQL Server alongside HPC on the failover cluster itself.  This was of course a workaround, but it eliminated the need for the installer to talk to a remote SQL server and "sleep".

    We had so many issues, not to mention all the custom tweaks that each piece of software needed (and thus had to be done twice), that we've now opted to use 3 head nodes, creating 3 discrete clusters.

    Wednesday, February 15, 2012 7:03 PM
  • Hi Phil,

    one of my colleagues finally got it working by

    • formatting both head nodes of our HA cluster and re-installing the OS,
    • uninstalling the whole remote SQL Server instance (not just dropping the DBs, but the whole instance) and re-creating it, and
    • installing the HA cluster and the HPC cluster afterwards.

    We had an installation fail before the problem occurred, so we assume that leftovers of that installation remained somewhere. Now everything is working as intended.

    By the way, it is interesting that you could install SQL Server into the same HA cluster as HPC; I thought that was not supported.

    Thanks for your response,
    Christoph

    Thursday, February 16, 2012 8:25 AM