Can't add Compute node on a virtualized environment

Question
-
Hi,
I'm trying to set up a Windows 2008 R2 HPC cluster using VMware Workstation virtualization, just to run some tests. I'm using virtualization because I don't have physical machines to install the software on.
The steps I've done are as follows:
1. On a new virtual machine I installed Windows Server 2008 R2 HPC Edition and named it WIN2008R2-HPC-H. It has only one network adapter with a fixed IP address 192.168.5.1
2. On this server I configured it as Domain Controller and installed Windows Server SP1. Domain name: nstm.com
3. On this server I installed Windows 2008 R2 HPC Pack and configured it as Head Node.
4. On this server I also set up DHCP with an IPv4 Scope 192.168.5.11 - 192.168.5.250
5. Using the HPC Cluster Manager, I've configured:
5.1 The Network Configuration, with all nodes on Enterprise (option 5)
5.2 Provided the installation credentials
5.3 Configured the naming of new nodes
5.4 Created a Compute node template Without Operating System
6. On a new virtual machine I installed Windows Server 2008 R2 HPC Edition and named it WIN2008R2-CN001. It has one network adapter with DHCP enabled (DHCP has assigned the 192.168.5.11 IP address).
7. I added this computer to the Active Directory Domain: nstm.com (created previously)
8. On this computer I installed Windows 2008 R2 HPC Pack and configured it as Compute Node to be joined to the WIN2008R2-HPC-H Head Node. The following services are up and running:
8.1 HPC Management Service
8.2 HPC MPI Service
8.3 HPC Node Manager Service
9. I'm able to ping the Compute Node from the Head Node and vice versa.
10. I'm able to see the Compute Node from the HPC Cluster Manager and try to add it as a Compute Node.
Here's my problem:
When I try to add the Compute Node (using the Node Template created in step 5.4) to the cluster I always get the following errors:
Could not contact node 'WIN2008R2-CN001' to perform change. Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
Could not contact node 'WIN2008R2-CN001' to perform change. The management service was unable to connect to the node using any of the IP addresses resolved for the node.
Could not contact node 'WIN2008R2-CN001' to perform change. Connection Failed. Unable to read data from the transport connection: An established connection was aborted by the software in your host machine.
Could not contact node 'WIN2008R2-CN001' to perform change. The management service was unable to connect to the node using any of the IP addresses resolved for the node.
Final Notes:
These messages always appear after the following information message: Checking the configuration of node NSTM\WIN2008R2-CN001
I've checked the DNS Server (also configured on the Head Node) and there's a record for WIN2008R2-CN001 with IP Address 192.168.5.11
IPv6 is also configured on both machines, and ping using IPv6 is also working
Firewall settings were all configured by default and during the Windows 2008 HPC Pack installation I allowed the install to perform the required changes on both machines. (I've also tried to disable the firewall on both machines and the result was the same)
What might be the cause of the failure?
Thanks,
Nelson Morais
Nelson Morais nelsonmorais@yahoo.com
Sunday, March 6, 2011 2:58 PM
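The error "unable to connect to the node using any of the IP addresses resolved for the node" can be examined from the head node with a small script. What follows is only a minimal sketch in Python, not part of HPC Pack: the function name `probe` is hypothetical, and the port to test is an assumption you must supply yourself, because the thread does not say which port the management service uses. It resolves every IPv4 and IPv6 address for a host name and attempts a TCP connection to each one.

```python
import socket
import sys


def probe(hostname, port, timeout=3.0):
    """Try a TCP connection to every address that hostname resolves to.

    Returns a list of (address, connected, detail) tuples.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Name resolution itself failed, so there is nothing to connect to.
        return [(hostname, False, "resolution failed: %s" % exc)]

    results = []
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((sockaddr[0], True, "connected"))
        except OSError as exc:
            # Refused, timed out, unreachable, or unsupported address family.
            results.append((sockaddr[0], False, str(exc)))
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (port is whatever service you want to test; an assumption here):
    #   python probe.py WIN2008R2-CN001 <port>
    for address, ok, detail in probe(sys.argv[1], int(sys.argv[2])):
        print(address, "OK" if ok else "FAILED", detail)
```

If some addresses connect and others fail, that points at name resolution or a stale record rather than at the service itself.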
Answers
-
Hello Brian,
I was able to fix the problem. I'm using virtual machines to run the tests on the cluster, and the node machine had been created using the VMware Linked Clone option. Once I installed a completely new VM for the compute node, I was able to add it as a compute node.
I presume the cause of the problem was somehow related to the VMware Linked Clone.
Thanks,
Nelson Morais nelsonmorais@yahoo.com
- Marked as answer by Nelson Morais Saturday, March 26, 2011 1:12 AM
Saturday, March 26, 2011 1:12 AM
All replies
-
Hi Nelson,
Can you verify that DNS is correctly resolving addresses on both nodes?
For example, on both WIN2008R2-HPC-H and WIN2008R2-CN001, check that you are able to resolve both nodes:
nslookup WIN2008R2-HPC-H
nslookup WIN2008R2-CN001
Thanks,
--Brian
Friday, March 25, 2011 11:54 PM
-
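The nslookup check above can also be scripted so it is easy to repeat on both machines. This is a minimal sketch in Python, assuming Python is available on the nodes; the helper names `resolve_all` and `check_nodes` are made up for illustration. Note that it uses the operating system's resolver, so unlike nslookup it may also consult the hosts file.

```python
import socket


def resolve_all(hostname):
    """Return the sorted IPv4 and IPv6 addresses that hostname resolves to.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string. A set removes duplicates.
    return sorted({info[4][0] for info in infos})


def check_nodes(hostnames):
    """Map each host name to the list of addresses it resolves to."""
    return {name: resolve_all(name) for name in hostnames}


if __name__ == "__main__":
    # Node names taken from the thread; run this on both machines.
    nodes = ["WIN2008R2-HPC-H", "WIN2008R2-CN001"]
    for name, addresses in check_nodes(nodes).items():
        print(name, "->", ", ".join(addresses) if addresses else "NOT RESOLVED")
```

Both machines should report the same addresses for each name; a name that resolves on one node but not the other suggests a DNS client configuration problem.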
The problem appears because the system was cloned. As a result, two or more systems have the same GUID.
You need to run sysprep with Generalize selected on every cloned system. After that, the cloned system will restart and you will need to join it to the domain again (don't forget to delete its computer account in AD first). When you then add it to the cluster, the problem will disappear.
Hope this helps.
Please don't forget to vote as helpful and mark as answered if the answer helped to solve your problem
- Proposed as answer by SergiiKorin Thursday, October 27, 2011 2:44 PM
Thursday, October 27, 2011 2:44 PM
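For reference, the Generalize option in the Sysprep dialog corresponds to the `/generalize` switch on the command line. The sketch below, in Python, only builds and prints that command line; it does not run it. The helper name `sysprep_command` is hypothetical, and the default `C:\Windows` path is an assumption about where Windows is installed on the clone.

```python
import os
import subprocess


def sysprep_command(windir=r"C:\Windows", reboot=True):
    """Build the Sysprep command line that generalizes a cloned Windows image.

    windir is an assumed install location; adjust it for your system.
    """
    exe = windir.rstrip("\\") + r"\System32\Sysprep\sysprep.exe"
    command = [exe, "/generalize", "/oobe"]
    # /reboot restarts the machine afterwards; /shutdown powers it off instead.
    command.append("/reboot" if reboot else "/shutdown")
    return command


if __name__ == "__main__":
    command = sysprep_command(os.environ.get("WINDIR", r"C:\Windows"))
    # Printed only. Running it resets the machine identity, so do that
    # deliberately, from an elevated prompt on the cloned system.
    print(subprocess.list2cmdline(command))
```

After the generalized clone restarts, the remaining steps are the ones described above: delete the old computer account, rejoin the domain, and add the node to the cluster again.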