Unable to add an HPC Compute Node Following IP Address Change

  • Question

  • Hi,

    I recently had a hardware problem with a compute node that left it down for a short while. When the problem was eventually resolved, the server had picked up a new IP address from the DHCP pool, and the node was no longer usable within the Cluster Manager client tool on the HPC head node server.

    I have tried reinstalling the HPC programs to see if this would resolve the problem but that hasn't worked either.

    Does anyone have any ideas as to how I can debug this? Which log files to check?

    Thanks

    Cossy



    Monday, July 29, 2013 8:04 AM

All replies

  • What's the HPC Pack and OS version on your head node and compute node?

    Can you ping head node <=> compute node?

    Can you turn off the firewall on both machines?

    On your compute node, check the state of the services whose names start with "HPC". Are they in the running state?

    On the compute node, type set in the console (cmd.exe) and confirm that the CCP_SCHEDULER variable points to the head node.
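The CCP_SCHEDULER and name-resolution checks above could be scripted roughly as follows. This is only a minimal sketch: the function names are made up for illustration, and it assumes that comparing the short host name, ignoring case, is enough to decide that the variable points at the head node.

```python
import socket


def scheduler_matches(env, head_node):
    """Return True if CCP_SCHEDULER in `env` names `head_node`.

    Only the short host name is compared, case-insensitively, so
    "HEADNODE" matches "headnode.example.local" (an assumption of this sketch).
    """
    value = env.get("CCP_SCHEDULER", "")
    if not value:
        return False
    return value.split(".")[0].lower() == head_node.split(".")[0].lower()


def resolve(host):
    """Return the IPv4 address the local resolver gives for `host`, or None on failure."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```

On the compute node, calling `scheduler_matches(os.environ, "HEADNODE")` (after `import os`, with HEADNODE replaced by your own head node name) and `resolve("HEADNODE")` shows whether the variable is set as expected and which address the name currently resolves to.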


    Daniel Drypczewski

    Tuesday, July 30, 2013 5:44 AM

  • Hi Daniel,

    Thanks for replying.

    What's the HPC Pack and OS version on your head node and compute node?
    Headnode & Compute Node
    Microsoft HPC 2008 R2 3.3.3950

    Can you ping head node <=> compute node?
    Yes, I can ping the compute node from the head node and vice versa, using each node's hostname.

    Can you turn off the firewall on both machines?
    I have turned off the firewall on both machines. While both firewalls were down, I also reinstalled the HPC software components that we run on the compute node, to re-establish visibility from the head node, but still no luck:
    -Microsoft HPC Pack 2008 R2 Client Components.
    -Microsoft HPC Pack 2008 R2 Server Components.
    -Microsoft HPC Pack 2008 R2 LINQ to HPC Components(Preview).
    -Microsoft HPC Pack 2008 R2 MS-MPI Redistributable Pack.

    On your compute node, check the state of the services whose names start with "HPC". Are they in the running state?
    I have three services on the compute node and they are all in the running state:
    -HPC Management Service
    -HPC MPI Service
    -HPC Node Manager Service

    I've checked the services on the Headnode and they are in the following state:

    Status   Name                DisplayName
    ------   ----                -----------
    Stopped  HPCBasicProfile     HPC Basic Profile Web Service
    Running  HpcBroker           HPC Broker Service
    Running  HpcDiagnostics      HPC Diagnostics Service
    Running  HpcDsc              HPC Dsc
    Running  HpcManagement       HPC Management Service
    Running  HpcNodeManager      HPC Node Manager Service
    Running  HpcReporting        HPC Reporting Service
    Running  HpcScheduler        HPC Job Scheduler Service
    Running  HpcSdm              HPC SDM Store Service
    Running  HpcSession          HPC Session Service
    Stopped  HpcStorageSurro...  HPC Storage Management Surrogate

    On the compute node, type set in the console (cmd.exe) and confirm that the CCP_SCHEDULER variable points to the head node.
    It does point to the Headnode.

    Two other things...

    In the event log on the compute node, I see the following error message:
    "The HPC Management Service encountered an error communicating with the head node: macAddress.
    Verify that the HPC services are running on each node and there is network connectivity between each node. The HPC Management Service will attempt to reconnect in 5 minutes."

    I have managed to work around this problem by using the "Import/Export from XML file" option when adding a node. I exported the details of a working node to an XML file, copied it, and amended the copy to reflect the failing node and its MAC addresses. I was then able to import the failing node using the amended file. The previously failing node is now in an online state and I can run simple commands like 'dir' against it, but I'm still seeing the error message quoted earlier, so I'm not sure it's working entirely correctly.
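The copy-and-amend step of that workaround could be sketched as below. Note that the XML layout here (a `Nodes` root, a `Node` element with a `Name` attribute, and `MacAddress` child elements) is a simplified placeholder chosen for illustration, not the verified HPC Pack export schema. A real export may carry an XML namespace and further elements that this sketch ignores, so adapt the names to whatever your exported file actually contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a node file exported from a working node.
SAMPLE = """<Nodes>
  <Node Name="WORKINGNODE">
    <MacAddress>001122334455</MacAddress>
  </Node>
</Nodes>"""


def retarget_node(xml_text, new_name, new_macs):
    """Return a copy of the node XML renamed to `new_name` with `new_macs` as its MAC list."""
    root = ET.fromstring(xml_text)
    node = root.find("Node")
    node.set("Name", new_name)
    # Drop the working node's MAC addresses, then add the failing node's.
    for old in node.findall("MacAddress"):
        node.remove(old)
    for mac in new_macs:
        ET.SubElement(node, "MacAddress").text = mac
    return ET.tostring(root, encoding="unicode")


amended = retarget_node(SAMPLE, "FAILEDNODE", ["AABBCCDDEEFF"])
```

The amended text would then be saved to a new file and imported through the same "Import/Export from XML file" option.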

    Any further advice appreciated.  :0)

    Cossy


    Tuesday, July 30, 2013 8:04 AM
  • After seeing your log message, I remember that I have encountered this problem. I have noticed that when a compute node is reinstalled without first being removed from Cluster Manager, connectivity problems may occur. When you need to reinstall a compute node, this is a safe way of doing it:

    1. First, remove the compute node from HPC Cluster Manager before any reinstallation (this is important)

    2. Remove the compute node from the domain

    3. Reinstall the compute node

    4. Add the compute node to the domain again and restart

    Can you also check your Active Directory and DNS server? There may be duplicate entries, or wrong machine name and IP address assignments.

    (Server Manager\Roles\Active Directory Domain Services\Active Directory Users and Computers and Server Manager\Roles\DNS Server\DNS\)
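If the DNS records can be exported as a list of host name and IP address pairs, a duplicate check like the one suggested above might look like this. It is only a sketch: the function name and the sample records are invented for illustration, and it assumes host names should be compared case-insensitively.

```python
from collections import defaultdict


def find_conflicts(records):
    """Given (hostname, ip) pairs, return (names with several IPs, IPs shared by several names)."""
    ips_by_name = defaultdict(set)
    names_by_ip = defaultdict(set)
    for name, ip in records:
        key = name.lower()  # treat NODE01 and node01 as the same host
        ips_by_name[key].add(ip)
        names_by_ip[ip].add(key)
    multi_ip = {n: sorted(ips) for n, ips in ips_by_name.items() if len(ips) > 1}
    shared_ip = {ip: sorted(ns) for ip, ns in names_by_ip.items() if len(ns) > 1}
    return multi_ip, shared_ip


# Hypothetical records: node01 has a stale entry, and its new address collides with node02.
records = [
    ("node01", "10.0.0.11"),
    ("NODE01", "10.0.0.57"),
    ("node02", "10.0.0.57"),
]
multi_ip, shared_ip = find_conflicts(records)
```

A node that picked up a new DHCP address while its old record was still present would show up in the first result, and an address handed out to two machines would show up in the second.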


    Daniel Drypczewski

    Thursday, August 1, 2013 1:49 AM
  • Thanks again for the suggestion Daniel.

    I have to rebuild another of the worker nodes on the grid, so I'll be going through the same process as before.

    I'll try the steps that you suggest and see if they bring me a little more success the second time around.

    Thanks again.

    Cossy

    Thursday, August 1, 2013 7:42 AM