Node Unreachable

    Question

  • I have a small Compute Cluster that I am rebuilding.  Two of the nodes were shown as Unreachable in the Node Management interface.  If the Microsoft Compute Cluster Node Manager Service on a node is restarted, the node changes to Ready for about a minute before it becomes Unreachable again.

     

    One of the nodes became Ready again and appears to be stable but the other one continues to be unreachable.

     

    The CcpManagement.log doesn't show anything interesting.

    The application and system logs are also uninteresting.

     

    What could be causing this behavior?

     

    Thanks

    Thursday, May 29, 2008 1:02 AM

Answers

  • Interestingly (to me)

     

    Ping between the head node and the compute node didn't work, but ping between any compute nodes worked.  This was caused by name resolution on the head node.

     

    The IP address in the hosts file on the head node was incorrect, but a compute node appeared to query the ICS DNS table, which was correct.

     

    Pinging the IP address from the head node worked, but that didn't help other than demonstrating that the basic network was working.

     

    Apparently the Compute Cluster Pack changes the hosts file as needed.  I am assuming that the node management service on the compute node was broken at some level.  If the service was restarted, it would bring the compute node back online for about 60 seconds.

     

    Reinstalling the Compute Cluster Pack didn't change the situation, i.e. the symptoms were the same.

     

    A bare-metal reinstall of the compute node seems to have fixed the problem.

    • Proposed as answer by Brian Broker Wednesday, July 02, 2008 10:58 AM
    • Marked as answer by Allan Hilchie Monday, August 25, 2008 10:26 PM
    Friday, May 30, 2008 4:17 PM
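
    For anyone hitting the same symptom, the stale hosts-file entry described above can be checked with a few lines of Python. This is only a rough sketch, not part of the Compute Cluster Pack: the function names, node names and addresses are made up for illustration. The "expected" addresses have to come from a source you trust (for example the address the compute node itself reports), because resolving the name on the head node would just read the same hosts file again.

```python
def parse_hosts(text):
    """Map each lowercase host name in hosts-file text to its IP address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry seen for a name.
            entries.setdefault(name.lower(), ip)
    return entries


def stale_entries(hosts_text, expected):
    """Return {name: (hosts_ip, expected_ip)} for entries that disagree.

    `expected` is a dict of node name -> known-good IP address supplied
    by the caller; names missing from the hosts file are ignored.
    """
    entries = parse_hosts(hosts_text)
    stale = {}
    for name, good_ip in expected.items():
        hosts_ip = entries.get(name.lower())
        if hosts_ip is not None and hosts_ip != good_ip:
            stale[name] = (hosts_ip, good_ip)
    return stale
```

    On the head node this could be called as `stale_entries(open(r"C:\Windows\System32\drivers\etc\hosts").read(), {"NODE01": "192.168.0.20"})`, where `NODE01` and the address are placeholders. An empty result means the hosts file agrees with the addresses you passed in.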

All replies

  • Hi Allan,

     

    Can you successfully ping between the head node and the compute node (both ways)?  How about a "net use" between the nodes?  These simple tests might help provide clues about the underlying issue.

     

    --Brian

    Friday, May 30, 2008 5:42 AM
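
    The ping test suggested above can be scripted so that name resolution and basic reachability are checked separately. The following is a minimal sketch under a few assumptions: `ping_command` and `check` are illustrative names, the system `ping` tool is on the path, and an exit code of 0 is treated as "reachable", which is only a rough indicator. Run it on the head node against a compute node name, then on the compute node against the head node name.

```python
import platform
import socket
import subprocess


def ping_command(host, count=1, system=None):
    """Build a ping command line; Windows takes -n for the count, most others -c."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def check(host):
    """Resolve `host` by name, then ping it by name and by resolved address."""
    try:
        address = socket.gethostbyname(host)
    except OSError as exc:  # resolution failed
        return {"host": host, "resolved": None, "error": str(exc)}
    by_name = subprocess.run(ping_command(host), capture_output=True).returncode == 0
    by_addr = subprocess.run(ping_command(address), capture_output=True).returncode == 0
    return {
        "host": host,
        "resolved": address,
        "ping_by_name": by_name,
        "ping_by_address": by_addr,
    }
```

    If `resolved` shows an address the node does not actually have, the problem is name resolution rather than the network itself.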