CCS 2003 cluster connectivity problems!

  • Question

  • We are experiencing very serious problems with our cluster.
    My bet is that they are connectivity problems: some nodes randomly appear as "Unreachable" in the compute cluster admin console and, even more seriously, scheduling takes forever to complete.
    When I submit a job to run on a few nodes (1 or 2), it remains in the queue for 1 or 2 MINUTES on a completely unloaded cluster (nothing running).
    A job requiring more nodes (say, HPL on 96 cores) remains in the waiting state for minutes and then fails with no error message.
    Again, no job using more than one or two nodes can be submitted!
    It is as if the scheduler cannot contact the compute nodes.

    The strange part is that connectivity is fine for Remote Desktop, and pinging the nodes shows delays of <1 ms on both the Ethernet and the IB connections.
    Can it be an SQL problem?
    Or a problem resolving names?
    How can I diagnose it?

    In Event Viewer I have only a couple of warnings from MRxSmb and DnsApi.
    CcpScheduler writes every now and then "Slow heartbeater: missed a heartbeat with node XXXX. An error occurred while communicating with compute node XXXX. Failed to Ping node XXXX.."

    Please help!!


    PS: note that I already raised the same issues
    http://archives.windowshpc.net/forums/thread/1708.aspx
    but after a first contact nobody answered.

    Now the problem has gotten worse and I really need to use the cluster to submit large (>=32 nodes) MPI jobs!
    Wednesday, January 23, 2008 8:46 AM

Answers

  • Well, after all the trouble we dug out the problem. It was simple, stupid, and related to network configuration: the IP addresses on the MPI network were right, but those on the Private network were wrong. I think this mix made the scheduler unstable while still letting it work some of the time.
    Thanks to everybody for the help!

    Lorenzo

    Monday, February 4, 2008 10:28 AM
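
Editor's note: the thread does not say how the wrong Private-network addresses were found. As a minimal sketch of one way to catch this class of mistake, the snippet checks each node's configured address against the subnet it is supposed to live in. The function name, the `/24` prefix length, and the `ENC1BLADE01` address are assumptions made up for illustration; only the `10.2.0.101` and `10.2.0.105` addresses appear in the thread, and the thread does not state which network they belong to.

```python
import ipaddress


def find_misaddressed_nodes(node_ips, expected_network):
    """Return {node: ip} for every node whose IP lies outside expected_network.

    node_ips is a dict mapping node name to a dotted-quad string.
    expected_network is a CIDR string such as "10.2.0.0/24" (assumed here).
    """
    network = ipaddress.ip_network(expected_network)
    return {
        node: ip
        for node, ip in node_ips.items()
        if ipaddress.ip_address(ip) not in network
    }


# Hypothetical inventory for one network of the cluster.
configured = {
    "ENC3BLADE06": "10.2.0.101",   # address seen in the DnsApi warning
    "ENC3BLADE10": "10.2.0.105",   # address seen in the scheduler log
    "ENC1BLADE01": "10.3.0.11",    # invented outlier, for illustration only
}

# The subnet is an assumption; substitute the real one for each network.
suspects = find_misaddressed_nodes(configured, "10.2.0.0/24")
print(suspects)
```

Running the same check once per network (Private, MPI, and so on) with the corresponding inventory would flag nodes whose address on that network does not match the plan.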

All replies

  • Hi Lorenzo,

    Please check that the node management services are running on the compute nodes. The "Compute Cluster Node Management" service should be running. Restart these services.

    Other ideas:

    1. The node management services may be running but are not associated with the specific head-node. In this case, delete the "unreachable" node from the head-node Administrator Console. Then, use the "Add Node" wizard to select the compute node by name and add it again to the cluster.

    Ok... That's just 1 extra idea... But, let us know if the problem remains and we'll give it more thought.

    Best Regards,

    Phil

    Wednesday, January 23, 2008 8:19 PM
  • Other ideas: 
    - What are your firewall settings? Do you have any group policies that may conflict with those mandated by compute cluster admin?
    - Do you see repeated errors related to DNS or Active Directory in the compute node event logs?
    - Do you see frequent scheduler service restarts, accompanied by .NET faults?

    Last but not least, have you placed a support call and if so can you give us the case number?

    Giovanni

    Wednesday, January 23, 2008 10:30 PM

  • Hi Phil.

    I tried to restart the service on all the machines, but clusrun fails with the exception
    "The read operation failed, see inner exception."
    I checked 3 or 4 nodes by hand, and the service was running fine on them.

    I have deleted the unreachable nodes. Even after deleting them, clusrun doesn't work (nor does the scheduler).
    Other info:
    - sometimes the job manager takes a minute to start
    - several errors related to heartbeats and performance counters appeared in the event log:

    "The configuration information of the performance library "C:\WINDOWS\system32\perfts.dll" for the "TermService" service does not match the trusted performance library information stored in the registry. The functions in this library will not be treated as trusted."

    "Scheduler heartbeater: missed a heartbeat with node ENC1BLADE01. Attempted to read or write protected memory. This is often an indication that other memory is corrupt.."

    From the event logs, it appears that the scheduler takes ages to contact the nodes:
    it writes "The IP address of node ENC3BLADE10 is 10.2.0.105." to the logs at intervals ranging from 2 seconds to 1 minute. After 7 minutes it had contacted only 20 nodes.
    Note that ping and nslookup resolve names instantaneously.

    Lorenzo
    Thursday, January 24, 2008 4:09 PM
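
Editor's note: the slow node-by-node contact described in this post can be quantified from exported event log entries. The sketch assumes the entries are already available as `(datetime, message)` pairs in chronological order; how they are exported is not covered in the thread. The function name and the sample timestamps are invented for illustration, and the regular expression is modelled only on the one message quoted in the post.

```python
import re
from datetime import datetime

# Modelled on the quoted message: "The IP address of node ENC3BLADE10 is 10.2.0.105."
NODE_IP_RE = re.compile(r"The IP address of node (\S+) is (\d+\.\d+\.\d+\.\d+)")


def contact_gaps(entries):
    """Return a list of (node, ip, seconds_since_previous_contact).

    entries: iterable of (datetime, message) pairs in chronological order.
    Messages that do not match the node/IP pattern are skipped.
    The first matching entry gets a gap of 0.0.
    """
    results = []
    previous = None
    for timestamp, message in entries:
        match = NODE_IP_RE.search(message)
        if match is None:
            continue
        gap = 0.0 if previous is None else (timestamp - previous).total_seconds()
        results.append((match.group(1), match.group(2), gap))
        previous = timestamp
    return results


# Invented timestamps and a reconstructed second message, for illustration only.
sample = [
    (datetime(2008, 1, 24, 16, 0, 0), "The IP address of node ENC3BLADE10 is 10.2.0.105."),
    (datetime(2008, 1, 24, 16, 0, 2), "Scheduler heartbeater: missed a heartbeat with node ENC1BLADE01."),
    (datetime(2008, 1, 24, 16, 1, 0), "The IP address of node ENC3BLADE06 is 10.2.0.101."),
]

for node, ip, gap in contact_gaps(sample):
    print(f"{node} {ip} +{gap:.0f}s")
```

Listing the reported IP next to each gap also makes it easy to spot a node that the scheduler reaches on an unexpected address.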
  • Hi Giovanni,

    - firewall settings were not changed. Nodes are in a private network, and the policy is to have the firewall disabled.
    - on the compute nodes I have some errors like
    "The Management service encountered an error communicating with the head node. Verify that the compute cluster services are running on each node and there is network connectivity between each node. No connection could be made because the target machine actively refused it".

    However, services are running on both machines.

    I also have some Dnsapi warnings:

    "The system failed to register host (A) resource records (RRs) for network adapter
    with settings:

       Adapter Name : {5C8E97F0-DB39-49A4-B1D5-0FFB18146F8F}
       Host Name : enc3blade06
       Primary Domain Suffix : cluster.loc
       DNS server list :
             10.2.0.1
       Sent update to server : <?>
       IP Address(es) :
         10.2.0.101

     The reason the system could not register these RRs was because either (a) the DNS server does not support the DNS dynamic update protocol, or (b) the authoritative zone for the specified DNS domain name does not accept dynamic updates.

     To register the DNS host (A) resource records using the specific DNS domain name and IP addresses for this adapter, contact your DNS server or network systems administrator."

    I do have periodic scheduler restarts with .NET exceptions: "Object reference not set to an instance of an object" and "The attempt to read or update the store failed."

    Lorenzo


    Thursday, January 24, 2008 4:32 PM
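
Editor's note: "actively refused" means the TCP connection attempt reached a machine that had no listener on that port, which is a different failure from a silent timeout. A minimal sketch for telling the two apart from any node is shown here. The function name is hypothetical, and the port used by the cluster services is deliberately left as a parameter because the thread does not state it.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome.

    Returns "open" if the connection succeeds, "refused" if the target
    rejected it (nothing listening on that address and port), "timeout"
    if no answer arrived in time, or "error: ..." for anything else,
    such as a name that does not resolve.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


# Example call (host name and port are placeholders, not values from the thread):
#   probe_tcp("headnode", 12345)
```

A "refused" result while the service is reported as running is a hint that the name resolved to an address the service is not listening on, which would be consistent with a node carrying a wrong IP address on one of its networks.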
  • Giovanni, Phil,

    Any other ideas? What should I do? What is the easiest way forward? Maybe reinstalling CCS, or reinstalling the head node?
    Please give me some hints, as I really don't know what to do.

    Lorenzo

    Wednesday, January 30, 2008 2:47 PM