CCS 2003 cluster connectivity problems!
-
Wednesday, 23 January 2008 8:46
We are experiencing very serious problems with our cluster.
My bet is they are connectivity problems: some nodes are randomly appearing as "Unreachable" in the compute cluster admin console, and even more serious, scheduling takes forever to complete.
When I submit a job to run on a few nodes (1 or 2), it remains in the queue for 1 or 2 MINUTES on a completely unloaded cluster (nothing running)!
A job requiring more nodes (say, HPL on 96 cores) remains in the waiting state for minutes and then fails, with no error messages.
Again, no job using more than 1 or 2 nodes can be submitted!
It is like the scheduler cannot contact compute nodes.
The strange part is: connectivity is OK for remote desktop, pinging nodes shows delays <1ms on both the ethernet and the IB connections.
Could it be an SQL problem?
Or a problem resolving names?
How can I diagnose it?
In event viewer I have only a couple of warnings on MRxSmb and DnsApi.
CcpScheduler writes every now and then "Slow heartbeater: missed a heartbeat with node XXXX. An error occurred while communicating with compute node XXXX. Failed to Ping node XXXX.."
Please help!!
PS: note that I already raised the same issues
http://archives.windowshpc.net/forums/thread/1708.aspx
but after a first contact nobody answered.
Now the problem has got worse and I really need to use the cluster to submit large (>=32 nodes) MPI jobs!
All replies
-
Wednesday, 23 January 2008 20:19 Owner
Hi Lorenzo,
Please check that the node management services are running on the compute nodes. The "Compute Cluster Node Management" service should be running. Restart these services.
Other ideas:
1. The node management services may be running but are not associated with the specific head-node. In this case, delete the "unreachable" node from the head-node Administrator Console. Then, use the "Add Node" wizard to select the compute node by name and add it again to the cluster.
Ok... That's just 1 extra idea... But, let us know if the problem remains and we'll give it more thought.
Best Regards,
Phil
-
Wednesday, 23 January 2008 22:30
Other ideas:
- What are your firewall settings? Have you got any group policies that may conflict with those mandated by compute cluster admin?
- Do you see repeated errors related to DNS or Active Directory in the compute node event logs?
- Do you see any frequent scheduler service restarts, accompanied by .NET faults?
Last but not least, have you placed a support call and if so can you give us the case number?
Giovanni
-
Thursday, 24 January 2008 16:09
Hi Phil.
I have tried to restart the service on all the machines, but clusrun fails with the exception
"The read operation failed, see inner exception."
I checked 3 or 4 nodes, and the service was running fine on them.
I have deleted the unreachable nodes. Even after deleting them, clusrun doesn't work (nor does the scheduler).
Other info:
- sometimes the job manager takes a minute to start
- several errors related to heartbeats and performance counters appeared in the event log:
"The configuration information of the performance library "C:\WINDOWS\system32\perfts.dll" for the "TermService" service does not match the trusted performance library information stored in the registry. The functions in this library will not be treated as trusted."
"Scheduler heartbeater: missed a heartbeat with node ENC1BLADE01. Attempted to read or write protected memory. This is often an indication that other memory is corrupt.."
From the event logs, it appears that the scheduler takes ages to contact the nodes:
it writes "The IP address of node ENC3BLADE10 is 10.2.0.105." to the logs at intervals ranging from 2 seconds to 1 minute. After 7 minutes it had contacted only 20 nodes.
Note that ping and nslookup resolve names instantaneously.
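Since nslookup queries the DNS server directly, it may be worth timing the same lookups through the operating system's resolver from the head node and comparing the answers with the addresses you expect. A minimal Python sketch under stated assumptions: it is not part of CCS, the names `check_name` and `report` are made up for illustration, and it assumes Python is available on the machine.

```python
import socket
import time


def check_name(hostname, resolver=socket.gethostbyname_ex):
    """Resolve hostname; return (addresses, elapsed_seconds, error_or_None)."""
    start = time.perf_counter()
    try:
        _name, _aliases, addresses = resolver(hostname)
        error = None
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        addresses, error = [], str(exc)
    elapsed = time.perf_counter() - start
    return addresses, elapsed, error


def report(hostnames, expected=None, resolver=socket.gethostbyname_ex):
    """Return one text line per host; flag errors and unexpected addresses."""
    expected = expected or {}
    lines = []
    for host in hostnames:
        addresses, elapsed, error = check_name(host, resolver)
        if error:
            status = "ERROR " + error
        elif host in expected and expected[host] not in addresses:
            status = "MISMATCH expected " + expected[host]
        else:
            status = "ok"
        lines.append(f"{host}: {addresses} in {elapsed * 1000:.1f} ms {status}")
    return lines
```

For example, `report(["ENC1BLADE01", "ENC3BLADE10"], {"ENC3BLADE10": "10.2.0.105"})` would list what each name resolves to and how long the lookup took. Slow or mismatched entries are a hint, not a diagnosis.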
Lorenzo
-
Thursday, 24 January 2008 16:32
Hi Giovanni,
- firewall settings were not changed. Nodes are in a private network, and the policy is to have the firewall disabled.
- on compute nodes I have some errors such as
"The Management service encountered an error communicating with the head node. Verify that the compute cluster services are running on each node and there is network connectivity between each node. No connection could be made because the target machine actively refused it".
However, services are running on both machines.
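"No connection could be made because the target machine actively refused it" is the Windows text for a refused TCP connection: the address answered, but nothing was listening on that port at that address. A small probe can separate "refused" from "timed out" for each node and interface. This is only a sketch: `probe` is a hypothetical helper, and the port number is something you must supply, since nothing in this thread states which ports the CCS services listen on.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target answered, but no service is listening on this port
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (filtered or wrong address)
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on
        return f"error: {exc}"
```

Running it once against the node's name and once against each of its IP addresses would show whether the name points at an interface where the service is not listening.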
I also have some DnsApi warnings:
"The system failed to register host (A) resource records (RRs) for network adapter
with settings:
Adapter Name : {5C8E97F0-DB39-49A4-B1D5-0FFB18146F8F}
Host Name : enc3blade06
Primary Domain Suffix : cluster.loc
DNS server list :
10.2.0.1
Sent update to server : <?>
IP Address(es) :
10.2.0.101
The reason the system could not register these RRs was because either (a) the DNS server does not support the DNS dynamic update protocol, or (b) the authoritative zone for the specified DNS domain name does not accept dynamic updates.
To register the DNS host (A) resource records using the specific DNS domain name and IP addresses for this adapter, contact your DNS server or network systems administrator."
I do have periodic scheduler restarts with .NET exceptions: "Object reference not set to an instance of an object" and "The attempt to read or update the store failed."
Lorenzo
-
Wednesday, 30 January 2008 14:47
Giovanni, Phil,
Any other ideas? What should I do? What is the easiest way forward? Maybe reinstalling CCS, or reinstalling the head node?
Please give me some hints, as I really don't know what to do.
Lorenzo
-
Monday, 4 February 2008 10:28
Well, after all the trouble we dug out the problem. It was simple, stupid, and related to network configuration (IP addresses on the MPI network were right, but those on the Private network were wrong). I think this mix made the scheduler unstable, while still letting it work some of the time.
Thanks to everybody for the help!
Lorenzo
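
For anyone hitting the same symptoms: the root cause reported above (wrong addresses on one of the cluster networks) is the kind of thing a short script can catch by checking every node's address against the subnet it is supposed to be in. A minimal sketch, with assumptions stated plainly: `misplaced_nodes` is a made-up helper, the `10.2.0.0/24` subnet is a guess based on the addresses quoted in this thread, and the third entry is invented to show a failure.

```python
import ipaddress


def misplaced_nodes(node_addresses, subnet):
    """Return {node: address} for every node whose address is outside subnet."""
    network = ipaddress.ip_network(subnet)
    return {
        node: addr
        for node, addr in node_addresses.items()
        if ipaddress.ip_address(addr) not in network
    }


# The first two addresses appear in the thread; the third is hypothetical.
nodes = {
    "ENC3BLADE10": "10.2.0.105",
    "enc3blade06": "10.2.0.101",
    "hypothetical-node": "10.3.0.7",
}

# The subnet is an assumption; use the one configured for your network.
for node, addr in misplaced_nodes(nodes, "10.2.0.0/24").items():
    print(f"{node} has address {addr}, outside the expected subnet")
```

The same check would need to be run once per network (Private, MPI, and so on), each with its own address list and subnet.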