Wednesday, February 13, 2008 15:36
We are experiencing serious problems with our cluster. The following error message appears in the Application log in Event Viewer on each compute node:
The Management service encountered an error communicating with the head node. Verify that the compute cluster services are running on each node and there is network connectivity between each node. No connection could be made because the target machine actively refused it.
We have 20 nodes in the cluster. The head node and 9 compute nodes are on one switch. We recently added another 10 compute nodes (on a second switch) to the cluster. Both switches are plugged into a layer 3 switch and are on the same VLAN.
Here is some information about our cluster environment.
1. On each compute node, the Microsoft Compute Cluster Management Service, Compute Cluster MPI Service, and Compute Cluster Node Manager Service show as started when the error occurs.
2. Firewall settings were not changed recently. After the 10 compute nodes were added, some jobs finished OK using all compute nodes.
3. The issue happened in the late evening. When it was noticed in the morning, connectivity was OK: Remote Desktop worked, and pinging the compute nodes from the head node showed delays of <1 ms.
4. The head node is also a domain controller (we have a backup domain controller on the network), the data storage server, and the SQL Server host.
5. When the issue happened, there was no related error message in Event Viewer on the head node.
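A successful ping only shows that the host is reachable. "Actively refused" usually means the TCP connection attempt reached the target machine but nothing accepted it on that port. A minimal Python sketch for telling these cases apart from a compute node is below. The function name `probe_tcp`, the host name, and the port number in the usage comment are hypothetical placeholders, not values taken from this cluster or from the Compute Cluster documentation.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target answered with a reset: no listener on that port,
        # or something on the path rejected the connection
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped
        return "timeout"
    except OSError:
        # Name resolution failure, unreachable network, and so on
        return "error"

# Hypothetical usage; substitute the real head node name and service port:
# print(probe_tcp("HEADNODE", 12345))
```

Running this from an affected node and from a healthy node at the time of the failure would show whether the refusal is specific to some nodes or to the head node's port.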
All replies
Friday, February 15, 2008 19:54
Did this error happen once or twice, or is it happening consistently?
Have you had a chance to run “ipconfig /all” on the offending node versus a non-offending node and note if there were any differences?
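If it helps, the comparison can be done mechanically. The sketch below assumes the output of `ipconfig /all` from each node has been saved as text (for example by redirecting it to a file) and uses Python's standard `difflib` module; the function name `diff_ipconfig` is made up for illustration.

```python
import difflib

def diff_ipconfig(good_text, bad_text):
    """Return the lines that differ between two saved `ipconfig /all` outputs.

    Lines prefixed "- " appear only in good_text,
    lines prefixed "+ " appear only in bad_text.
    """
    # Drop blank lines and indentation so only the settings are compared
    good = [line.strip() for line in good_text.splitlines() if line.strip()]
    bad = [line.strip() for line in bad_text.splitlines() if line.strip()]
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only the changes
    return [line for line in difflib.ndiff(good, bad)
            if line.startswith(("- ", "+ "))]
```

Host name, IP address, and physical address lines will always differ between two nodes. The lines worth a closer look are things like DNS servers, default gateway, subnet mask, and DHCP settings.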
- Marked as answer by Don Pattee (Moderator), Friday, May 22, 2009 20:37