Enterprise network connection goes on/off every 2-5 seconds on Windows HPC Server 2008

  • Question

  • I have a weird issue with my Windows HPC Server 2008 cluster:

    When I start a Remote Desktop connection from the cluster's head node to any compute node, the Enterprise network connection suddenly drops and then reports a connected/disconnected cable every 2-5 seconds. Only a reboot of the head node makes the problem go away.

    The problem occurs with both the generic Remote Desktop client and the remote desktop panel in HPC Cluster Manager. It also appears when I run the "Pending Software Updates" test from the HPC Cluster Manager Diagnostics tab.

     

    At the same time I can run commands and submit jobs on compute nodes through the HPC Cluster Manager without any problems.

    The cluster consists of 9 nodes (including the head node). Network topology is "Topology 3": compute nodes are isolated on private and application networks.
    Enterprise and Private network adapters: Intel(R) 82575EB Gigabit
    Application network adapter: OpenFabrics

     

    The head node roles are: Active Directory Domain Services, DHCP and DNS server, File Services, Network Policy and Access Services, Windows Deployment Services

     

    When the problem arises, event viewer for the head node lists:

     

    1)    EventID: 27

    Source: e1qexpress

    Network link has been disconnected.

    2)    EventID: 32

    Source: e1qexpress

    Network link has been established at 1Gbps full duplex.

    3)    EventID: 4201

    Source: Tcpip

    The system detected that network adapter Enterprise was connected to the network, and has initiated normal operation.

     

    The sequence of these 3 events is repeated every 2-5 seconds until I reboot the head node.
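
    The repeating pattern above (EventID 27 followed by EventID 32 a few seconds later) can be counted from an exported System log. Below is a minimal Python sketch, assuming the events have already been exported as (timestamp, event ID) pairs sorted by time. The function name, the 5-second window, and the sample timestamps are illustrative assumptions, not values taken from the actual log.

```python
from datetime import datetime, timedelta

# Event IDs logged by the e1qexpress driver, as listed in this thread.
LINK_DOWN = 27
LINK_UP = 32


def count_link_flaps(events, window=timedelta(seconds=5)):
    """Count link-down events that are followed by a link-up within `window`.

    `events` is an iterable of (timestamp, event_id) tuples sorted by time.
    Other event IDs (for example Tcpip 4201) are ignored.
    """
    flaps = 0
    last_down = None
    for timestamp, event_id in events:
        if event_id == LINK_DOWN:
            last_down = timestamp
        elif event_id == LINK_UP and last_down is not None:
            if timestamp - last_down <= window:
                flaps += 1
            last_down = None
    return flaps


# Hypothetical sample: two down/up cycles, each followed by Tcpip event 4201.
start = datetime(2009, 12, 29, 15, 0, 0)
sample = [
    (start, 27),
    (start + timedelta(seconds=2), 32),
    (start + timedelta(seconds=2), 4201),
    (start + timedelta(seconds=5), 27),
    (start + timedelta(seconds=8), 32),
    (start + timedelta(seconds=8), 4201),
]
print(count_link_flaps(sample))
```

    A steadily growing count after a Remote Desktop session starts would confirm that the link is flapping rather than dropping once.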

     

    I have also noticed that before the network drops for the first time, this event is recorded:

     

    EventID: 5782

    Source: NETLOGON

    Dynamic registration or deletion of one or more DNS records associated with DNS domain <my.corporate.domain> failed.

     

    But at the same time, all Connectivity tests in the HPC Cluster Manager succeed (including DNS Name Resolution) before the network drop.

     

    I’ve tried various DNS settings both in TCP/IP configuration for NICs and in DNS server console, but it did not help.

    Tuesday, December 29, 2009 3:09 PM

Answers

  • Looks like the drivers did the trick as there hasn't been any action on this thread in quite some time.  I'll propose the drivers resolution as the answer and you can reopen if necessary.
    Friday, July 30, 2010 8:52 PM

All replies

  • Hi Igor,

      I've found a few things related to this that will hopefully help.

      Firstly, according to this article on Microsoft's TechNet (http://www.microsoft.com/technet/support/ee/transform.aspx?ProdName=Windows%20Operating%20System&ProdVer=5.2&EvtID=5782&EvtSrc=NetLogon&LCID=1033)

      You can solve the issue relating to the EventID 5782 (Dynamic registration or deletion of one or more DNS records...) by completing the following steps:

       "To initiate dynamic deletion on the DNS server, do the following:

    1. Run DCDiag.exe.
      This program is located on your Windows Server 2003 CD in the Support\Tools folder.
    2. Fix any problems identified by DCDiag.exe.
    3. At the command prompt on the DC, type
      nltest.exe /dsregdns
      The Nltest.exe program is available on the Microsoft Windows Server 2000 Resource Kit CD."

    However, as I'm unsure which OS version you're running, I don't know whether this will be applicable. I think the key takeaway is "initiating dynamic deletion on the DNS server".

    I have also seen the following articles that may be of some help (again they are dated - but hopefully it can get you going in the right direction) http://www.eventid.net/display.asp?eventid=5782&eventno=481&source=NETLOGon&phase=1

    From the above link I saw the following tidbit of information that I suspect is useful.  (there are others on the same page that may be helpful also)

    Event ID 5782 - NETLOGON Warning

    "This will happen if your machine is configured to use a dns server that does not support or will not accept dynamic updates. This would be the case if your machine is pointing to the DNS server your ISP provided. To stop your computer from trying to register with dns simply go into the dns tab of the advanced tcp/ip properties of your connection and uncheck both boxes at the bottom of the window. However, if you are part of an Active Directory infrastructure and need to use dynamic updates then you must remove your ISP dns server from your settings and only use your w2k dns server"

    I am thinking that a change in your DNS settings to allow dynamic deletion will help with your networking - but I'm not 100% sure.

    Would you be able to test this and let us know if you still have any issues? 

    Thanks,
    Mark

    --
    Mark Staveley
    SDET II - US High Performance Computing
    Microsoft 

    Tuesday, December 29, 2009 7:28 PM
  • Hi Mark,

    Thanks for looking into it. Running nltest.exe /dsregdns unfortunately didn’t help with the network, but the EventID changed to 5781 instead of 5782:

     

    Dynamic registration or deletion of one or more DNS records associated with DNS domain 'my.corporate.domain.' failed.  These records are used by other computers to locate this server as a domain controller (if the specified domain is an Active Directory domain) or as an LDAP server (if the specified domain is an application partition). 

     

    Possible causes of failure include: 

    - TCP/IP properties of the network connections of this computer contain wrong IP address(es) of the preferred and alternate DNS servers

                [Igor]: For the Enterprise network connection I use a corporate DNS server from another domain as preferred and localhost as alternate DNS server. For both Private and Application network connections I use localhost as preferred DNS server. Is this configuration correct?

    - Specified preferred and alternate DNS servers are not running

                [Igor]: DNS Server service is running on localhost.

    - DNS server(s) primary for the records to be registered is not running

    [Igor]: How can I figure out which DNS server is primary for the records to be registered?

    - Preferred or alternate DNS servers are configured with wrong root hints

    [Igor]: How can I check correctness of root hints configuration?

    - Parent DNS zone contains incorrect delegation to the child zone authoritative for the DNS records that failed registration 

    [Igor]: How can I perform delegation checking?

     

    USER ACTION 

    Fix possible misconfiguration(s) specified above and initiate registration or deletion of the DNS records by running 'nltest.exe /dsregdns' from the command prompt on the domain controller or by restarting Net Logon service on the domain controller.

    [Igor]: I have restarted the Net Logon service, but it didn’t help.
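
    The DNS layout described above (a corporate DNS server preferred on the Enterprise connection, localhost preferred on Private and Application) can be compared against the advice quoted earlier in the thread, namely that a machine which needs dynamic updates should point first at a DNS server that accepts them. Here is a minimal Python sketch of that check. The function name and the IP addresses are hypothetical, and whether the corporate server accepts dynamic updates for this domain is an assumption that would have to be verified separately.

```python
def adapters_preferring_foreign_dns(adapters, updatable_servers):
    """Return the adapters whose preferred (first) DNS server is not
    in the set of servers assumed to accept dynamic updates.

    `adapters` maps an adapter name to its ordered DNS server list.
    """
    flagged = []
    for name, dns_servers in adapters.items():
        if dns_servers and dns_servers[0] not in updatable_servers:
            flagged.append(name)
    return flagged


# Hypothetical addresses: 192.0.2.53 stands in for the corporate DNS server,
# 127.0.0.1 for the DNS Server service running on the head node itself.
adapters = {
    "Enterprise": ["192.0.2.53", "127.0.0.1"],
    "Private": ["127.0.0.1"],
    "Application": ["127.0.0.1"],
}
print(adapters_preferring_foreign_dns(adapters, {"127.0.0.1"}))
```

    With this layout only the Enterprise connection is flagged, which is the adapter whose preferred server belongs to another domain.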


    Also, all DCdiag.exe tests passed successfully except for SystemLog test which indicated 1 suspicious entry:

          An Warning Event occurred.  EventID: 0x000003FC
          Time Generated: 12/30/2009   01:23:48
          Event String: Scope, <private network>, is 100 percent full with only 0 IP addresses remaining.

    It is related to the DHCP server lease scope, but I didn't find any information on whether this warning is harmful in the Microsoft Knowledge Base articles http://support.microsoft.com/kb/261964 and http://support.microsoft.com/kb/153072/. Can it cause the enterprise network failure?
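
    For reference, the arithmetic behind a "100 percent full" scope warning can be sketched as follows. This is a minimal Python illustration, not the DHCP server's own logic: the function name, the scope range, and the lease list are all hypothetical, and it says nothing about whether a full scope is related to the link drops.

```python
import ipaddress


def scope_usage(start, end, leased):
    """Return (total, remaining, percent_used) for an inclusive IPv4 range.

    `leased` is an iterable of address strings; addresses outside the
    range and duplicates are ignored.
    """
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(end))
    if last < first:
        raise ValueError("scope end is before scope start")
    total = last - first + 1
    in_scope = {
        int(ipaddress.IPv4Address(addr))
        for addr in leased
        if first <= int(ipaddress.IPv4Address(addr)) <= last
    }
    used = len(in_scope)
    return total, total - used, 100 * used // total


# Hypothetical private-network scope with 8 addresses, all of them leased.
leases = ["10.0.0.%d" % host for host in range(2, 10)]
print(scope_usage("10.0.0.2", "10.0.0.9", leases))
```

    A scope sized exactly to the number of compute nodes would report 0 remaining addresses as soon as every node holds a lease.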

     

    I have also tried various suggestions from the http://www.eventid.net/display.asp?eventid=5782&eventno=481&source=NETLOGon&phase=1 and http://eventid.net/display.asp?eventid=5781&eventno=167&source=NETLOGON&phase=1 web pages. I still get a 5781 or 5782 event, and after that the enterprise network goes off.

     

    [Update] The only thing I actually changed is that the NETLOGON failure (5781) now appears in the system log AFTER the network disconnects for the first time.

    Wednesday, December 30, 2009 10:52 AM
  • Hi Igor,

      To help with diagnosing this, would you be able to provide the output from ipconfig /all on the Head Node and the output from the Network Configuration Report (as generated through the head node - Cluster Manager - Network Configuration Wizard)?

      Also you may find the following thread interesting as it discusses something similar (not the same) but I suspect related. (http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/1b26df0d-e838-4fe0-8645-df95f4e5460a)

      Hopefully, I'll be able to do more once I get that information from you.

    Mark

    --
    Mark Staveley
    SDET II - US High Performance Computing
    Microsoft 
    Wednesday, December 30, 2009 5:38 PM
  • One other question Igor,

      Are the network drivers current for your network adapters?  Are the private and enterprise network connections on the same card / board?

    Thanks,
    Mark

    Wednesday, December 30, 2009 8:11 PM
  • Mark,

     

    Unfortunately I can no longer provide the requested information because we decided to rebuild the cluster in order to wipe out this problem.

     

    Before the OS uninstallation I’ve checked NIC drivers and they are current.

     

    The Private and Enterprise network connections are physically on different NICs, but the boards themselves are the same: Intel(R) 82575EB Gigabit

     

    I’ll let you know the results of the OS reinstallation in a few days.

     

    Thank you,

    --Igor

     

     

     

    Tuesday, January 5, 2010 11:35 PM
  • Please keep me posted.  The initial investigations I did made me think that the network interface is on the motherboard with two cable jacks, but I wanted to confirm this for your particular hardware.
    Wednesday, January 6, 2010 12:29 AM
  • One other thing Igor - I presume you are not only updating the drivers on the Head Node - but on the compute nodes also...
    Wednesday, January 6, 2010 12:30 AM