Workstation nodes losing communication ability to headnode

    Question

  • We have several workstation nodes. Initially they all joined the cluster fine: jobs could run on them and we could run HPC commands on them. Right now 3 nodes out of about 400 have lost connectivity to the headnode. HPC commands cannot be run on them (I already checked that the registry entry is correct), and jobs cannot be run on them. I deleted the node from HPC and reimaged it, including deleting its object from AD. Now I cannot assign a template to the node. It says:

    Time Message
    6/7/2016 10:43:50 AM Failed to communicate with remote SDM store. Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

    It makes no sense. The node can ping the headnode and the headnode can ping the node. They are in different OUs but in the same domain, and I don't think the OUs make a difference, as there are several working nodes in the same OU. Nevertheless, I even moved the node to the same OU as the headnode, ran gpupdate /force, and rebooted it. Same errors. DNS is working: nslookup resolves both the forward and reverse entries for the node and the headnode.
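For anyone repeating the forward and reverse DNS check on many nodes, here is a minimal sketch in Python. It is only an illustration of the nslookup test described above, not something from the original post; the function name and the injectable `forward`/`reverse` parameters are my own assumptions, added so the logic can be tried without a live DNS server.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then resolve that IP back to a name.

    Returns (ip, reverse_name, matches). matches is True when the short
    host name from the reverse record equals the short name we started with.
    """
    ip = forward(hostname)            # forward (A record) lookup
    reverse_name = reverse(ip)[0]     # reverse (PTR) lookup, primary name

    def short(name):
        # Compare only the first DNS label, case-insensitively.
        return name.split(".")[0].lower()

    return ip, reverse_name, short(reverse_name) == short(hostname)
```

Calling `check_forward_reverse("headnode")` on the node, and the same with the node's name on the headnode, should give matching names in both directions. A lookup failure raises an `OSError` subclass. Note that a clean result here only rules out DNS; it says nothing about authentication.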

    Does anyone have any ideas?

    Thanks!

    Tuesday, June 7, 2016 2:50 PM

All replies

  • What is also strange is that these nodes cannot access the \\headnode\reminst share. They can access all of the other shares, but not this one. I am using my own ID, and I can access that share just fine from the working nodes. When I check Computer Management, it says that the open mode for the open files is "no access".
    Wednesday, June 8, 2016 2:05 PM
  • Hi Nicka345,

    The access rule for the \\headnode\reminst share is that it can be accessed by all "Authenticated Users". "Authenticated Users" is a Windows built-in group that contains all users (and machine accounts) authenticated by a trusted domain controller. For more about "Authenticated Users", see https://social.technet.microsoft.com/Forums/office/en-US/e1a8e680-03a2-4690-a7e5-f17ad7389ecd/authenticated-users?forum=winserverDS

    If you cannot access \\headnode\reminst from these machines, something is wrong with the domain trust relationship. This may be the reason why these nodes cannot join the cluster.

    Sunday, June 12, 2016 2:36 AM
  • Thanks for the response. However, I'm not sure that is it. We have another headnode that has a reminst share that has the same permissions and the node is able to authenticate to that headnode.... 

    I'm wondering if there is some corrupted cached setting on the headnode that I can clear out. I have rebooted the workstation node, but I really can't reboot the headnode. I have flushed DNS and closed all of the open files in Computer Management. Any other ideas?

    Thanks!

    Monday, June 13, 2016 2:24 PM
  • Hi,

    You mean the workstation nodes can access the reminst share on another head node in the same OU as this head node?

    Since you removed the nodes from AD when re-imaging, their SIDs have changed. Maybe the head node cached the old SIDs. Try the following steps:

    1. Force the head node to refresh the SID cache by adding the registry value "LsaLookupCacheMaxSize", following the article https://support.microsoft.com/en-us/kb/946358

    2. Remove these workstation nodes from HPC cluster management console, and reboot the workstation nodes.

    3. Assign the node template again once the nodes show up again.

    4. Remove the registry value "LsaLookupCacheMaxSize" created in step 1.
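Steps 1 and 4 above could be scripted roughly as follows. This is a hedged sketch, not a tested procedure: it assumes the value lives under HKLM\SYSTEM\CurrentControlSet\Control\Lsa as a REG_DWORD and that setting it to 0 disables the lookup cache, which is how I read the KB article, so please check the article before running anything. The helper names are made up for illustration. The script only builds `reg.exe` command lines and prints them unless `dry_run=False` is passed.

```python
import subprocess

# Assumed location of the value; verify against KB 946358 before use.
LSA_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "LsaLookupCacheMaxSize"


def disable_sid_cache_cmd():
    """reg.exe arguments that set the cache size to 0 (step 1)."""
    return ["reg", "add", LSA_KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", "0", "/f"]


def restore_sid_cache_cmd():
    """reg.exe arguments that delete the value again (step 4)."""
    return ["reg", "delete", LSA_KEY, "/v", VALUE_NAME, "/f"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        # Needs an elevated prompt on the head node.
        subprocess.run(cmd, check=True)
```

On the head node, `run(disable_sid_cache_cmd())` shows what step 1 would do, and `run(restore_sid_cache_cmd())` does the same for step 4. Steps 2 and 3 still happen in the HPC management console in between.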

    Btw, what is your HPC Pack version?

    Tuesday, June 14, 2016 7:38 AM
  • Sorry for the late response.

    That is correct. The nodes are able to access the reminst share of another headnode in the same OU as the headnode that is having problems.

    I will try this and let you know. 

    It is Windows 2012 R2 Update 3.

    Thanks!

    Tuesday, June 21, 2016 7:55 PM