HPC Pack 2016 - Connections to Scheduler "forcibly closed"

  • Question

  • Is there any useful logging anywhere that can show me some potential causes for scheduler connections being "forcibly closed"?  I'm getting nothing out of the ordinary in the Windows Event Logs.

    Background...

    I just went through the Getting Started Guide for HPC 2016 (https://technet.microsoft.com/en-us/library/mt791808(v=ws.11).aspx). Simple setup: a single Server 2012 R2 head node with three Server 2012 R2 compute nodes. From the Cluster Manager, everything looks healthy, but any attempt to connect a Scheduler from Visual Studio or a console app results in "An existing connection was forcibly closed by the remote host." Same behavior whether I connect to localhost on the head node, or to the head node's FQDN or hostname from any other node.

    I am using the HPC 2012 R2 Update 3 Client Utilities and the HPC 2012 R2 Update 3 SDK, in Visual Studio 2015 Community. The "Connecting to a Cluster" MSDN example (https://msdn.microsoft.com/en-us/library/cc853425(v=vs.85).aspx) fails at Scheduler.Connect("HeadNodeName").
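    For reference, a minimal sketch of the failing call, modeled on the "Connecting to a Cluster" MSDN example mentioned above. It assumes a project reference to Microsoft.Hpc.Scheduler.dll from the HPC SDK; "HeadNodeName" is a placeholder for the actual head node hostname.

    ```csharp
    using System;
    using Microsoft.Hpc.Scheduler;

    class ConnectTest
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();

            // This is the call that throws "An existing connection was
            // forcibly closed by the remote host" when the client-side
            // SDK does not match the HPC Pack 2016 head node.
            scheduler.Connect("HeadNodeName");

            Console.WriteLine("Connected.");
            scheduler.Close();
        }
    }
    ```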

    Side note: a Wireshark trace shows the Syn, Syn-Ack, Ack sequence completing fine, and then the head node sends an RST. So I'm fairly sure it isn't a DNS, IP, network, or firewall issue, since I'm reaching the right host.

    Thursday, February 23, 2017 8:07 PM

Answers

  • Hi, Matt

      Support for lower-version clients will come in a few months (in HPC Pack 2016 Update 1); sorry for the trouble. For now, you need to use the latest SDK, which is currently in beta: https://www.nuget.org/packages/Microsoft.HPC.SDK/5.0.5852-beta1

      To view the logs, you can get the Log Viewer app from this site: https://hpctss.azurewebsites.net/.


    Qiufang Shi

    • Marked as answer by MattManDL Friday, February 24, 2017 3:33 PM
    Friday, February 24, 2017 1:37 AM
  • Hi, Matt:

    Could you confirm whether the HPC Pack 2016 cluster is in a domain-joined environment?

    And could you verify again that, whenever a diagnostic test runs, the files are written to the "DiagnosticsShare"? (The path in use can be found with the PowerShell command "add-pssnapin Microsoft.hpc; Get-HpcClusterRegistry".)
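    A hedged sketch of that lookup, assuming Get-HpcClusterRegistry returns name/value pairs (the exact property names may vary by HPC Pack version):

    ```powershell
    # Load the HPC snap-in and list cluster registry entries,
    # narrowing to ones that mention a share path.
    Add-PSSnapin Microsoft.Hpc
    Get-HpcClusterRegistry | Where-Object { $_.Name -like '*Share*' }
    ```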

    For Diagnostics, each run creates several Scheduler admin jobs. You can find these admin jobs via Diagnostics View -> 'Pivot To' in the right Actions panel -> 'Jobs for the Tests'. The issue you described is actually the admin job failing during its Post Step.

    Could you run a new diagnostic test, provide the failed admin job's ID, and email the latest few HpcNodeManager_*.bin and HpcScheduler_*.bin logs from %CCP_DATA%LogFiles\Scheduler on the head node (the ones generated during that test) to hpcpack@microsoft.com?

    Thanks,

    Jason

     

    • Marked as answer by MattManDL Friday, February 24, 2017 3:33 PM
    Friday, February 24, 2017 3:20 AM

All replies

  • Looking in C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles, I see all kinds of files, but they are all .bin files and aren't legible. Is there some sort of log viewer application needed to review these?

    Also, running ANY of the diagnostics reports inside HPC Cluster Manager ends with:

    Post Step run failed due to : Error from node: <headnode> :Microsoft.Hpc.Activation.NodeManagerException: One or more errors occurred.Exception 'System.AggregateException: One or more errors occurred. ---> System.ComponentModel.Win32Exception: The system cannot find the path specified

    I found some other posts that hint at a missing "Diagnostics" share on the head node, but it is there and files are written to it each time a Diag Report is run.

    Thanks!

    Thursday, February 23, 2017 9:57 PM
  • Thank you very much for the reply! I was afraid I was going to have to tear this all down and start over.

    I was able to open Visual Studio, use the Package Manager Console to run the "Install-Package Microsoft.HPC.SDK -Pre" command listed in the link you provided (https://www.nuget.org/packages/Microsoft.HPC.SDK/5.0.5852-beta1), and now Visual Studio and console apps can connect to the cluster! Thank you so much!

    Also, I was able to download the Log Viewer app from https://hpctss.azurewebsites.net. For anyone else reading this, you don't have to log in or register an email; there is a "Get Log Viewer App" link at the very top (https://hpconlineservice.blob.core.windows.net/logviewer/LogViewer.UI.application).

    Thanks!

    • Edited by MattManDL Friday, February 24, 2017 3:32 PM
    Friday, February 24, 2017 3:13 PM
  • I think I have this all working now, although I'm not 100% sure what fixed it. Yes, it is a domain-joined environment; the head node and all compute nodes are joined to the same domain, all on the Enterprise network.

    I uninstalled all the HPC 2012 R2 components, which broke the HPC PowerShell snap-in. I then reinstalled HpcClient_x64.msi from the HPC Pack 2016 media (it is in the "setup" folder when you extract the HPC Pack 2016 zip file), which fixed the snap-in. I was able to run the Get-HpcClusterRegistry cmdlet, and its output all looked good: paths, shares, and databases.

    Now when I run Diagnostic Reports, all of them complete and succeed.

    I didn't change anything with any of the shares, permissions, etc., but all reports are working now. Could the HPC 2012 R2 Client Utilities and SDK have been causing an issue with HPC Cluster Manager? Once I uninstalled those, reinstalled the HPC 2016 Client Utilities, and installed Microsoft.HPC.SDK 5.0.5852-beta1, everything began to work.

    FYI, I installed the beta SDK on both the head node and the client where I'm running Visual Studio. I rather doubt it was necessary on the head node, though, correct?

    Thank you both (Jason and Qiufang) for your replies.  I am really excited to get this cluster up and running, we have some exciting projects for it.

    -Matt

    Friday, February 24, 2017 3:20 PM