HPC Node Manager will not start

  • Question

  • After using http://technet.microsoft.com/en-us/library/gg247477%28WS.10%29.aspx to enable the HPC_CreateConsole environment variable, the HPC Node Manager service stops as soon as it is started.

    I see this error message in the Event Logs.

    Service cannot be started. Microsoft.Hpc.Scheduler.Session.SessionException: Can't connect to the scheduler. ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:5800

    Server stack trace:
       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
       at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
       at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket(EndPoint ipEndPoint)
       at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket()
       at System.Runtime.Remoting.Channels.SocketCache.GetSocket(String machinePortAndSid, Boolean openNew)
       at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.SendRequestWithRetry(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream)
       at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Strea...

    Also, the service is not able to start on a server that did not get the registry commands. I did notice that although I did not change the registry on one of the servers, the registry changes still were applied to that machine.

    I am not sure how to fix this problem.

    Thanks,

    Daniel

    Tuesday, October 19, 2010 8:49 PM

All replies

  • Hi Daniel,

    Thanks a lot for reporting the issue. However, we will need some more information to understand the problem. I assume that you are using Windows HPC Server 2008 R2.

    Was your cluster up and running before you made the changes related to HPC_CreateConsole? Could you run simple jobs on your cluster at the time?

    Which of the specified options, clusrun or node template, did you use to enable the HPC_CreateConsole setup?

    Could you tell us a bit about the cluster you are trying to set up? How many compute nodes do you have, and how are they connected to the head node (specifically, the network topology you chose while setting up the cluster)?

    Are the compute nodes on your cluster domain joined?

    Answers to these questions will allow us to help you better.

    Thanks

    sayantan

     

    Wednesday, October 20, 2010 5:46 PM
    Moderator
  • I'm another admin in Daniel's group. Answers inline below:

    I assume that you are using Windows HPC Server 2008 R2.

    Yes.

     

    Was your cluster up and running before you made the changes related to HPC_CreateConsole? Could you run simple jobs on your cluster at the time?

    Yes. It was working fine.

     

    Which of the specified options, clusrun or node template, did you use to enable the HPC_CreateConsole setup?

    We used the "Run command on node" tool from the HPC Node Manager GUI, and ran "reg add HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager /v HpcConsoleSupport /t REG_DWORD /d 1 /f & reg add HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v SoftwareSASGeneration /t REG_DWORD /d 1 /f" on all the nodes (we wanted all nodes to have this capability). The registry keys were added successfully, too.

     

    Could you tell us a bit about the cluster you are trying to set up? How many compute nodes do you have, and how are they connected to the head node (specifically, the network topology you chose while setting up the cluster)?

    We have 10 compute nodes running Windows Server 2008 R2 with the HPC Pack add-on. They are all on the "Enterprise" network, although on different subnets. Our head node is accessible from workstations and compute nodes, but compute nodes are only accessible from the head node (due to the physical router/switch setup and the demands of the overarching IT group here). The head node acts as an AD controller for the domain. When we updated the registry keys on the nodes, the service did not start on reboot; now the service doesn't start on the head node either.

     

    Are the compute nodes on your cluster domain joined?

    Yes.

     

    Thanks so much for your help.

    Eli.

    • Edited by elansey Wednesday, October 20, 2010 6:02 PM Corrected command line
    Wednesday, October 20, 2010 6:00 PM
  • Hi,

    From the exception posted in the first message, it seems that the scheduler service itself is having trouble running. I would first like to find out whether the HPC scheduler service is running. If it is not, I would like to try restarting it and see whether that works. If it does not, I would like to find the reason.
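A quick way to confirm the "actively refused" symptom independently of the HPC services is to probe the endpoint from the exception. This is only a minimal sketch: the helper name port_is_open is made up for illustration, and the only thing taken from the thread is the address 127.0.0.1:5800 from the SocketException.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:5800 is the endpoint named in the SocketException in the first post.
    print(port_is_open("127.0.0.1", 5800))
```

A result of False on the head node is consistent with nothing listening on that port, which fits the idea that the scheduler service is not running.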

    On the head node, from an elevated command window (running as an administrator), could you run:

    sc query hpcscheduler

    Does that show the hpcscheduler as running or stopped?

    If it is not running, could you run the following (please note the time):

    net start hpcscheduler

    Does the command say that the service started?

    If it did not, it would be great if you could check the HPC scheduler service's event log. You can find it in Event Viewer under:
    Application and Services Logs\Microsoft\Hpc\Scheduler\Operational.

    In this particular event log, could you see if there are any errors or warnings at the time at which you attempted the net start hpcscheduler?

    This information will help us see if the hpcscheduler service itself is facing some problems.

    If this service is working fine, we will investigate the HPC Node Manager service next.

    thanks

    sayantan

     

    Wednesday, October 20, 2010 7:47 PM
    Moderator
    Hi Sayantan, I am another admin following up on this problem. I restarted the server a few times with no luck. Before the first reply, I completely uninstalled HPC Pack from the head node, and when I tried to reinstall it I got an error starting the service. Setting that aside for a moment, I ran the commands that you requested, and this is what I got:

     

    C:\Users\Administrator>sc query hpcscheduler

    SERVICE_NAME: hpcscheduler
    TYPE : 10 WIN32_OWN_PROCESS
    STATE : 1 STOPPED
    WIN32_EXIT_CODE : 1077 (0x435)
    SERVICE_EXIT_CODE : 0 (0x0)
    CHECKPOINT : 0x0
    WAIT_HINT : 0x0
    and
    C:\Users\Administrator>net start hpcscheduler
    The HPC Job Scheduler Service service is starting.
    The HPC Job Scheduler Service service could not be started.

    The service did not report an error.

    More help is available by typing NET HELPMSG 3534.
    and last
    C:\Users\Administrator>NET HELPMSG 3534

    The service did not report an error.
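For anyone collecting this from several nodes, the sc query output above can be read with a few lines of script. The following is a rough sketch, not an official tool: parse_sc_query and KNOWN_EXIT_CODES are names invented here, and the sample text is the output pasted in this post. Win32 error 1077 is ERROR_SERVICE_NEVER_STARTED.

```python
import re

# Win32 error codes relevant here; 1077 is ERROR_SERVICE_NEVER_STARTED.
KNOWN_EXIT_CODES = {
    0: "no error recorded",
    1077: "no attempts to start the service have been made since the last boot",
}


def parse_sc_query(text: str) -> dict:
    """Extract the state name and Win32 exit code from `sc query` output."""
    state = re.search(r"^\s*STATE\s*:\s*(\d+)\s+(\w+)", text, re.MULTILINE)
    code = re.search(r"^\s*WIN32_EXIT_CODE\s*:\s*(\d+)", text, re.MULTILINE)
    if not state or not code:
        raise ValueError("unrecognized sc query output")
    exit_code = int(code.group(1))
    return {
        "state": state.group(2),
        "win32_exit_code": exit_code,
        "meaning": KNOWN_EXIT_CODES.get(exit_code, "unknown"),
    }


# Sample copied from the output in this post.
SAMPLE = """\
SERVICE_NAME: hpcscheduler
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 1077 (0x435)
SERVICE_EXIT_CODE : 0 (0x0)
"""

print(parse_sc_query(SAMPLE))
```

Note that 1077 only says no start had been attempted since boot at the time of the query; it does not explain why the later net start failed.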

     

    Now, coming back to reinstalling HPC Pack on the head node, I get the following error:

    Service 'HPC Node Manager Service' (HpcNodeManager) could not be installed. Verify that you have sufficient privileges to install system services. [Cancel],[Retry],[Ignore]

    If I hit Retry, I get the same message. If I hit Ignore or Cancel, it stops the installation and gives me a log path: C:\Windows\Temp\HPCSetupLogs

    Note: I am running the installation as Administrator

    In the logs folder there are too many files for me to post the actual logs, so I'm just going to list them; please tell me which ones you want me to post.

     

    hpcMsi-20101003-1547.txt
    hpcMsi-20101003-1548.txt
    hpcMsi-20101003-1555.txt
    hpcMsi-20101003-1600.txt
    hpcMsi-20101003-1620.txt
    hpcMsi-20101020-1315.txt
    setup-20101003-1547.txt
    setup-20101003-1548.txt
    setup-20101003-1555.txt
    setup-20101003-1600.txt
    setup-20101003-1620.txt
    setup-20101020-1315.txt
    upgradeV2ToV3-20101003-1547.txt
    upgradeV2ToV3-20101003-1548.txt
    upgradeV2ToV3-20101003-1555.txt

     

    Thank you!

     

    Thursday, October 21, 2010 10:16 AM
  • Hi,

    Could you send me your latest log files:

    setup-20101020-1315.txt

    hpcMsi-20101020-1315.txt

    My email address is lutom@microsoft.com

    Thanks,
    Łukasz

    Thursday, October 21, 2010 6:16 PM
  • Hi,

    One possible reason for the current situation is that the permissions for the

    HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager

    registry key have been altered. It looks like this entry cannot be accessed when starting the service or performing the installation.

    To check if this is true you can try to run:

    reg query HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager

    as the user who is performing the installation. The following result would confirm it:

    ERROR: Access is denied.

    In that case, you can find out more about the current permissions by running 'regedit', navigating to the key above, and selecting 'Permissions' from its context menu.
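The check described above can also be scripted when there are several nodes to test. This is a sketch under assumptions: the function names classify_reg_query and check_key are invented here, and it only relies on reg query printing "ERROR: Access is denied." as quoted above and returning a nonzero exit code on failure.

```python
import subprocess

# Key named in this post.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager"


def classify_reg_query(returncode: int, output: str) -> str:
    """Classify the result of `reg query` as readable, access-denied, or other-error."""
    if "access is denied" in output.lower():
        return "access-denied"
    if returncode != 0:
        return "other-error"
    return "readable"


def check_key(key: str = KEY) -> str:
    """Run `reg query` (Windows only) as the current user and classify the result."""
    proc = subprocess.run(["reg", "query", key], capture_output=True, text=True)
    return classify_reg_query(proc.returncode, proc.stdout + proc.stderr)
```

Calling check_key() from a prompt running as the installing user should return "access-denied" if the permissions theory is right, and "readable" otherwise.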

    Thanks,
    Łukasz

    Monday, October 25, 2010 3:01 PM
    After some fighting with this thing, I got almost everything installed: the HPC Pack 2008 R2 MS-MPI Redistributable Pack and the client components. Now I'm stuck at the same place I was: I can't start the HPC Node Manager service to finish installing the server components. I referred back to your suggestion and ran reg query HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager

    and this is what I got.

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HpcNodeManager
     Type REG_DWORD 0x10
     Start REG_DWORD 0x2
     ErrorControl REG_DWORD 0x1
     ImagePath REG_EXPAND_SZ "D:\Program Files\Microsoft HPC Pack 2008 R2\Bin\HpcNodeManager.exe"
     DisplayName REG_SZ HPC Node Manager Service
     DependOnService REG_MULTI_SZ rpcss
     ObjectName REG_SZ LocalSystem
     Description REG_SZ Manages processes for applications that run on a Windows HPC Server cluster.
     FailureActions REG_BINARY 8051010001000000010000000300000014000000010000003075000001000000307500000100000030750000
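For readers unfamiliar with the raw service values, here is a small sketch that decodes the listing above. The helper names are made up for illustration; the Start mapping is the standard Windows service start-type numbering (2 means automatic start), and the sample is a subset of the output in this post.

```python
# Standard Windows service start types, keyed by the Start registry value.
START_TYPES = {0: "boot", 1: "system", 2: "automatic", 3: "manual", 4: "disabled"}


def parse_reg_values(text: str) -> dict:
    """Map value names to (type, data) pairs from `reg query` output."""
    values = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Value lines look like: <name> REG_<TYPE> <data>; key header lines are skipped.
        if len(parts) == 3 and parts[1].startswith("REG_"):
            values[parts[0]] = (parts[1], parts[2].strip())
    return values


def describe_start(values: dict) -> str:
    """Translate the Start DWORD (for example 0x2) into a readable start type."""
    _kind, raw = values["Start"]
    return START_TYPES.get(int(raw, 16), "unknown")


# Subset of the output copied from this post.
SAMPLE = r"""
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HpcNodeManager
 Type REG_DWORD 0x10
 Start REG_DWORD 0x2
 ObjectName REG_SZ LocalSystem
"""

values = parse_reg_values(SAMPLE)
print(describe_start(values), values["ObjectName"][1])
```

Read this way, the key shows an automatic-start service running as LocalSystem, so the service definition itself looks ordinary.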
    

     

    The permissions seem to be OK, but I cannot start the service, and the logs only tell me this:

     

    Product: Microsoft HPC Pack 2008 R2 Server Components -- Error 1920. The HPC Node Manager Service (HpcNodeManager) 
    failed to start. For more information about this error, review the Windows HPC Server event log in
    Event Viewer, under Applications and Services Logs.

     

    Tuesday, October 26, 2010 8:34 AM
    I've looked through some of the troubleshooting done so far. It appears your question is going to need more in-depth troubleshooting (looking into permissions, group policy/local security settings, and so forth), which falls into the paid support category. Please visit the link below to see the paid support options that are available: http://support.microsoft.com/default.aspx?id=fh;en-us;offerprophone
    Tuesday, October 26, 2010 6:43 PM
  • I'm having the same trouble. 4 out of 5 nodes in a homogeneous cluster installed correctly, but the final one hangs during the "Starting Services" step of server tools installation.

    Was this issue resolved?

    Sunday, November 7, 2010 2:36 AM
  • We found two group policies that prevented the service from starting. We're still not quite sure why.
    Monday, November 8, 2010 4:16 PM
  • Which policies were they?
    Monday, November 8, 2010 6:58 PM
  • Rather messy, large policies. Nothing in them obviously caused the problem. Contact me via email if you want the full policy export file.  elansey@gmail.com
    Monday, November 8, 2010 7:08 PM
  • Ah, thanks. Not sure that I'm skilled enough to parse them anyhow. I see that this R2 system was an upgrade vs. clean install, maybe inherited some old conflicting policies. Will wipe and try again.
    Monday, November 8, 2010 7:11 PM
  • Just to update - I wiped the drive, installed R2, and everything came out roses. Must have been some grandfathered policy as elansey describes. 
    Wednesday, January 12, 2011 6:01 AM
    Did you ever figure out what part of the policy could be killing it? I am having the same problem, and I cannot remove the policy that is on it, due to program restrictions. The interesting thing is that mine seems to work sometimes, then after a reboot it stops working on only SOME clients in the SAME OU getting the SAME Group Policy.

    Thank you,

    Dennis West

    Tuesday, October 8, 2013 5:56 PM
    Lukasz, I am not sure this was ever answered. The root cause is "SOMETHING IN SOME POLICY", but what? I am having the same issue on machines hardened to required Govt. standards and cannot figure out which setting it could be.
    Tuesday, October 8, 2013 10:33 PM
    I had the same problem in HPC Pack 2012 R2: "HPC node manager service (HPCNodeManager) failed to start. Verify that you have sufficient privileges to start system services"

    This was also due to group policies, and in particular for me it was FIPSAlgorithmPolicy.

    I extracted the log files in C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\ using these commands:

    cd C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\hpctrace 
    parselog hpc*.bin -s

    and found this:

    HPC Node Manager Service startup failed due to This implementation is not part of the Windows Platform FIPS validated cryptographic algorithms

    I disabled the FIPS algorithm policy in the registry under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy (set "Enabled" to 0, then reboot), reinstalled, and it all worked fine.
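To see whether a node has this setting before touching anything, the value can be read without modifying it. The following is a minimal sketch: read_fips_enabled is a name invented here, the key path and the "Enabled" value name are the ones given above, and winreg is Python's standard-library registry module on Windows. On other platforms, or if the value is missing, it returns None.

```python
import sys

# Key path from this post, relative to HKEY_LOCAL_MACHINE.
FIPS_KEY = r"System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy"


def read_fips_enabled():
    """Return the FIPS policy 'Enabled' value (0 or 1), or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # The registry only exists on Windows.
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FIPS_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "Enabled")
            return int(value)
    except OSError:
        # Key or value missing, or access denied.
        return None


print(read_fips_enabled())
```

If a group policy enforces FIPS mode, a manual registry change may be reverted at the next policy refresh, so the policy itself may also need to be adjusted.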

    This thread is now quite old but hopefully this helps someone out!


    Tuesday, June 10, 2014 2:04 PM
    Can you please tell us what the exact issue in the group policies was? I'm facing the same issue with HPC Pack 2012.

    Thanks!

    Thursday, November 13, 2014 12:43 AM
  • Changed the WCF service to just run as local system account instead of specifying it in the service panel.
    Sunday, November 10, 2019 6:17 PM