HPC Node Manager will not start
-
Tuesday, October 19, 2010 20:49
After using http://technet.microsoft.com/en-us/library/gg247477%28WS.10%29.aspx to enable the HPC_CreateConsole environment variable, the HPC Node Manager service stops as soon as it is started.
I see this error message in the Event Logs.
Service cannot be started. Microsoft.Hpc.Scheduler.Session.SessionException: Can't connect to the scheduler. ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:5800
Server stack trace:
at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket(EndPoint ipEndPoint)
at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket()
at System.Runtime.Remoting.Channels.SocketCache.GetSocket(String machinePortAndSid, Boolean openNew)
at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.SendRequestWithRetry(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream)
at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Strea...
Also, the service is not able to start on a server that did not get the registry commands. I did notice that although I did not change the registry on one of the servers, the registry changes were still applied to that machine.
I am not sure how to fix this problem.
Thanks,
Daniel
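The SocketException in the post above ("actively refused it 127.0.0.1:5800") means that nothing accepted a TCP connection on that endpoint when the HPC Node Manager service tried to reach the scheduler. A minimal sketch for repeating that check by hand, assuming Python is available on the machine; the function name `port_is_listening` is hypothetical, and the host and port are taken only from the exception message:

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection performs a plain TCP connect; a refused or
        # timed-out connection raises OSError (or a subclass of it).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:5800 is the endpoint named in the exception message.
    host, port = "127.0.0.1", 5800
    state = "accepting connections" if port_is_listening(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

A "not reachable" result only confirms that no process is listening there; it does not say why the scheduler is down.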
All replies
-
Wednesday, October 20, 2010 17:46 Moderator
Hi Daniel,
Thanks a lot for reporting the issue. However, we will need some more information to understand the problem. I assume that you are using Windows HPC Server 2008 R2.
Was your cluster up and running before you made the changes related to HPC_CreateConsole? Could you run simple jobs on your cluster at the time?
Which of the specified options, clusrun or node template, did you use to enable the HPC_CreateConsole setup?
Could you tell us a bit about the cluster you are trying to set up? How many compute nodes do you have, and how are they connected to the head node (specifically, the network topology you chose while setting up the cluster)?
Are the compute nodes on your cluster domain joined?
Answers to these questions will allow us to help you better.
Thanks
sayantan
-
Wednesday, October 20, 2010 18:00
I'm another admin in Daniel's group. Answers inline below:
I assume that you are using Windows HPC Server 2008 R2.
Yes.
Was your cluster up and running before you made the changes related to HPC_CreateConsole? Could you run simple jobs on your cluster at the time?
Yes. It was working fine.
Which of the specified options, clusrun or node template did you use to enable the HPC_CreateConsole setup?
We used the "Run command on node" tool from the HPC Node Manager GUI, and ran "reg add HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager /v HpcConsoleSupport /t REG_DWORD /d 1 /f & reg add HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System /v SoftwareSASGeneration /t REG_DWORD /d 1 /f" on all the nodes (we wanted all nodes to have this capability). The registry keys were added successfully, too.
Could you tell us a bit about the cluster you are trying to set up? How many compute nodes do you have, and how are they connected to the head node (specifically, the network topology you chose while setting up the cluster)?
We have 10 compute nodes running Windows Server 2008 R2 + HPC Pack add-on. They are all on the "Enterprise" network, although on different subnets. Our head node is accessible from workstations and compute nodes, but compute nodes are only accessible from the head node (due to the physical router/switch and the demands of the overarching IT group here). The head node acts as an AD controller for the domain. When we updated the registry keys on the nodes, the service did not start on reboot; now the service doesn't start on the head node either.
Are the compute nodes on your cluster domain joined?
Yes.
Thanks so much for your help.
Eli.
- Edited by elansey Wednesday, October 20, 2010 18:02 Corrected command line
-
Wednesday, October 20, 2010 19:47 Moderator
Hi,
From the exception posted in the first message, it seems that the scheduler service itself is having some trouble running. I would first like to figure out whether the HPC scheduler service is running. If it is not, I would like to try restarting it to see if it works. If it does not, I would like to find the reason.
On the headnode, from an elevated command window (running as an admin), could you do a
sc query hpcscheduler
Does that show the hpcscheduler as running or stopped?
If it is not running, could you do (please note the time)
net start hpcscheduler
Does the command say that the service started?
If it did not, it would be great if you could check the HPC scheduler service's event log. You can find the specific one in the Event Viewer under:
Application and Services Logs\Microsoft\Hpc\Scheduler\Operational.
In this particular event log, could you see if there are any errors or warnings at the time at which you attempted the net start hpcscheduler?
This information will help us see if the hpcscheduler service itself is facing some problems.
If this service is working fine, we will investigate the HPC Node Manager service.
thanks
sayantan
-
Thursday, October 21, 2010 10:16
Hi Sayantan, I am another admin following up on this problem. I restarted the server a few times with no luck. Before the first reply I had completely uninstalled HPC Pack from the head node, and when I tried to install it again I got an error trying to start the service. Setting that aside for a moment, I ran the commands that you requested, and this is what I get:
C:\Users\Administrator>sc query hpcscheduler
and
SERVICE_NAME: hpcscheduler
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 1077  (0x435)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
C:\Users\Administrator>net start hpcscheduler
and last
The HPC Job Scheduler Service service is starting.
The HPC Job Scheduler Service service could not be started.
The service did not report an error.
More help is available by typing NET HELPMSG 3534.
C:\Users\Administrator>NET HELPMSG 3534
The service did not report an error.
Now, coming back to reinstalling HPC Pack on the head node, I get the following error:
Service 'HPC Node Manager Service' (HpcNodeManager) could not be installed. Verify that you have sufficient privileges to install system services. [Cancel],[Retry],[Ignore]
If I hit Retry, I get the same message. If I hit Ignore or Cancel, the installation stops and gives me a log path: C:\Windows\Temp\HPCSetupLogs
Note: I am running the installation as Administrator.
The logs folder contains too many files for me to post the actual logs, so I'm just going to list them; please tell me which ones you want me to post.
hpcMsi-20101003-1547.txt
hpcMsi-20101003-1548.txt
hpcMsi-20101003-1555.txt
hpcMsi-20101003-1600.txt
hpcMsi-20101003-1620.txt
hpcMsi-20101020-1315.txt
setup-20101003-1547.txt
setup-20101003-1548.txt
setup-20101003-1555.txt
setup-20101003-1600.txt
setup-20101003-1620.txt
setup-20101020-1315.txt
upgradeV2ToV3-20101003-1547.txt
upgradeV2ToV3-20101003-1548.txt
upgradeV2ToV3-20101003-1555.txt
Thank you!
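The `sc query` output in the post above can also be read by a script, for example when the same check has to be repeated on several nodes. The following is a minimal sketch, not an official tool: `parse_sc_query` and `KNOWN_EXIT_CODES` are hypothetical names, and the sample text carries the values posted in this thread. Win32 error 1077 (0x435) is ERROR_SERVICE_NEVER_STARTED, which means no start attempt had been made since the last boot at the time of the query.

```python
# Values as posted in this thread (whitespace may differ on a real console).
SAMPLE = """\
SERVICE_NAME: hpcscheduler
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 1077 (0x435)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
"""

# Small lookup of Win32 error codes relevant here; extend as needed.
KNOWN_EXIT_CODES = {
    0: "no error",
    1077: "no attempts to start the service have been made since the last boot",
}


def parse_sc_query(text: str) -> dict:
    """Extract service name, state, and Win32 exit code from `sc query` text."""
    fields = {}
    for line in text.splitlines():
        # Each field line has the form "KEY : VALUE"; lines without a colon
        # (for example the flags line under STATE) are skipped.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    state_code, state_name = fields["STATE"].split()[:2]
    exit_code = int(fields["WIN32_EXIT_CODE"].split()[0])
    return {
        "name": fields["SERVICE_NAME"],
        "state": state_name,
        "state_code": int(state_code),
        "win32_exit_code": exit_code,
        "exit_meaning": KNOWN_EXIT_CODES.get(exit_code, "unknown"),
    }


if __name__ == "__main__":
    info = parse_sc_query(SAMPLE)
    print(f"{info['name']}: {info['state']} "
          f"(Win32 exit code {info['win32_exit_code']}: {info['exit_meaning']})")
```

On a head node, the text could come from running `sc query hpcscheduler` and capturing its output instead of using the embedded sample.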
-
Thursday, October 21, 2010 18:16
Hi,
Could you send me your latest log files:
setup-20101020-1315.txt
hpcMsi-20101020-1315.txt
My email address is lutom@microsoft.com
Thanks,
Łukasz
-
Monday, October 25, 2010 15:01
Hi,
One possible reason for the current situation is that the permissions on the
HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager
registry key have been altered. It looks like this entry cannot be accessed when starting the service or performing the installation.
To check if this is true you can try to run:
reg query HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager
as the user who performs the installation. The following result would confirm it:
ERROR: Access is denied.
In that case, you can find out more about the current permissions by running 'regedit', navigating to the key mentioned above, and selecting 'Permissions' from its context menu.
Thanks,
Łukasz
-
Tuesday, October 26, 2010 08:34
After some fighting with this thing I got almost everything installed: the HPC Pack 2008 R2 MS-MPI Redistributable Pack and the client components. Now I'm stuck at the same place I was: I can't start the HPC Node Manager Service to finish installing the server components. I referred back to your suggestion:
I ran reg query HKLM\SYSTEM\CurrentControlSet\Services\HpcNodeManager
and this is what I got.
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HpcNodeManager
    Type    REG_DWORD    0x10
    Start    REG_DWORD    0x2
    ErrorControl    REG_DWORD    0x1
    ImagePath    REG_EXPAND_SZ    "D:\Program Files\Microsoft HPC Pack 2008 R2\Bin\HpcNodeManager.exe"
    DisplayName    REG_SZ    HPC Node Manager Service
    DependOnService    REG_MULTI_SZ    rpcss
    ObjectName    REG_SZ    LocalSystem
    Description    REG_SZ    Manages processes for applications that run on a Windows HPC Server cluster.
    FailureActions    REG_BINARY    8051010001000000010000000300000014000000010000003075000001000000307500000100000030750000
The permissions seem to be OK, but I cannot start the service,
and the logs only tell me this:
Product: Microsoft HPC Pack 2008 R2 Server Components -- Error 1920. The HPC Node Manager Service (HpcNodeManager)
failed to start. For more information about this error, review the Windows HPC Server event log in
Event Viewer, under Applications and Services Logs.
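The `reg query` output in the post above shows that the key is readable and how the service is configured. As a rough sketch of how those values could be checked by script, the following parses an abridged copy of that output. It assumes the usual `reg query` layout (value lines indented, with name, type, and data separated by runs of spaces); `parse_reg_query` and `start_type` are hypothetical helper names. The `Start` values follow the standard Windows service start types (2 is automatic).

```python
import re

# Abridged copy of the output posted above, in reg.exe's usual layout.
SAMPLE = r"""
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HpcNodeManager
    Type    REG_DWORD    0x10
    Start    REG_DWORD    0x2
    ErrorControl    REG_DWORD    0x1
    DisplayName    REG_SZ    HPC Node Manager Service
    DependOnService    REG_MULTI_SZ    rpcss
    ObjectName    REG_SZ    LocalSystem
"""

# Standard meanings of the service "Start" value.
START_TYPES = {0: "boot", 1: "system", 2: "automatic", 3: "manual", 4: "disabled"}

# Indented value line: name, then REG_* type, then data.
VALUE_LINE = re.compile(r"^\s+(?P<name>.+?)\s{2,}(?P<type>REG_[A-Z_]+)\s+(?P<data>.*)$")


def parse_reg_query(text: str) -> dict:
    """Map each value name to a (type, data) tuple; key header lines are skipped."""
    values = {}
    for line in text.splitlines():
        match = VALUE_LINE.match(line)
        if match:
            values[match["name"]] = (match["type"], match["data"].strip())
    return values


def start_type(values: dict) -> str:
    """Translate the hexadecimal Start value into a readable start type."""
    _reg_type, data = values["Start"]
    return START_TYPES.get(int(data, 16), "unknown")


if __name__ == "__main__":
    vals = parse_reg_query(SAMPLE)
    print("Runs as:", vals["ObjectName"][1])
    print("Start type:", start_type(vals))
```

For the values posted here, this reports an automatic-start service running as LocalSystem, which matches the poster's reading that the key itself looks fine and does not by itself explain the failed start.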
-
Tuesday, October 26, 2010 18:43
I've looked through some of the troubleshooting done so far. It appears your question is going to need more in-depth troubleshooting (looking into permissions, group policy/local security settings, and so forth), which falls into the paid support category. Please visit the link below to see the paid support options that are available: http://support.microsoft.com/default.aspx?id=fh;en-us;offerprophone
-
Sunday, November 7, 2010 02:36
I'm having the same trouble. 4 out of 5 nodes in a homogeneous cluster installed correctly, but the final one hangs during the "Starting Services" step of server tools installation.
Was this issue resolved?
-
Monday, November 8, 2010 16:16
We found two group policies that prevented the service from starting. We're still not quite sure why.
- Proposed as answer by Lukasz Tomczyk, Microsoft Employee, Wednesday, November 17, 2010 14:44
- Marked as answer by Don Pattee, Moderator, Wednesday, January 12, 2011 02:32
-
Monday, November 8, 2010 18:58
Which policies were they?
-
Monday, November 8, 2010 19:08
Rather messy, large policies. Nothing in them obviously caused the problem. Contact me via email if you want the full policy export file. elansey@gmail.com
-
Monday, November 8, 2010 19:11
Ah, thanks. Not sure that I'm skilled enough to parse them anyhow. I see that this R2 system was an upgrade rather than a clean install, so it may have inherited some old conflicting policies. Will wipe and try again.
-
Wednesday, January 12, 2011 06:01
Just to update: I wiped the drive, installed R2, and everything came up roses. Must have been some grandfathered policy, as elansey describes.