SchedulerException: Could not register with the server. Try again later.

  • Question

  • Hi all , 

    Sometimes I get the following exception while trying to connect to the WinHPC scheduler:

     

     

    ServerMicrosoft.Hpc.Scheduler.Properties.SchedulerException: Could not register with the server. Try again later.

    at Microsoft.Hpc.Scheduler.Store.StoreServer._Connect()

    at Microsoft.Hpc.Scheduler.Store.StoreServer.Connect(String server, Int32 port)

    at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc..ctor(String server, Int32 port)

    at Microsoft.Hpc.Scheduler.Store.SchedulerStore.Connect(String server)

    at Microsoft.Hpc.Scheduler.Scheduler.Connect(String cluster) 

     

    I wonder what can cause this exception and how I can avoid it (I'm using WinHPC R2 SP1).

     

    Thanks in advance , 

    Shai.

     

    Sunday, March 27, 2011 2:40 PM

All replies

  • Could you check whether (1) the connectivity between the client and the scheduler is good, and (2) the HpcScheduler service is running correctly?
    Monday, March 28, 2011 4:45 AM
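
One way to check the first point from the client machine is a plain TCP probe against the head node. The following is only a sketch: the class and method names are made up for illustration, and the host name and port you pass in are your own assumptions about the cluster (check which port the scheduler listens on in your deployment).

```csharp
using System;
using System.Net.Sockets;

public static class ConnectivityProbe
{
    // Returns true if a TCP connection to host:port can be opened within timeoutMs.
    // This only shows that the port is reachable, not that the scheduler is healthy.
    public static bool CanConnect(string host, int port, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            try
            {
                IAsyncResult pending = client.BeginConnect(host, port, null, null);
                if (!pending.AsyncWaitHandle.WaitOne(timeoutMs))
                    return false; // timed out; disposing the client abandons the attempt
                client.EndConnect(pending); // throws SocketException if the connect failed
                return true;
            }
            catch (SocketException)
            {
                return false; // refused, unreachable, or name resolution failed
            }
        }
    }
}
```

For example, `ConnectivityProbe.CanConnect("myheadnode", 5800, 2000)` would probe a hypothetical head node; 5800 is often listed as the scheduler's port in HPC Pack firewall rules, but treat that number as an assumption and verify it for your installation.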
  • How can I check that the HpcScheduler service is running correctly?

    Thanks in advance , 

    Shai.

    Tuesday, March 29, 2011 12:13 PM
  • Are there any developments in finding a cause for this error? I'm asking because I am currently facing the same issue and would appreciate your insight. In my case I have a script with a loop that calls services that use HPC and then keeps monitoring the state of the tasks it has launched, and sometimes this error occurs...
    Thursday, May 5, 2011 9:47 AM
  • How can I check that the HpcScheduler service is running correctly?

    Thanks in advance , 

    Shai.


    You can open the Services console on your server and check whether the HPC Job Scheduler Service is running. If it is not, start it.

    However, if the service is already running, you can try restarting it.

     Please let me know if this resolves your problem.

    Thanks,

    Sridutt


    Friday, May 6, 2011 6:06 AM
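
The same check can be done from code instead of the Services console. This is a minimal sketch, assuming the service is registered under the name `HpcScheduler` (the name used earlier in this thread), that the project references the System.ServiceProcess assembly, and that the caller has permission to query services on the head node. The class and method names are made up for illustration.

```csharp
using System;
using System.ServiceProcess;

public static class SchedulerServiceCheck
{
    // Returns true if the named Windows service reports the Running state on machineName.
    // Pass "." as machineName to query the local machine.
    public static bool IsRunning(string serviceName, string machineName)
    {
        using (var controller = new ServiceController(serviceName, machineName))
        {
            try
            {
                return controller.Status == ServiceControllerStatus.Running;
            }
            catch (InvalidOperationException)
            {
                return false; // the service was not found on that machine
            }
        }
    }
}
```

A call such as `SchedulerServiceCheck.IsRunning("HpcScheduler", "myheadnode")` (with a hypothetical head node name) tells you only that the service is up, which, as later replies note, does not rule out the registration error.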
  • Are there any developments in finding a cause for this error? I'm asking because I am currently facing the same issue and would appreciate your insight. In my case I have a script with a loop that calls services that use HPC and then keeps monitoring the state of the tasks it has launched, and sometimes this error occurs...


    Try checking whether the HPC Job Scheduler Service is running. If connecting to the node itself is the problem, the HPC Node Manager Service may not be running.

    Please let me know if this resolves your problem.

    Thanks,

    Sridutt


    Friday, May 6, 2011 6:08 AM
  • Hi sridutt bhalachandra. Well, the HPC Job Scheduler Service is in fact running... From the tests and troubleshooting I've done, I don't believe the cause of this issue is the HPC Job Scheduler Service being down. I tried stopping the service and running my test script, and I got a different error (which leads me to conclude that the service being down isn't the cause of the error mentioned above). The exception thrown when the service is stopped is:

    Message: Unexpected exception occurred. ExceptionId: ed1105c4-9f87-4212-815e-a1d8fe2e067a. Exception details:
    System.IO.IOException: The write operation failed, see inner exception. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
    at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
    at System.Runtime.Remoting.Channels.SocketStream.Write(Byte[] buffer, Int32 offset, Int32 count)
    at System.Net.Security.NegotiateStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
    at System.Net.Security.NegotiateStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
    --- End of inner exception stack trace ---
    Server stack trace:
    at System.Net.Security.NegotiateStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
    at System.Net.Security.NegotiateStream.Write(Byte[] buffer, Int32 offset, Int32 count)
    at System.Runtime.Remoting.Channels.ChunkedMemoryStream.WriteTo(Stream stream)
    at System.Runtime.Remoting.Channels.Tcp.TcpClientSocketHandler.GetRequestStream(IMessage msg, Int32 contentLength, ITransportHeaders headers)
    at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.SendRequestWithRetry(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream)
    at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream)
    at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)
    Exception rethrown at [0]:
    at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
    at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
    at Microsoft.Hpc.Scheduler.Store.ISchedulerStoreInternal.Register(String clientSource, String userName, ConnectionRole role, Version clientVersion, ConnectionToken& token, UserPrivilege& privilege, Version& serverVersion, Dictionary`2& serverProps)
    at Microsoft.Hpc.Scheduler.Store.StoreServer.RegisterWithServer()
    at Microsoft.Hpc.Scheduler.Store.StoreServer._Connect()
    at Microsoft.Hpc.Scheduler.Store.StoreServer.Connect(String server, Int32 port)
    at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc..ctor(String server, Int32 port)
    at Microsoft.Hpc.Scheduler.Scheduler.Connect(String cluster)

    [...]

    Also, the "ServerMicrosoft.Hpc.Scheduler.Properties.SchedulerException: Could not register with the server. Try again later." error doesn't occur all the time, and I couldn't find a pattern or a set of reproducible steps... It comes and goes without anyone doing anything (not even a reboot of the server); it simply starts working again... This is why my client is asking for an explanation of these strange errors, and so far I haven't been able to come up with an answer to give him...

    • Edited by marconsilva Wednesday, May 11, 2011 3:47 PM Coment formating was lost on submit (for whatever reason...)
    Wednesday, May 11, 2011 3:42 PM
  • Hi all , 

    We see the same behavior on our clusters. I deal with it by wrapping each scheduler call in the following retry block:

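        while (true)
        {
            try
            {
                // Any scheduler call can go here; SubmitJobById is the one being retried.
                clusterScheduler.SubmitJobById(jobId, userName, null);
                break; // success, leave the retry loop
            }
            catch (Exception)
            {
                // schedulerRetries holds the number of attempts left (set before the loop).
                --schedulerRetries;
                if (schedulerRetries <= 0)
                    throw; // rethrow without resetting the original stack trace
                Thread.Sleep(100); // brief pause before the next attempt
            }
        }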
    
    

    Regards , 

    Shai.

     

     

     

    Wednesday, May 18, 2011 3:02 PM
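
If many scheduler calls need this treatment, the retry block above could be factored into a small helper. The sketch below is one possible shape, not an HPC Pack API: the `Retry` class, its parameters, and the doubling delay are all illustrative choices. It retries only the exception type you name, so unrelated failures surface immediately.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs action, retrying when it throws TException, up to maxAttempts attempts in total.
    // The pause starts at initialDelayMs and doubles after each failed attempt.
    public static void Run<TException>(Action action, int maxAttempts, int initialDelayMs)
        where TException : Exception
    {
        if (action == null) throw new ArgumentNullException("action");
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return; // success
            }
            catch (TException)
            {
                if (attempt >= maxAttempts)
                    throw; // out of attempts: rethrow with the original stack trace
                Thread.Sleep(delayMs);
                delayMs *= 2; // back off so a busy server gets more breathing room
            }
        }
    }
}
```

Assuming a `scheduler` object and a `headNode` string already exist in your code, a call could look like `Retry.Run<SchedulerException>(() => scheduler.Connect(headNode), 5, 100);`, which would wait roughly 100, 200, 400 and 800 ms between the five attempts.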
  • I see this same behavior as well on SP1 with all the latest Windows updates. I haven't tried the SP2 beta to see whether this is resolved; I will try the brute-force method Shai described. I noticed that the code works fine when run on the head node (where the scheduler resides); if I run it on a different host, the problem becomes more frequent.

    I also noticed that the job command always works from remote hosts when this error occurs. The scheduler is running fine, so it seems to be an issue with using the API.

    Monday, May 23, 2011 5:27 PM