The scheduler was unable to start its socket listener.

    Question

  • Last night our HPC Scheduler became unresponsive. Connecting to the head node to access Job Manager / Cluster Manager failed with this error: No connection could be made because the target machine actively refused it xx.x.xx.xxx:5800

    We think it was inundated with so many jobs at once that it crashed. A user had loaded 600+ single-core jobs on top of other users' jobs. The error log is so full of errors that I can't get back to the initial one, but these repeat throughout the HPC Scheduler Operational log:

    Event ID 8:  An unexpected exception occurred. For more information about this exception, see the Details tab. 
     Additional data:
     Exception on starting server over IPv4: System.Net.Sockets.SocketException (0x80004005): Only one usage of each socket address (protocol/network address/port) is normally permitted
       at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
       at System.Net.Sockets.Socket.Bind(EndPoint localEP)
       at System.Net.Sockets.TcpListener.Start(Int32 backlog)
       at System.Runtime.Remoting.Channels.ExclusiveTcpListener.Start(Boolean exclusiveAddressUse)
       at System.Runtime.Remoting.Channels.Tcp.TcpServerChannel.StartListening(Object data)
       at System.Runtime.Remoting.Channels.Tcp.TcpServerChannel.SetupChannel()
       at System.Runtime.Remoting.Channels.Tcp.TcpServerChannel..ctor(IDictionary properties, IServerChannelSinkProvider sinkProvider, IAuthorizeRemotingConnection authorizeCallback)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.StartRemoting(Int32 storePort, Int32 eventPort)

    Event ID 26:  The scheduler was unable to start its socket listener.

    Event ID 8:  An unexpected exception occurred. For more information about this exception, see the Details tab. 

    Additional data:
     Scheduler can not connect to itself. It will retry. An unexpected exception occurred. For more information about this exception, see the Details tab. 

    Event ID 6:  [Store]  Failed to accept Tcpclient: System.Net.Sockets.SocketException (0x80004005): A blocking operation was interrupted by a call to WSACancelBlockingCall
       at System.Net.Sockets.Socket.Accept()
       at System.Net.Sockets.TcpListener.AcceptTcpClient()
       at Microsoft.Hpc.Scheduler.Store.RemoteEventAdvisor._ListenThread(Object ipRange) 

    The Scheduler service showed 'running' this morning, but it was unresponsive and would not restart. We ended up rebooting the server, and now everything is running fine. A few jobs failed, but several that were in the queue still completed even though we couldn't get into Job Manager to see them.

    So questions:

    1)  Is there a limit to the number of jobs the scheduler can maintain or receive at once? Is there anything we can do via config / registry settings to let the Scheduler handle many incoming job requests at once?

    Thanks.


    Thursday, August 23, 2018 4:22 PM

All replies

  • 1. For the crash, check the scheduler logs under %CCP_DATA%LogFiles\HPCScheduler\*.bin; you can use the "logparser" tool to convert them to plain text files.

    2. Currently, you can write a submission filter that rejects a job submission once the user's quota is reached. This is also something we are looking into for a future release.

    https://docs.microsoft.com/en-us/powershell/high-performance-computing/understanding-activation-and-submission-filters?view=hpc16-ps
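
    To illustrate point 2, here is a minimal sketch of a quota-style submission filter, written in Python purely for illustration. It assumes the usual HPC Pack filter contract (the scheduler passes the path of the job XML file as the first argument; exit code 0 accepts the job, 1 accepts it with changes, anything else rejects it). The attribute names UserName and Owner, the limit of 100, and the count_active_jobs helper are all assumptions, not part of any HPC Pack API; check a real job XML file and replace the helper with an actual query against your scheduler.

```python
import sys
import xml.etree.ElementTree as ET

ACCEPT = 0  # job is submitted unchanged
REJECT = 2  # any exit code other than 0 or 1 rejects the job

MAX_ACTIVE_JOBS_PER_USER = 100  # hypothetical quota, pick your own


def job_owner(root):
    """Return the submitting user from the job XML root element, or None.

    The attribute names are assumptions; confirm them against a job XML
    file produced by your own cluster.
    """
    return root.get("UserName") or root.get("Owner")


def decide(root, count_active_jobs, limit=MAX_ACTIVE_JOBS_PER_USER):
    """Return ACCEPT or REJECT for one job, given a per-user job counter."""
    owner = job_owner(root)
    if owner is None:
        return ACCEPT  # no owner recorded, nothing to enforce against
    if count_active_jobs(owner) >= limit:
        return REJECT  # user already has too many queued/running jobs
    return ACCEPT


def count_active_jobs(user):
    """Hypothetical placeholder: return the user's queued + running job count.

    Returns 0 here, so the sketch accepts every job until this is replaced
    with a real query against the scheduler.
    """
    return 0


def main(job_xml_path):
    root = ET.parse(job_xml_path).getroot()
    return decide(root, count_active_jobs)


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

    The decision logic is kept separate from the scheduler query so it can be tested without a cluster. With a filter like this, a burst of 600+ submissions from one user would be rejected at submission time once the limit is hit, instead of all landing in the queue.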


    Qiufang Shi

    Wednesday, August 29, 2018 2:59 AM