Job scheduler fails - SQL connection pool errors reported (HPC Pack 2008 R2 SP3, SQL Server 2008)

  • Question

  • Hi,

    Our HPC cluster (HPC Pack 2008 R2 SP3) had been running for several weeks with no issues. One morning, all jobs started failing. In the event logs we saw many errors related to SQL Server connection pooling (example below). Restarting the head node resolved the issue. The jobs running at the time were neither unusually large nor unusually numerous.

    Has anyone else experienced this issue? Are there ways to prevent this from occurring?

    Thanks,

    Sanjeev

    <Provider Name="Microsoft-HPC-Scheduler" Guid="{5B169E40-A3C7-4419-A919-87CD93F2964D}" /> 
    <Channel>Microsoft-HPC-Scheduler/Operational</Channel>
      <EventData>
        <Data Name="Message">Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.</Data>
        <Data Name="ExceptionString">Exception detail: System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
          at System.Data.ProviderBase.DbConnectionFactory.GetConnection(DbConnection owningConnection)
          at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
          at System.Data.SqlClient.SqlConnection.Open()
          at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.GetDatabaseConnection()
          at Microsoft.Hpc.Scheduler.Store.DatabaseConnection..ctor(SchedulerStoreInternal store)
          at Microsoft.Hpc.Scheduler.Store.MultiTableQuery.ExecuteSingleRowRead(Int32 id)
          at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.GetObjectProps(QueryContextBase ctx, Int32 itemId, PropertyId[] ids)
          at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.Object_GetProps(ConnectionToken& token, ObjectType obType, Int32 obId, PropertyId[] ids, StoreProperty[]& props)
        Current stack:
          at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
          at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.Object_GetProps(ConnectionToken& token, ObjectType obType, Int32 obId, PropertyId[] ids, StoreProperty[]& props)
          at Microsoft.Hpc.Scheduler.Store.StoreServer.Object_GetProps(ObjectType obType, Int32 obId, PropertyId[] ids, StoreProperty[]& props)
          at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.GetPropsFromServer(ObjectType obType, Int32 itemId, PropertyId[] propertyIds)
          at Microsoft.Hpc.Scheduler.PerfCounter.PerfCountersModifier.Sync(Object timerState)
          at Microsoft.Hpc.SerializedTimer.Tick(Object state)
          at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
          at System.Threading._TimerCallback.PerformTimerCallback(Object state)</Data>

    • Edited by SanjeevB Thursday, May 2, 2013 2:44 PM fixed typo
    Wednesday, May 1, 2013 6:30 PM

All replies

  • If this situation happens again, run this command:

    netstat -anb -p tcp | find "TIME_WAIT"

    It may be that, for some reason, all available sockets between your head node and the SQL Server are stuck in the TIME_WAIT state. You can also run this command with the /c switch of find to count the sockets in TIME_WAIT, in a loop with an interval of a few seconds; then, when the jobs start failing, you will at least have a history of the socket states.
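    As a rough sketch of the counting loop described above (the function names, polling interval, and sample count are my own, not from this thread), the check can be scripted so a history is collected automatically:

    ```python
    import subprocess
    import time

    def count_time_wait(netstat_output: str) -> int:
        """Count the lines of netstat output that report a TIME_WAIT socket."""
        return sum(1 for line in netstat_output.splitlines()
                   if "TIME_WAIT" in line)

    def poll(interval_seconds: int = 5, samples: int = 12) -> None:
        """Print a timestamped TIME_WAIT count every few seconds.

        Runs the same listing the reply suggests (netstat -an -p tcp on
        Windows) and counts TIME_WAIT entries, one line per sample.
        """
        for _ in range(samples):
            out = subprocess.run(
                ["netstat", "-an", "-p", "tcp"],
                capture_output=True, text=True
            ).stdout
            print(time.strftime("%H:%M:%S"), count_time_wait(out))
            time.sleep(interval_seconds)
    ```

    Redirect the output to a file and you have the socket-state history to compare against the time the jobs started failing.
    
    
    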

    I periodically clean up my SQL Server by removing old job records from the scheduler database. If I don't do that, the SQL Server's responsiveness gradually gets worse.
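    If manual cleanup is a recurring chore, HPC Pack also exposes a retention setting: as far as I recall, the TtlCompletedJobs cluster parameter controls how many days completed job records are kept before the scheduler purges them automatically. Verify the parameter name on your version with cluscfg listparams before relying on this:

    ```shell
    rem Inspect the current cluster parameters, then shorten the retention
    rem so completed job records are purged after 5 days instead of
    rem accumulating in the scheduler database. Run on the head node.
    cluscfg listparams
    cluscfg setparams TtlCompletedJobs=5
    ```
    
    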


    Daniel Drypczewski

    Wednesday, May 8, 2013 1:47 AM