HPC scheduler receiving timeouts on Refresh() calls

  • Question

  • Hi all,

    We are having an intermittent issue with the HPC scheduler whereby a Refresh() call is timing out with the error:

    Microsoft.Hpc.Scheduler.Properties.SchedulerException: Database exception:Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

    When this error occurs, the compute nodes are running flat out at 100%, but it only happens after several hours. Nothing else is running on the grid, and the interface is single-threaded and does little more than make Refresh() calls periodically.

    Here are the relevant log details:

    The event log shows:

    Log Name: Windows HPC Server

    Source: Microsoft-Windows-HPCServer

    Date: 7/27/2010 8:13:33 AM

    Event ID: 24

    Task Category: None

    Level: Error

    Keywords:

    User: SYSTEM

    Description:

    The scheduler got a SQL exception.

    Event Xml:

    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">

    <System>

    <Provider Name="Microsoft-Windows-HPCServer" Guid="{5b169e40-a3c7-4419-a919-87cd93f2964d}" />

    <EventID>24</EventID>

    <Version>0</Version>

    <Level>2</Level>

    <Task>0</Task>

    <Opcode>0</Opcode>

    <Keywords>0x2000000000000000</Keywords>

    <TimeCreated SystemTime="2010-07-27T02:43:33.922Z" />

    <EventRecordID>17170</EventRecordID>

    <Correlation />

    <Execution ProcessID="2632" ThreadID="2916" />

    <Channel>Windows HPC Server</Channel>

    <Computer>sts091.nousblr-odc.local</Computer>

    <Security UserID="S-1-5-18" />

    </System>

    <EventData>

    <Data Name="SQLQuery">SELECT Tasks_Main2.ID

    ,Jobs.UnitType

    ,Tasks_Main2.InstanceValue

    ,Tasks_Settings2.CommandLine

    ,Tasks_Settings2.Runtime

    ,Tasks_Settings2.MinCores

    ,Tasks_Settings2.MaxCores

    ,Tasks_Settings2.MinNodes

    ,Tasks_Settings2.MaxNodes

    ,Tasks_Settings2.MinSockets

    ,Tasks_Settings2.MaxSockets

    ,Tasks_Settings2.IsRerunnable

    ,Tasks_Main2.RequeueCount

    ,Tasks_Settings2.DependsOnTasks

    ,Tasks_Settings2.RequiredNodes

    ,Tasks_Main2.ParentJobID

    ,Tasks_Settings2.IsExclusive

    ,Tasks_Settings2.NiceId

    ,Tasks_Main2.State

    ,Tasks_Main2.InstanceId

    ,ParametricTaskCounters.Canceled

    ,ParametricTaskCounters.Failed

    ,ParametricTaskCounters.Running

    ,ParametricTaskCounters.Queued

    ,Jobs.State

    ,Tasks_Settings2.Name

    ,Tasks_Settings2.GroupId

    ,Tasks_Settings2.IsParametric

    ,Tasks_Settings2.StartValue

    ,Tasks_Settings2.EndValue

    ,Tasks_Settings2.IncrementValue

     

    FROM Jobs

    INNER JOIN Tasks_Main2 ON Tasks_Main2.ParentJobID=Jobs.ID

    INNER JOIN Tasks_Settings2 ON Tasks_Settings2.RecordId=Tasks_Main2.RecordId

    INNER JOIN ParametricTaskCounters ON ParametricTaskCounters.RecordId=Tasks_Settings2.RecordId

     

    WHERE Tasks_Main2.InstanceId&lt;=@param0 AND Tasks_Main2.State=@param1 AND Jobs.State&gt;=@param2 AND Jobs.State&lt;=@param3 ORDER BY Tasks_Main2.ParentJobID ASC

    -- 570375073

    </Data>

    <Data Name="Exception">System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

    at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)

    at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)

    at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)

    at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()

    at System.Data.SqlClient.SqlDataReader.get_MetaData()

    at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)

    at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)

    at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)

    at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)

    at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)

    at System.Data.SqlClient.SqlCommand.ExecuteReader()

    at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteReader()</Data>

    </EventData>

    </Event>
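
    In case it helps anyone reading the event above, here is a minimal Python sketch that pulls the main fields out of an exported event. It assumes the XML has been saved as text; the `summarize_event` helper and the trimmed `SAMPLE` string are illustrative names and data, not part of any HPC tooling.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute on the <Event> element.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed-down copy of the event shown above (illustrative only).
SAMPLE = """
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>24</EventID>
    <TimeCreated SystemTime="2010-07-27T02:43:33.922Z" />
    <Computer>sts091.nousblr-odc.local</Computer>
  </System>
  <EventData>
    <Data Name="SQLQuery">SELECT Tasks_Main2.ID FROM Jobs</Data>
    <Data Name="Exception">System.Data.SqlClient.SqlException: Timeout expired.
    at System.Data.SqlClient.SqlCommand.ExecuteReader()</Data>
  </EventData>
</Event>
"""


def summarize_event(xml_text: str) -> dict:
    """Return the event ID, UTC timestamp, computer, query and first exception line."""
    root = ET.fromstring(xml_text.strip())
    system = root.find("e:System", NS)
    # Collect every <Data Name="..."> element into a name -> text mapping.
    data = {
        d.get("Name"): (d.text or "").strip()
        for d in root.findall("e:EventData/e:Data", NS)
    }
    exception_lines = data.get("Exception", "").splitlines()
    return {
        "event_id": int(system.findtext("e:EventID", namespaces=NS)),
        "time_utc": system.find("e:TimeCreated", NS).get("SystemTime"),
        "computer": system.findtext("e:Computer", namespaces=NS),
        "query": data.get("SQLQuery", ""),
        "exception": exception_lines[0] if exception_lines else "",
    }


if __name__ == "__main__":
    for key, value in summarize_event(SAMPLE).items():
        print(f"{key}: {value}")
```

    Keeping only the first line of the exception makes it easier to group repeated timeouts without comparing whole stack traces.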

    The SQL logs show the following:

    'SELECT Tasks_Main2.ParentJobID<nl/><c/>Tasks_Main2.ID<nl/><c/>Tasks_Main2.RequestCancel<nl/><c/>Tasks_Main2.State<nl/><c/>Tasks_Main2.InstanceId<nl/><c/>ParametricTaskCounters.Canceled<nl/><c/>ParametricTaskCounters.Failed<nl/><c/>ParametricTaskCounters.Running<nl/><c/>ParametricTaskCounters.Queued<nl/><nl/>FROM Jobs<nl/>INNER JOIN Tasks_Main2 ON Tasks_Main2.ParentJobID=Jobs.ID<nl/>INNER JOIN Tasks_Settings2 ON Tasks_Settings2.RecordId=Tasks_Main2.RecordId<nl/>INNER JOIN ParametricTaskCounters ON ParametricTaskCounters.RecordId=Tasks_Settings2.RecordId<nl/><nl/>WHERE Tasks_Main2.InstanceId>=@param0 AND Tasks_Main2.RequestCancel<>@param1 AND Tasks_Main2.State<>@param2 AND Jobs.State=@param3<nl/>-- 761869105<nl/>'<c/> 'System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.<nl/> at System.Data.SqlClient.SqlConnection.OnError(SqlException exception<c/> Boolean breakConnection)<nl/> at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)<nl/> at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior<c/> SqlCommand cmdHandler<c/> SqlDataReader dataStream<c/> BulkCopySimpleResultSet bulkCopyHandler<c/> TdsParserStateObject stateObj)<nl/> at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()<nl/> at System.Data.SqlClient.SqlDataReader.get_MetaData()<nl/> at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds<c/> RunBehavior runBehavior<c/> String resetOptionsString)<nl/> at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior<c/> RunBehavior runBehavior<c/> Boolean returnStream<c/> Boolean async)<nl/> at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior<c/> RunBehavior runBehavior<c/> Boolean returnStream<c/> String method<c/> DbAsyncResult result)<nl/> at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior<c/> RunBehavior 
runBehavior<c/> Boolean returnStream<c/> String method)<nl/> at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior<c/> String method)<nl/> at System.Data.SqlClient.SqlCommand.ExecuteReader()<nl/> at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteReader()',(0),24,,sts091.nousblr-odc.local
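
    The `<nl/>` and `<c/>` markers in the entry above appear to stand for line breaks and commas in the exported log. Under that assumption, a small Python sketch like the following makes the entry readable again (`decode_log_field` is a hypothetical helper name):

```python
def decode_log_field(raw: str) -> str:
    """Expand the placeholders used in the exported log.

    Assumption: <nl/> marks a line break and <c/> marks a comma.
    """
    return raw.replace("<nl/>", "\n").replace("<c/>", ",")


if __name__ == "__main__":
    # Short excerpt of the query from the log entry above.
    sample = (
        "SELECT Tasks_Main2.ParentJobID<nl/><c/>Tasks_Main2.ID<nl/>"
        "<nl/>FROM Jobs"
    )
    print(decode_log_field(sample))
```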

    Wednesday, July 28, 2010 8:32 AM

Answers

All replies