HPC Job Scheduler crashing -- seeing SQL exceptions from deadlocking SQL statements

  • Question


  • The HPC Job Scheduler service has been crashing on us.  There are errors in the COMPUTECLUSTER SQL instance indicating that operations are deadlocking and one is being killed as the deadlock victim.  The problem has been occurring more and more often and is now happening multiple times a day.  The service may run stably for 10+ hours at a time, or it may be unstable.  The number of running tasks does not seem to be a factor.

    Is there a patch or remedy for this?

    Below are examples of errors from the logs.

    -Scott

    Windows Event:
    #1
    SQLQuery -- SchedulerStore_TaskAllocatedNodes 

    Exception -- System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
       at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
       at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
       at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
       at System.Data.SqlClient.SqlDataReader.SetMetaData(_SqlMetaDataSet metaData, Boolean moreInfo)
       at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
       at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()
       at System.Data.SqlClient.SqlDataReader.get_MetaData()
       at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
       at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
       at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
       at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
       at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
       at System.Data.SqlClient.SqlCommand.ExecuteReader()
       at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteReader()

    #2
    Message -- Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

    ExceptionString -- System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
       at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
       at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
       at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
       at System.Data.SqlClient.SqlDataReader.ReadInternal(Boolean setTimeout)
       at Microsoft.Hpc.Scheduler.Store.SqlUtils.ReadFromReader(SqlDataReader& reader, StoreSqlCommand cmd)

    MS SQL error log:
    #1
    2009-05-20 04:22:02.63 spid4s      Deadlock encountered .... Printing deadlock information
    2009-05-20 04:22:02.63 spid4s      Wait-for graph
    2009-05-20 04:22:02.63 spid4s      
    2009-05-20 04:22:02.63 spid4s      Node:1

    2009-05-20 04:22:02.63 spid4s      KEY: 5:72057594041073664 (0200ab9e0158) CleanCnt:3 Mode:X Flags: 0x0
    2009-05-20 04:22:02.63 spid4s       Grant List 3:
    2009-05-20 04:22:02.63 spid4s         Owner:0x940F6300 Mode: X        Flg:0x0 Ref:0 Life:02000000 SPID:60 ECID:0 XactLockInfo: 0x8006479C
    2009-05-20 04:22:02.63 spid4s         SPID: 60 ECID: 0 Statement Type: UPDATE Line #: 26
    2009-05-20 04:22:02.63 spid4s         Input Buf: Language Event: UPDATE JobTaskCounters SET 
    Running=Running+1
    ,Dispatching=Dispatching-1
    WHERE JobID=51331

    UPDATE TaskGroupNames SET 
    Running=Running+1
    ,Dispatching=Dispatching-1
    WHERE Id=64229

    UPDATE TaskGroupNames SET 
    SumMaxCores=SumMaxCores+1
    ,SumMaxNod
    2009-05-20 04:22:02.63 spid4s       Requested By: 
    2009-05-20 04:22:02.63 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x86825B20 Mode: S SPID:57 BatchID:0 ECID:0 TaskProxy:(0x2796C378) Value:0x6300e5a0 Cost:(0/0)
    2009-05-20 04:22:02.63 spid4s      
    2009-05-20 04:22:02.63 spid4s      Node:2

    2009-05-20 04:22:02.63 spid4s      KEY: 5:72057594038583296 (010086470766) CleanCnt:2 Mode:X Flags: 0x0
    2009-05-20 04:22:02.63 spid4s       Grant List 0:
    2009-05-20 04:22:02.63 spid4s         Owner:0xA6BD72A0 Mode: X        Flg:0x0 Ref:0 Life:02000000 SPID:57 ECID:0 XactLockInfo: 0x86825B44
    2009-05-20 04:22:02.63 spid4s         SPID: 57 ECID: 0 Statement Type: UPDATE Line #: 6
    2009-05-20 04:22:02.63 spid4s         Input Buf: RPC Event: Proc [Database Id = 5 Object Id = 1022626686]
    2009-05-20 04:22:02.63 spid4s       Requested By: 
    2009-05-20 04:22:02.63 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x80064778 Mode: U SPID:60 BatchID:0 ECID:0 TaskProxy:(0xBD548378) Value:0x8fe5f200 Cost:(0/1816)
    2009-05-20 04:22:02.63 spid4s      
    2009-05-20 04:22:02.63 spid4s      Victim Resource Owner:
    2009-05-20 04:22:02.63 spid4s       ResType:LockOwner Stype:'OR'Xdes:0x86825B20 Mode: S SPID:57 BatchID:0 ECID:0 TaskProxy:(0x2796C378) Value:0x6300e5a0 Cost:(0/0)

    #2

    2009-05-20 04:42:40.27 spid4s      Deadlock encountered .... Printing deadlock information
    2009-05-20 04:42:40.27 spid4s      Wait-for graph
    2009-05-20 04:42:40.27 spid4s      
    2009-05-20 04:42:40.27 spid4s      Node:1

    2009-05-20 04:42:40.27 spid4s      KEY: 5:72057594049200128 (8600fc0e3589) CleanCnt:5 Mode:U Flags: 0x0
    2009-05-20 04:42:40.27 spid4s       Wait List:
    2009-05-20 04:42:40.27 spid4s         Owner:0x877A6C80 Mode: S        Flg:0x2 Ref:1 Life:00000000 SPID:66 ECID:0 XactLockInfo: 0x6B730474
    2009-05-20 04:42:40.27 spid4s         SPID: 66 ECID: 0 Statement Type: SELECT Line #: 16
    2009-05-20 04:42:40.27 spid4s         Input Buf: Language Event: (@_jobId int,@_taskId int,@_startTime datetime,@_changeTime datetime,@Param0 int)
    exec OpenTaskAllocationEntry @_jobId, @_taskId, @_startTime, @_changeTime, @Param0;



    2009-05-20 04:42:40.27 spid4s       Requested By: 
    2009-05-20 04:42:40.27 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x2CBE63C8 Mode: S SPID:94 BatchID:0 ECID:0 TaskProxy:(0x3EB86378) Value:0x94218cc0 Cost:(0/0)
    2009-05-20 04:42:40.27 spid4s      
    2009-05-20 04:42:40.27 spid4s      Node:2

    2009-05-20 04:42:40.27 spid4s      KEY: 5:72057594049200128 (8600fc0e3589) CleanCnt:5 Mode:U Flags: 0x0
    2009-05-20 04:42:40.27 spid4s       Convert List:
    2009-05-20 04:42:40.27 spid4s         Owner:0x5A23D5E0 Mode: X        Flg:0x6 Ref:1 Life:02000000 SPID:60 ECID:0 XactLockInfo: 0x2CBE6794
    2009-05-20 04:42:40.27 spid4s         SPID: 60 ECID: 0 Statement Type: UPDATE Line #: 10
    2009-05-20 04:42:40.27 spid4s         Input Buf: RPC Event: Proc [Database Id = 5 Object Id = 1938105945]
    2009-05-20 04:42:40.27 spid4s       Requested By: 
    2009-05-20 04:42:40.27 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x6B730450 Mode: S SPID:66 BatchID:0 ECID:0 TaskProxy:(0x5ED8A378) Value:0x877a6c80 Cost:(0/1388)
    2009-05-20 04:42:40.27 spid4s      
    2009-05-20 04:42:40.27 spid4s      Node:3

    2009-05-20 04:42:40.27 spid4s      KEY: 5:72057594049200128 (8600fc0e3589) CleanCnt:5 Mode:U Flags: 0x0
    2009-05-20 04:42:40.27 spid4s       Grant List 2:
    2009-05-20 04:42:40.27 spid4s         Owner:0x5508D180 Mode: S        Flg:0x0 Ref:1 Life:00000000 SPID:89 ECID:0 XactLockInfo: 0x238B559C
    2009-05-20 04:42:40.27 spid4s         SPID: 89 ECID: 0 Statement Type: SELECT Line #: 1
    2009-05-20 04:42:40.27 spid4s         Input Buf: Language Event: SELECT Jobs.ID
    ,Jobs.Name
    ,Schd_Owners.Name
    ,Jobs.Username
    ,Jobs.Priority
    ,Schd_ProjectNames.Name
    ,Jobs.Runtime
    ,Jobs.SubmitTime
    ,Jobs.CreateTime
    ,Jobs.EndTime
    ,Jobs.StartTime
    ,Jobs.ChangeTime
    ,Jobs.State
    ,Jobs.PrevState
    ,Jobs.MinCores
    ,Job
    2009-05-20 04:42:40.27 spid4s       Requested By: 
    2009-05-20 04:42:40.27 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x2CBE6770 Mode: X SPID:60 BatchID:0 ECID:0 TaskProxy:(0x2B4DE378) Value:0x5a23d5e0 Cost:(0/0)
    2009-05-20 04:42:40.27 spid4s      
    2009-05-20 04:42:40.27 spid4s      Node:4

    2009-05-20 04:42:40.27 spid4s      KEY: 5:72057594041073664 (5000d398ec59) CleanCnt:2 Mode:X Flags: 0x0
    2009-05-20 04:42:40.27 spid4s       Grant List 3:
    2009-05-20 04:42:40.27 spid4s         Owner:0xB8D503A0 Mode: X        Flg:0x0 Ref:0 Life:02000000 SPID:66 ECID:0 XactLockInfo: 0x6B730474
    2009-05-20 04:42:40.27 spid4s       Requested By: 
    2009-05-20 04:42:40.27 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x238B5578 Mode: S SPID:89 BatchID:0 ECID:0 TaskProxy:(0x999FE378) Value:0x8015d340 Cost:(0/0)
    2009-05-20 04:42:40.27 spid4s      
    2009-05-20 04:42:40.27 spid4s      Victim Resource Owner:
    2009-05-20 04:42:40.27 spid4s       ResType:LockOwner Stype:'OR'Xdes:0x238B5578 Mode: S SPID:89 BatchID:0 ECID:0 TaskProxy:(0x999FE378) Value:0x8015d340 Cost:(0/0)

    • Moved by Josh Barnard (Moderator) Friday, May 29, 2009 5:42 PM - This is related to problems with the job scheduler. (From: Windows HPC Server Deployment, Management, and Administration)
    Wednesday, May 20, 2009 9:02 PM

Answers

  • Hi Scott

    Thanks very much for helping out with the repro, it's much appreciated.

    Regarding your questions:
    C) No, you don't need to restart the services; we will pick up the change dynamically. Please remember to switch it back to 2 once you have got the repro, as we'll generate a lot of logs at level 4.

    D) We are doing significant work around backup for our service pack, and will have some better guidance in the next few weeks. The scheduler crashing during backup is interesting and we will take a look at that. If we did end up needing to quiesce services to guarantee a backup, would that be a significant issue for you?

    I am interested in general in your need for daily backups, as we imagine most customers do not treat the job scheduler data as that business critical - by default we purge old jobs from the system after 5 days, and consider the reporting DB to hold the important information for management reporting from that point on. We are aware of specific customers with judicial requirements to preserve all data, but thought this was pretty rare. It would be great to understand more about your specific requirements (perhaps offline?).

    E) SP1 will be out within the next three months. I can't give any timeframe for v3 publicly yet, I'm afraid.

    cheers
    jeff

     
    Friday, May 22, 2009 3:23 AM

All replies

  • Hi Mithlan

    Can I ask a couple of clarifying questions about the above:

    1) Does the scheduler service crash when these deadlocks occur (does the hpcscheduler service stop and restart itself), or are the symptoms that you get event log errors but things continue on OK?
    2) When you say these are getting more regular, how regular are we talking - how many times an hour would you say the deadlock errors are thrown?
    3) What sort of concurrent user load do you have on the system - how many users are monitoring things in the job/admin console, and how many are simultaneously submitting jobs?

    In general we would expect to see a few deadlocks under load, but these should always be benign: the system should retry the operations, nothing should ever crash, and no data should ever be lost or get into an inconsistent state. If any of these are happening, it's certainly unexpected and something we'd like to help root-cause with you.
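
    (As an aside for anyone reading along, here is a minimal T-SQL sketch of the retry-on-deadlock pattern. It is an illustration only, not the scheduler's actual code; the JobTaskCounters update and the JobID value are copied from the deadlock dump earlier in this thread purely as an example statement to retry.)

    -- Minimal retry-on-deadlock sketch (illustration only, not the scheduler's code).
    -- SQL Server raises error 1205 for the session chosen as the deadlock victim and
    -- rolls its work back, so the safe response is simply to run the work again.
    DECLARE @retriesLeft int;
    SET @retriesLeft = 3;                          -- small, bounded number of attempts

    WHILE @retriesLeft > 0
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;

            -- Example statement shape taken from the deadlock dump above.
            UPDATE JobTaskCounters
               SET Running = Running + 1,
                   Dispatching = Dispatching - 1
             WHERE JobID = 51331;

            COMMIT TRANSACTION;
            SET @retriesLeft = 0;                  -- success: stop looping
        END TRY
        BEGIN CATCH
            IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;

            IF ERROR_NUMBER() = 1205 AND @retriesLeft > 1
                SET @retriesLeft = @retriesLeft - 1;   -- deadlock victim: try again
            ELSE
            BEGIN
                SET @retriesLeft = 0;
                RAISERROR('Statement failed after deadlock retries.', 16, 1);
            END
        END CATCH
    END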

    There are a few general things you can do to mitigate the chances of hitting deadlocks, mostly around database maintenance. How easy some of these are will depend on whether you have full SQL installed (as there most of this can be driven through the UI):
    1) It's worth setting up a database maintenance plan and rebuilding indexes periodically, perhaps once a week during a quiet period.
    2) You could consider moving the transaction log for the DB onto a separate physical disk from the HPC database itself, so that DB reads and writes don't contend so much.
    3) If you are expecting a reasonable rate of incoming jobs, it's worth presizing the database rather than letting it grow on demand (we install a 256MB DB and grow it by 10% as needed by default, which is not optimal in some scenarios), as this will help with fragmentation and unexpected stalls. A rough T-SQL sketch of 1) and 3) follows just below.
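
    (The sketch of 1) and 3) - a sketch only, assuming the scheduler database is named HPCScheduler and borrowing a table name from the deadlock dump; check the real database, table, and logical file names on your instance before running anything like this.)

    -- 1) Rebuild indexes on a heavily updated table during a quiet period.
    --    A maintenance plan can schedule this; the database and table names here
    --    are examples and may not match your installation.
    USE HPCScheduler;

    ALTER INDEX ALL ON dbo.JobTaskCounters REBUILD;
    EXEC sp_updatestats;   -- refresh statistics across the database as well

    -- 3) Presize the data and log files so they do not grow in small steps under load.
    --    The logical file names below are assumptions; list the real ones with:
    --    SELECT name, size, growth FROM sys.database_files;
    ALTER DATABASE HPCScheduler
        MODIFY FILE (NAME = N'HPCScheduler', SIZE = 2048MB, FILEGROWTH = 256MB);
    ALTER DATABASE HPCScheduler
        MODIFY FILE (NAME = N'HPCScheduler_log', SIZE = 1024MB, FILEGROWTH = 256MB);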

    It's also worth doing some basic monitoring of the head node to see if there is an obvious disk bottleneck which could be contributing to the situation. We have seen this problem in cases where jobs are writing a lot of data back to head node shares and contending with the database log writes, etc. Probably the easiest way to do this is to use perfmon and monitor the Physical Disk \ Avg. Disk sec/Read and Physical Disk \ Avg. Disk sec/Write counters for the disks on the head node. You are looking for < 20ms or so in each case; numbers up in the 100s of ms indicate a serious problem.
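
    (If leaving perfmon running is awkward, a rough SQL-side check - a sketch only; the figures are cumulative averages since the SQL service last started, not a live counter - is to query sys.dm_io_virtual_file_stats:)

    -- Average I/O latency per database file since the SQL instance started.
    -- The same thresholds apply as for the perfmon counters above: comfortably
    -- under ~20ms is fine, sustained 100s of ms points at a disk bottleneck.
    SELECT DB_NAME(vfs.database_id)                             AS database_name,
           mf.physical_name,
           vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id
    ORDER BY avg_write_ms DESC;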

    Wednesday, May 20, 2009 10:28 PM
  • Jeff,

    1) Yes, the scheduler crashes at the same time the deadlocks are occurring.  Under the default settings it will restart automatically 3 times but stays stopped after the third, which is when we notice it.
    **Setting the service to restart indefinitely is messing up the job states.  Jobs with failed tasks are ending up in the Finished state.
    2) The deadlocks occur in a group.  These storms of deadlocks have occurred 6 times over the last 3 days, not at any specific time of day or with any regular interval between occurrences.
    3) We have 3-5 people at most with the Job Manager open.  Job and task submission is done programmatically.  Tasks are added to a job while it is running.  Jobs can have upwards of 3000-4000 tasks in them when done, though most are in the sub-1000 range.

    The DB is full SQL Server 2005.
    1) A maintenance plan is already in place and runs daily.
    2) Log and DB files are already on separate partitions.
    3) Both the LDF and MDF files have been pre-expanded to give more than adequate space.  We wanted to ensure that we were not going through expansion during a period of heavy load.

    We have watched the headnode via perfmon and there do not seem to be any bottlenecks around disk I/O or CPU.  The disk access is < 20ms even during heavier loads.  Memory on the headnode is not maxed out.

    A) Is it possible to use SQL 2008 instead of 2005?
    B) Is it possible to offload the database to another machine, or must it be on the same server with COMPUTECLUSTER as the instance?

    Thursday, May 21, 2009 12:53 AM
  • Hi Mithlan

    Regarding your two questions.
    A) No, it's not possible with our current v2 pieces to run against SQL 2008. This will be possible when our service pack comes out.
    B) No, unfortunately, although this feature will (probably) be available in our V3 product.

    I don't think SQL 2008 will help this situation though - it sounds like the workload is reasonably light, and the number of concurrent users is small, so we should be handling things fine. You also seem to have done all the right things with your database setup...
     
    There are three problems I think you are having:
    1) the scheduler crashes in response to a deadlock storm
    2) the system is ending up in an inconsistent state when you continually try to restart the scheduler service when it crashes
    3) there are periodic deadlock storms that lead to 1) and 2)

    Regarding tracking down the issues - we will try to repro in-house here, but if we could do some analysis on your cluster that would be really helpful (production use might make this hard, though?).

    If you are able to turn up logging etc., the following would be really helpful:
    1) Set hklm\software\microsoft\hpc\tracelevel to 4 in the registry and leave it like that until the issue reproduces.
    2) Run a SQL Profiler trace of the system during the repro, after adding the deadlock graph event to the events being captured (a trace-flag alternative is sketched just below).
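
    (If running Profiler for long stretches is awkward, an alternative sketch - not required for the repro, just another way to capture the same information - is to have SQL Server write full deadlock graphs to its errorlog via trace flag 1222:)

    -- Write detailed deadlock graphs to the SQL Server errorlog.
    -- The -1 argument makes the flag global; it stays on until it is turned off
    -- or the SQL service restarts.
    DBCC TRACEON (1222, -1);

    -- Confirm which trace flags are currently active.
    DBCC TRACESTATUS (-1);

    -- Turn it back off once the repro has been captured.
    DBCC TRACEOFF (1222, -1);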

    The fact that the deadlocks happen periodically is interesting, as we do have some scheduled tasks that occur - cleanup of old jobs, aggregation of feeds to reporting, cleanup of perf counters, etc. It would be interesting to know if these storms appear at the same time of day, or at roughly 12-hour intervals, etc.?

    If you are able to grab profiles and verbose logs, we'd need you to upload them here ( we'd need the logs from program files\microsoft hpc pack\data\logfiles\ and the event viewer set from application and services logs\microsoft\Hpc Server ). Probably the easiest way to arrange this is if you contact me offline so we can set up an ftp drop. My email is jeffb@microsoft.com

    Thursday, May 21, 2009 3:26 AM
  • Jeff,

    We are getting set up with the additional logging and SQL Profiler trace.  I will contact you as soon as we see a repro of the problem.

    Thank you for the other info on SQL 2008.  

    C) Do we need to restart the HPC services for the logging reg key to be picked up?  We will try to do so when the current jobs complete, but if we don't have to, that would be better.
    D) Another issue we are having is around backing up the database.  When the backup runs, the scheduler dies.  The database is small and the backup only takes 1-2 minutes, so even if it is locking the whole database the time frame is quite small.  We use Backup Exec but have also tried other tools and have the same problem.  We would feel much better if we could get daily backups (the kind of plain T-SQL backup we would compare against is sketched after this list).
    E) Is there a broad time frame you know of for when SP1 / v3 will be out?  (i.e. next few months, next year, tomorrow?)
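
    (For reference, this is the sort of plain T-SQL backup we could run by hand to see whether a native backup also kills the scheduler or whether it is specific to Backup Exec - the database name HPCScheduler and the target path are assumptions on our side:)

    -- Simple full backup of the scheduler database to a local file.
    -- HPCScheduler and the path are placeholders - substitute the real names.
    BACKUP DATABASE HPCScheduler
    TO DISK = N'D:\Backups\HPCScheduler_full.bak'
    WITH INIT,          -- overwrite any existing backup set in this file
         CHECKSUM,      -- verify page checksums while writing the backup
         STATS = 10;    -- progress message every 10%

    -- Optionally verify the backup file afterwards.
    RESTORE VERIFYONLY FROM DISK = N'D:\Backups\HPCScheduler_full.bak';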

    -Scott
    Thursday, May 21, 2009 5:07 PM
  • Hi Scott

    Thanks very much for helping out with the repro, it's much appreciated.

    Regarding your questions:
    C) No, you don't need to restart the services; we will pick up the change dynamically. Please remember to switch it back to 2 once you have got the repro, as we'll generate a lot of logs at level 4.

    D) We are doing significant work around backup for our service pack, and will have some better guidance in the next few weeks. The scheduler crashing during backup is interesting and we will take a look at that. If we did end up needing to quiesce services to guarantee a backup, would that be a significant issue for you?

    I am interested in general in your need for daily backups, as we imagine most customers do not treat the job scheduler data as that business critical - by default we purge old jobs from the system after 5 days, and consider the reporting DB to hold the important information for management reporting from that point on. We are aware of specific customers with judicial requirements to preserve all data, but thought this was pretty rare. It would be great to understand more about your specific requirements (perhaps offline?).

    E) SP1 will be out within the next three months. I can't give any timeframe for v3 publicly yet, I'm afraid.

    cheers
    jeff

     
    Friday, May 22, 2009 3:23 AM
  • Jeff,

    The daily backup has saved us a couple of times when the nodes were stuck in Draining (we see there is a hotfix for that) and we were unable to get them back except by restoring from tape.  It is part of our standard recovery procedures.

    I sent you the repro logs in an email to your account listed above.  Hope they help in finding the problem.  Feel free to send me more questions offline should you have any.

    -Scott
    Wednesday, May 27, 2009 8:36 PM
  • Hi Scott

    Nothing's arrived so far. Can you try sending me a mail without attachments, please, to make sure the communication channels are at least working... (in case our or your firewalls stripped the attachments or some such).

    cheers
    jeff
    Thursday, May 28, 2009 12:40 AM
  • Hi Jeff, your answer to Scott's question, quoted below, has me interested...

    "B) Is it possible to off load the database to another machine or must it be on the same server w/COMPUTECLUSTER as the instance?
    B) no unfortunately, although this feature will ( probably ) be available in our V3 product."

    I have a couple of questions on this, which I guess you may or may not be able to comment on. I have played around with using a remote DB instance under the current release, and it does seem to work under light load, but it is, I'm sure, an unsupported configuration.
    If this does become available, is it intended that the feature work when clustering head nodes? The current requirement is to install a clustered SQL Server instance on the head nodes themselves, which adds significant expense (via licensing and support overhead) to the exercise, particularly when an enterprise clustered SQL Server resource may already be in place.
    How likely is "probable"? :)

    Thanks very much
    Dan

    Thursday, May 28, 2009 8:33 AM
  • Jeff,

    Sent you an email without attachments.

    -Scott
    Friday, May 29, 2009 5:38 PM