Job completed but not all tasks are executed

  • Question

  • In some rare cases a job shows 'Finished' even though not all of its tasks have been executed. The job simply stops without any error message. Has anybody else experienced this?

    Friday, July 15, 2011 12:38 PM

All replies

  • Can you check the job scheduler event logs under EventViewer, "Applications and Services Logs" -> "Microsoft" -> "HPC" -> "Scheduler"? It may have something useful.

    Liwei

     

    Tuesday, July 19, 2011 10:35 PM
  • In my EventViewer -> "Applications and Services Logs" -> "Microsoft" -> "HPC" I do not have a "Scheduler". Do I have to add it somehow?

    Thanks,

    Nadin

    Wednesday, July 20, 2011 7:32 AM
  • We have had this problem off and on with HPC 2008 R2 SP1, and also with R1. Jobs end as 'Finished' even though they still have tasks in the queued and configured states. The Event Viewer location that Liwei asked about contains numerous errors of the types below.

     

    ERROR #1

     

    exec SP_UpdateTaskGroupMaxMinCores 33978, 33977, 0, 1;
    Exception System.Data.SqlClient.SqlException: Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
        at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
        at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
        at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
        at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
        at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
        at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
        at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()

    ERROR #2

     

    An SQL exception occurred while running the transaction
    ExceptionString
    Exception detail: Microsoft.Hpc.Scheduler.Store.StoreTransactionSqlException: An SQL exception occurred while running the transaction ---> System.Data.SqlClient.SqlException: Transaction (Process ID 70) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
        at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
        at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
        at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
        at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
        at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
        at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
        at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()
        --- End of inner exception stack trace ---
        at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.TransactionSqlCommand.Execute(DatabaseConnection db)
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
    Current stack:
        at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
        at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.TaskGroup_UpdateGroupMaxMin(ConnectionToken token, Int32 jobId, Int32 groupId)
        at Microsoft.Hpc.Scheduler.Store.JobEx.UpdateTaskGroup(Int32 groupId)
        at Microsoft.Hpc.Scheduler.JobResource.JobResource.UpdateJobMinMax(SchedulerJobInternal job, PropertyId minProp, PropertyId maxProp, PropertyId computedMinProp, PropertyId computedMaxProp)
        at Microsoft.Hpc.Scheduler.JobResource.JobResource.UpdateJobMinMax(SchedulerJobInternal job)
        at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.UpdateJobResource()
        at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.HandleRunning()
        at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.StateMachine()
        at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
        at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitors()
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
        at System.Threading.ThreadHelper.ThreadStart()
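
Both deadlock errors above end with the advice "Rerun the transaction." As a rough illustration of what that means for any SQL Server client (this is not a description of what the HPC scheduler does internally), the sketch below retries a transaction when it is chosen as the deadlock victim. `DeadlockVictimError`, `run_with_deadlock_retry`, and `flaky_update` are hypothetical names used only for this example; a real database driver would report SQL Server error 1205 through its own exception type.

```python
import random
import time


class DeadlockVictimError(Exception):
    """Hypothetical stand-in for a driver error carrying SQL Server error 1205."""


def run_with_deadlock_retry(transaction, max_attempts=5, base_delay=0.05):
    """Call transaction() and rerun it if it is chosen as a deadlock victim."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockVictimError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff with jitter, so the competing
            # transactions are less likely to collide again.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)


# Usage: a fake transaction that is the deadlock victim twice, then succeeds.
attempts = []


def flaky_update():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockVictimError("Transaction was deadlocked; rerun the transaction.")
    return "committed"


result = run_with_deadlock_retry(flaky_update, base_delay=0.001)
print(result, "after", len(attempts), "attempts")
```

If the retries are exhausted the error is re-raised, so the caller still sees the failure.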

    ERROR #3

     

    Expected to update 37 rows, but actually updated 32, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 19:18:50.897' WHERE ID IN (33075,33900,33901,33902,33903,33904,33905,33906,33907,33909,33910,33912,33913,33914,33915,33916,33917,33918,33919,33920,33922,33923,33924,33925,33926,33927,33928,33929,33930,33931,33932,33933,33934,33935,33936,33937,33939) AND timestamp <= 0x0000000005470E31 SET NOCOUNT ON;
    ExceptionString
    Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 37 rows, but actually updated 32, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 19:18:50.897' WHERE ID IN (33075,33900,33901,33902,33903,33904,33905,33906,33907,33909,33910,33912,33913,33914,33915,33916,33917,33918,33919,33920,33922,33923,33924,33925,33926,33927,33928,33929,33930,33931,33932,33933,33934,33935,33936,33937,33939) AND timestamp <= 0x0000000005470E31 SET NOCOUNT ON;
        at Microsoft.Hpc.Scheduler.Store.DatabaseConnection.EndBatchMode()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
    Current stack:
        at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
        at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
        at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
        at System.Threading.ThreadHelper.ThreadStart()

    ERROR #4

     

    Expected to update 2 rows, but actually updated 1, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 20:15:46.704' WHERE ID IN (33075,33975) AND timestamp <= 0x0000000005484FF5 SET NOCOUNT ON;
    ExceptionString
    Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 2 rows, but actually updated 1, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 20:15:46.704' WHERE ID IN (33075,33975) AND timestamp <= 0x0000000005484FF5 SET NOCOUNT ON;
        at Microsoft.Hpc.Scheduler.Store.DatabaseConnection.EndBatchMode()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
    Current stack:
        at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
        at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
        at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
        at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
        at System.Threading.ThreadHelper.ThreadStart()
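
For readers unfamiliar with the "Expected to update N rows, but actually updated M" message in ERROR #3 and ERROR #4: it is the usual shape of an optimistic concurrency check, where the UPDATE only matches rows whose version stamp is not newer than the one the writer last read, and the affected-row count is then compared with the number of rows requested. The sketch below reproduces that pattern with Python's built-in `sqlite3` module. It is an illustration under assumptions, not the scheduler's actual code: the integer `version` column stands in for the SQL Server `timestamp` column seen in the log, and `touch_jobs` and `OptimisticLockViolation` are hypothetical names.

```python
import sqlite3


class OptimisticLockViolation(Exception):
    """Raised when fewer rows were updated than requested (hypothetical name)."""


def touch_jobs(conn, job_ids, change_time, max_version):
    """Update ChangeTime only for rows whose version is not newer than max_version."""
    placeholders = ",".join("?" for _ in job_ids)
    cur = conn.execute(
        f"UPDATE Job SET ChangeTime = ? WHERE ID IN ({placeholders}) AND version <= ?",
        [change_time, *job_ids, max_version],
    )
    if cur.rowcount != len(job_ids):
        # Some rows were changed by another writer in the meantime.
        conn.rollback()
        raise OptimisticLockViolation(
            f"Expected to update {len(job_ids)} rows, but actually updated {cur.rowcount}"
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Job (ID INTEGER PRIMARY KEY, ChangeTime TEXT, version INTEGER)")
conn.executemany("INSERT INTO Job VALUES (?, ?, ?)", [(33075, "", 10), (33975, "", 10)])
conn.commit()

# Another writer modifies job 33975 after we read version 10.
conn.execute("UPDATE Job SET version = 11 WHERE ID = 33975")
conn.commit()

try:
    touch_jobs(conn, [33075, 33975], "2011-08-15 20:15:46.704", max_version=10)
    outcome = "updated"
except OptimisticLockViolation as exc:
    outcome = str(exc)
print(outcome)
```

In this sketch the whole batch is rolled back when the counts differ. Whether the scheduler retries, skips, or ignores such a batch is not stated anywhere in this thread.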

     

     

    Monday, August 15, 2011 11:35 PM
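
Since the symptom is rare and, per the original question, may come without any error message, it can help to scan exported job and task states for it. The sketch below flags jobs that are marked 'Finished' while some of their tasks never ran. It assumes the states have already been exported into a plain dictionary by whatever tooling is available; the input layout, the state names `Queued` and `Configuring`, the function name, and the sample job IDs are all assumptions for illustration and are not read from any HPC API.

```python
# Task states that mean a task never ran. The names follow the description
# in this thread and are assumptions, not values read from an HPC API.
UNEXECUTED_STATES = {"Queued", "Configuring"}


def find_incomplete_finished_jobs(jobs):
    """Return {job_id: [task_ids]} for Finished jobs that still have unexecuted tasks.

    `jobs` is a hypothetical export of the form
    {job_id: {"state": str, "tasks": {task_id: state}}}.
    """
    suspicious = {}
    for job_id, job in jobs.items():
        if job["state"] != "Finished":
            continue  # only jobs reported as Finished are of interest
        leftover = sorted(
            task_id
            for task_id, state in job["tasks"].items()
            if state in UNEXECUTED_STATES
        )
        if leftover:
            suspicious[job_id] = leftover
    return suspicious


# Made-up sample data, for illustration only.
sample = {
    101: {"state": "Finished", "tasks": {1: "Finished", 2: "Finished"}},
    102: {"state": "Finished", "tasks": {1: "Finished", 2: "Queued", 3: "Configuring"}},
    103: {"state": "Running", "tasks": {1: "Running", 2: "Queued"}},
}
report = find_incomplete_finished_jobs(sample)
print(report)
```

Only job 102 in the sample matches the symptom; job 103 has a queued task too, but it is still running, so it is not reported.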
  • Hello all,

    My customer reports the exact same issue, but unfortunately we see absolutely no errors.

    The failure occurs rarely and can hit any job or task.

     

    Any other ideas or experiences?

    This is the job template:

    <JobTemplate Name="JobTemplateA" Description="Runtime: 180 Days, on hpcwnodes" CreateTime="14.04.2011 15:43:59">

        <TemplateItem PropertyName="MinCores" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="MaxCores" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="MinSockets" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="MaxSockets" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="MinNodes" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="MaxNodes" Default="1" MinVal="1" MaxVal="2147483647" />

        <TemplateItem PropertyName="UnitType" Default="Node" ValueRange="" />

        <TemplateItem PropertyName="IsExclusive" Default="False" ValueRange="" />

        <TemplateItem PropertyName="RunUntilCanceled" Default="False" ValueRange="" />

        <TemplateItem PropertyName="ExpandedPriority" Default="2000" MinVal="0" MaxVal="2000" />

        <TemplateItem PropertyName="AutoCalculateMax" Default="True" ValueRange="" />

        <TemplateItem PropertyName="AutoCalculateMin" Default="True" ValueRange="" />

        <TemplateItem PropertyName="FailOnTaskFailure" Default="False" ValueRange="" />

        <TemplateItem PropertyName="Preemptable" Default="True" ValueRange="" />

        <TemplateItem PropertyName="NodeGroups" Default="HPCWNodes" ValueRange="" RequiredValues="" />

        <TemplateItem P

    Thank you for any input or ideas!!!

     

    Sylvia

     

    Friday, November 25, 2011 9:57 AM