Job completed but not all tasks are executed
-
July 15, 2011 12:38
In some rare cases the job shows 'Finished' but not all tasks have been executed. The job suddenly stops without any error message. Has anyone else experienced this?
All replies
-
July 19, 2011 22:35
Can you check the job scheduler event logs in Event Viewer, under "Applications and Services Logs" -> "Microsoft" -> "HPC" -> "Scheduler"? They may contain something useful.
Liwei
-
July 20, 2011 7:32
In my EventViewer -> "Applications and Services Logs" -> "Microsoft" -> "HPC" I do not have a "Scheduler". Do I have to add it somehow?
Thanks,
Nadin
-
August 15, 2011 23:35
We have had this problem off and on with HPC 2008 R2 SP1 as well as with R1. Jobs end as 'Finished' even though they still have tasks in the queued and configured states. There are numerous errors of the types below in the Event Viewer location that Liwei asked about.
ERROR #1
exec SP_UpdateTaskGroupMaxMinCores 33978, 33977, 0, 1; Exception System.Data.SqlClient.SqlException: Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
   at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()
ERROR #2
An SQL exception occurred while running the transaction
ExceptionString Exception detail: Microsoft.Hpc.Scheduler.Store.StoreTransactionSqlException: An SQL exception occurred while running the transaction ---> System.Data.SqlClient.SqlException: Transaction (Process ID 70) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.SqlCommand.RunExecuteNonQueryTds(String methodName, Boolean async)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
   at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()
   --- End of inner exception stack trace ---
   at Microsoft.Hpc.Scheduler.Store.StoreSqlCommand.ExecuteNonQuery()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.TransactionSqlCommand.Execute(DatabaseConnection db)
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
Current stack:
   at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.TaskGroup_UpdateGroupMaxMin(ConnectionToken token, Int32 jobId, Int32 groupId)
   at Microsoft.Hpc.Scheduler.Store.JobEx.UpdateTaskGroup(Int32 groupId)
   at Microsoft.Hpc.Scheduler.JobResource.JobResource.UpdateJobMinMax(SchedulerJobInternal job, PropertyId minProp, PropertyId maxProp, PropertyId computedMinProp, PropertyId computedMaxProp)
   at Microsoft.Hpc.Scheduler.JobResource.JobResource.UpdateJobMinMax(SchedulerJobInternal job)
   at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.UpdateJobResource()
   at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.HandleRunning()
   at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.StateMachine()
   at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
   at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitors()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
ERROR #3
Expected to update 37 rows, but actually updated 32, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 19:18:50.897' WHERE ID IN (33075,33900,33901,33902,33903,33904,33905,33906,33907,33909,33910,33912,33913,33914,33915,33916,33917,33918,33919,33920,33922,33923,33924,33925,33926,33927,33928,33929,33930,33931,33932,33933,33934,33935,33936,33937,33939) AND timestamp <= 0x0000000005470E31 SET NOCOUNT ON;
ExceptionString Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 37 rows, but actually updated 32, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 19:18:50.897' WHERE ID IN (33075,33900,33901,33902,33903,33904,33905,33906,33907,33909,33910,33912,33913,33914,33915,33916,33917,33918,33919,33920,33922,33923,33924,33925,33926,33927,33928,33929,33930,33931,33932,33933,33934,33935,33936,33937,33939) AND timestamp <= 0x0000000005470E31 SET NOCOUNT ON;
   at Microsoft.Hpc.Scheduler.Store.DatabaseConnection.EndBatchMode()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
Current stack:
   at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
   at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
ERROR #4
Expected to update 2 rows, but actually updated 1, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 20:15:46.704' WHERE ID IN (33075,33975) AND timestamp <= 0x0000000005484FF5 SET NOCOUNT ON;
ExceptionString Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 2 rows, but actually updated 1, for SQL command: SET NOCOUNT OFF; SET NOCOUNT ON; SET NOCOUNT OFF; UPDATE Job SET ChangeTime = N'2011-08-15 20:15:46.704' WHERE ID IN (33075,33975) AND timestamp <= 0x0000000005484FF5 SET NOCOUNT ON;
   at Microsoft.Hpc.Scheduler.Store.DatabaseConnection.EndBatchMode()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
Current stack:
   at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
   at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction()
   at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
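For anyone reading errors #3 and #4: the "Expected to update N rows, but actually updated M" message looks like a row-count check on a conditional UPDATE (note the "AND timestamp <= ..." clause in the logged SQL). Below is a minimal Python sketch of that general pattern, not the scheduler's actual code. It uses an in-memory SQLite database; the `version` column is an assumed stand-in for the SQL Server `timestamp` column, and `touch_jobs` and `OptimisticLockViolation` are hypothetical names chosen for illustration.

```python
import sqlite3


class OptimisticLockViolation(Exception):
    """Hypothetical exception: fewer rows were updated than expected."""


def touch_jobs(conn, job_ids, max_version, change_time):
    """Set ChangeTime only on rows whose version is still <= max_version."""
    placeholders = ",".join("?" for _ in job_ids)
    sql = (
        "UPDATE Job SET ChangeTime = ?, version = version + 1 "
        f"WHERE ID IN ({placeholders}) AND version <= ?"
    )
    cur = conn.execute(sql, [change_time, *job_ids, max_version])
    if cur.rowcount != len(job_ids):
        # Some row was changed by another writer since we read it: undo everything.
        conn.rollback()
        raise OptimisticLockViolation(
            f"Expected to update {len(job_ids)} rows, "
            f"but actually updated {cur.rowcount}"
        )
    conn.commit()
    return cur.rowcount


# Demo data (job IDs borrowed from error #4; everything else is made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Job (ID INTEGER PRIMARY KEY, ChangeTime TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO Job VALUES (?, ?, ?)", [(33075, "t0", 1), (33975, "t0", 1)]
)
conn.commit()

# Simulate a concurrent writer bumping job 33975 after we read version 1.
conn.execute("UPDATE Job SET version = 5 WHERE ID = 33975")
conn.commit()

try:
    touch_jobs(conn, [33075, 33975], max_version=1,
               change_time="2011-08-15 20:15:46.704")
except OptimisticLockViolation as exc:
    print(exc)  # Expected to update 2 rows, but actually updated 1
```

In this sketch the whole batch is rolled back when the counts differ, so job 33075 keeps its old ChangeTime. Whether the HPC scheduler retries, skips, or gives up in that situation is not something the logs above tell us.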
-
November 25, 2011 9:57
Hello all,
My customer reports the exact same issue, but unfortunately we see absolutely no errors.
The failure occurs rarely and can affect any job or task.
Any other ideas or experience?
This is the job template:
<JobTemplate Name="JobTemplateA" Description="Runtime: 180 Days, on hpcwnodes" CreateTime="14.04.2011 15:43:59">
<TemplateItem PropertyName="MinCores" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="MaxCores" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="MinSockets" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="MaxSockets" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="MinNodes" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="MaxNodes" Default="1" MinVal="1" MaxVal="2147483647" />
<TemplateItem PropertyName="UnitType" Default="Node" ValueRange="" />
<TemplateItem PropertyName="IsExclusive" Default="False" ValueRange="" />
<TemplateItem PropertyName="RunUntilCanceled" Default="False" ValueRange="" />
<TemplateItem PropertyName="ExpandedPriority" Default="2000" MinVal="0" MaxVal="2000" />
<TemplateItem PropertyName="AutoCalculateMax" Default="True" ValueRange="" />
<TemplateItem PropertyName="AutoCalculateMin" Default="True" ValueRange="" />
<TemplateItem PropertyName="FailOnTaskFailure" Default="False" ValueRange="" />
<TemplateItem PropertyName="Preemptable" Default="True" ValueRange="" />
<TemplateItem PropertyName="NodeGroups" Default="HPCWNodes" ValueRange="" RequiredValues="" />
<TemplateItem P
Thank you for any input or ideas!!!
Sylvia