locked
Error from Node: <compute node name>: The job identifier XYZ specified for the user 'domain\user' is already in use RRS feed

  • Question

  • Hi,

    Some of our jobs are failing with on a HPC Server 2012 (NOT R2) cluster.  The job typically spawns about 3000 tasks; and about half way though the job run the tasks start failing with the error message as seen in the subject line.  

    • The error is first reported from a compute node that has previously executed several hundred tasks successfully.  All subsequent tasks on that compute node then fail with the same error. The job eventually fails.
    • The compute node is not constant:  tasks failed on a different compute node when we tried to reproduce the error.
    • The failures are occasional but always seem to happen for large jobs (the most important ones!).
    • The job is a normal batch job

    The error message appears to indicate an internal HPC Server issue so we analyzed the HPC Server logs.  There following errors logged in the "Microsoft-HPC-Scheduler/Operational" event log around the time the tasks start failing. Can you please explain me what might be causing these task failures?  is the SQL Server error leading to the eventual task failure? Perhaps the SQL Server is being heavily loaded.  Many thanks in advance.

    Microsoft-HPC-Scheduler/Operational" event log

    Event1.1:

    The operation could not be completed because the affected object is already in use by the scheduler.  Please try again later.
    [Store] 
    Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: The operation could not be completed because the affected object is already in use by the scheduler.  Please try again later.
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.RunTransaction(ConnectionToken& token, StoreTransaction transaction)
    Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.RunTransaction(ConnectionToken& token, StoreTransaction transaction)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.RunTransaction(StoreTransactionWrapper wrapper)
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorTransaction.Commit()
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitors()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

    Event1.2 (same time as Event1.1):

    Expected to update 1 rows, but actually updated 0, for SQL command:
    SET NOCOUNT ON;
    SET NOCOUNT OFF;
    SET NOCOUNT OFF;
    UPDATE Job SET
    ChangeTime = @Param0,RequiredNodes = @Param1,TaskLevelUpdateTime = @Param2,MinMaxUpdateTime = @Param3,ComputedMinCores = @Param4,ComputedMaxCores = @Param5
    WHERE ID = 165 AND timestamp <= 0x000000000049D471
    SET NOCOUNT ON;

    [Store] 
    Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 1 rows, but actually updated 0, for SQL command:
    SET NOCOUNT ON;
    SET NOCOUNT OFF;
    SET NOCOUNT OFF;
    UPDATE Job SET
    ChangeTime = @Param0,RequiredNodes = @Param1,TaskLevelUpdateTime = @Param2,MinMaxUpdateTime = @Param3,ComputedMinCores = @Param4,ComputedMaxCores = @Param5
    WHERE ID = 165 AND timestamp <= 0x000000000049D471
    SET NOCOUNT ON;

       at Microsoft.Hpc.Scheduler.Store.WriteTableQuery.ExecuteUpdate(DatabaseConnection db, ICollection`1 idRange)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsContainer.Execute(SchedulerStoreInternal store, DatabaseConnection db, TransactionProcessor tp, Byte[] timeStamp)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
    Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.RunTransaction(ConnectionToken& token, StoreTransaction transaction)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.RunTransaction(StoreTransactionWrapper wrapper)
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorTransaction.Commit()
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitors()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

    Event 2.1:

    The operation could not be completed because the affected object is already in use by the scheduler.  Please try again later.
    [Store] 
    Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: The operation could not be completed because the affected object is already in use by the scheduler.  Please try again later.
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
       at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
    Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

    Event 2.2 (same time as event 2.1):

    Expected to update 1 rows, but actually updated 0, for SQL command:
    SET NOCOUNT ON;
    SET NOCOUNT OFF;
    SET NOCOUNT OFF;
    UPDATE Job SET
    ChangeTime = @Param0
    WHERE ID = 165 AND timestamp <= 0x000000000049B354
    SET NOCOUNT ON;

    [Store] 
    Exception detail: Microsoft.Hpc.Scheduler.Store.OptimisticLockViolationException: Expected to update 1 rows, but actually updated 0, for SQL command:
    SET NOCOUNT ON;
    SET NOCOUNT OFF;
    SET NOCOUNT OFF;
    UPDATE Job SET
    ChangeTime = @Param0
    WHERE ID = 165 AND timestamp <= 0x000000000049B354
    SET NOCOUNT ON;

       at Microsoft.Hpc.Scheduler.Store.WriteTableQuery.ExecuteUpdate(DatabaseConnection db, ICollection`1 idRange)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsContainer.Execute(SchedulerStoreInternal store, DatabaseConnection db, TransactionProcessor tp, Byte[] timeStamp)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.SetPropsToDB()
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.WriteProps()
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
    Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception e, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.Store.TransactionProcessor.RunTransaction(Boolean shouldRetry)
       at Microsoft.Hpc.Scheduler.Store.JobQueryContext.TouchJobs(SchedulerStoreInternal store)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal._HouseKeeper()
       at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()



    Tuesday, June 9, 2015 3:57 AM

All replies

  • The error message means that when scheduler try to startJob XYZ on Compute node, it finds that there is already a job entity with the same job id XYZ on the compute node and it won't belong to the same user account.

    In that case, it is quite possible caused by SQL overloaded as failed to get the job info including the userAccount info. Thus please check whether the time for the error from compute node is the same as the SQL timeout error.

    It would also good to check whether you need upgrade your SQL, consider:

    1. Express version to Standard

    2. Remote dedicated SQL server instead of installing SQL on headnode

    3. High disk IO and enough RAM

    etc.


    Qiufang Shi

    Wednesday, June 10, 2015 2:25 AM