none
Historical job-queue records? RRS feed

  • Question

  • How do I control retention of job records in the Win2k8 HPC scheduling?

    We need to keep them around for a long time (a year +) for accounting purposes.

    Thanks,
    -Luke
    Wednesday, August 12, 2009 6:37 PM

Answers

  • Luke,
    Jobs are retained in the database for 5 days by default.  The number of days for which they are stored can be adjusted using the TTLCompletedJobs cluster parameter, accessible either via the "cluscfg setparams" command, or through the scheduler options dialog (HPC Cluster Manager -> Job Management -> Options -> Job Scheduler Configuration -> Job History).

    You should realize that setting this may have implications for performance of your database, and of course on the size of the database.  There are more details on that available in this whitepaper: http://go.microsoft.com/fwlink/?LinkId=137791

    Thanks!
    Josh

    -Josh
    Thursday, August 13, 2009 6:15 PM
    Moderator

All replies

  • Luke,
    Jobs are retained in the database for 5 days by default.  The number of days for which they are stored can be adjusted using the TTLCompletedJobs cluster parameter, accessible either via the "cluscfg setparams" command, or through the scheduler options dialog (HPC Cluster Manager -> Job Management -> Options -> Job Scheduler Configuration -> Job History).

    You should realize that setting this may have implications for performance of your database, and of course on the size of the database.  There are more details on that available in this whitepaper: http://go.microsoft.com/fwlink/?LinkId=137791

    Thanks!
    Josh

    -Josh
    Thursday, August 13, 2009 6:15 PM
    Moderator
  • Thanks -- that answers my question!

    We've observed those performance issues on Win2k3, but the need for historical data remains.

    We worked around the problem by creating a "historical records" database, and then we run a script to manually migrate the records over.  The performance grinds to near a halt when we have more than a million job-records in the database on that system.  We have a somewhat beefier database-server configured for the Win2k8 server.

    Our script (of unknown origin) for the Win2k3 cluster looks like this:
    -- ===========================================================================
    -- MoveAndDeleteJobRows.sql
    --
    -- Copies JobQueue and TaskQueue rows from the primary CCPClusterService
    -- database into the backup ClusterRecords database, ignoring rows that may
    -- already have been copied, and then deletes the copied rows from
    -- CCPClusterService.
    --
    -- This script is intended to be run regularly. It allows complete,
    -- unabridged historical records of cluster jobs to be maintained, while
    -- keeping the primary database small to work around performance and
    -- stability issues with the Compute Cluster Manager/Scheduler.
    -- ===========================================================================
    
    BEGIN TRANSACTION;
    
    -- ---------------------------------------------------------------------------
    -- 1: Copy new JobQueue rows from primary database to backup database
    --
    -- We're copying only those rows that the scheduler has marked finished
    -- (EndTime isn't null) and whose EndTime is at least 24 hours old (i.e.,
    -- we let 24 hours go by before copying the job entry into the backup
    -- database).
    --
    -- We must take care to NOT copy job rows that already exist in the backup
    -- database (as determined by pkey match), otherwise the insert will fail.
    -- ---------------------------------------------------------------------------
    
    SET IDENTITY_INSERT ClusterRecords.dbo.Jobqueue ON;
    
    INSERT INTO ClusterRecords.dbo.JobQueue
    (
    	ID, State, InternalState, CreateTime, SubmitTime, StartTime, EndTime,
    	MinNumCPUs, MaxNumCPUs, RunUntilCanceled, AskedNodes, Priority,
    	Exclusive, RunTime, MaxTask, Name, Project, SoftwareLicense,
    	Backfill, Owner, JobUser, Password, Message, NumberOfRequeues
    )
    SELECT
    	ID, State, InternalState, CreateTime, SubmitTime, StartTime, EndTime,
    	MinNumCPUs, MaxNumCPUs, RunUntilCanceled, AskedNodes, Priority,
    	Exclusive, RunTime, MaxTask, Name, Project, SoftwareLicense,
    	Backfill, Owner, JobUser, Password, Message, NumberOfRequeues
    FROM CCPClusterService.dbo.JobQueue
    WHERE CCPClusterService.dbo.JobQueue.ID
    	NOT IN (SELECT ID FROM ClusterRecords.dbo.JobQueue)
    AND CCPClusterService.dbo.JobQueue.EndTime IS NOT NULL
    AND DATEDIFF(hour, CCPClusterService.dbo.JobQueue.EndTime, getdate()) > 24;
    
    -- ---------------------------------------------------------------------------
    -- 2: Copy new TaskQueue rows from primary database to backup database
    --
    -- We're copying tasks whose associated job has already been copied (as
    -- determined by an ID match). There is no time checking here -- as soon
    -- as we copy over a job row, we then immediately copy over all of that
    -- job's task rows.
    -- 
    -- We must take care to NOT copy job rows that already exist in the backup
    -- database (as determined by pkey match), otherwise the insert will fail.
    -- ---------------------------------------------------------------------------
    
    INSERT INTO ClusterRecords.dbo.TaskQueue
    (
    	ID, JobID, Name, State, InternalState, CreateTime, SubmitTime,
    	StartTime, EndTime, NumberOfRequeues, CommandLine, Rerunnable,
    	RunTime, MinNumCPUs, MaxNumCPUs, AskedNodes, Exclusive,
    	Stdin, Stdout, Stderr, WorkDir, Depend, Message, ExitCode
    )
    SELECT
    	ID, JobID, Name, State, InternalState, CreateTime, SubmitTime,
    	StartTime, EndTime, NumberOfRequeues, CommandLine, Rerunnable,
    	RunTime, MinNumCPUs, MaxNumCPUs, AskedNodes, Exclusive,
    	Stdin, Stdout, Stderr, WorkDir, Depend, Message, ExitCode
    FROM CCPClusterService.dbo.TaskQueue
    WHERE EXISTS
    	(
    		SELECT ID
    		FROM ClusterRecords.dbo.JobQueue
    		WHERE ID = CCPClusterService.dbo.TaskQueue.JobID
    	)
    AND NOT EXISTS
    	(
    		SELECT ID, JobID
    		FROM ClusterRecords.dbo.TaskQueue
    		WHERE ID = CCPClusterService.dbo.TaskQueue.ID
    		AND JobID = CCPClusterService.dbo.TaskQueue.JobID
    	);
    
    -- ---------------------------------------------------------------------------
    -- 3: Remove duplicated TaskQueue rows from primary database
    --
    -- We'll remove all task rows whose associated job has been marked finished
    -- (EndTime isn't null) and whose EndTime is at least 48 hours old (i.e.,
    -- we let 48 hours go by before deleting a task entry).
    --
    -- For maximum safety (okay, paranoia), we verify (via an ID match) that
    -- the deleted rows have been copied to the backup database.
    -- ---------------------------------------------------------------------------
    
    DELETE tq
    FROM
    	CCPClusterService.dbo.TaskQueue tq,
    	CCPClusterService.dbo.JobQueue jq
    WHERE tq.JobID = jq.ID
    AND EXISTS
    	(
    		SELECT ID, JobID
    		FROM ClusterRecords.dbo.TaskQueue
    		WHERE ID = tq.ID
    		AND JobID = tq.JobID
    	)
    AND jq.EndTime IS NOT NULL
    AND DATEDIFF(hour, jq.EndTime, getdate()) > 48;
    
    -- ---------------------------------------------------------------------------
    -- 4: Remove duplicated JobQueue rows from primary database
    --
    -- We'll remove all job rows which the scheduler has marked finished
    -- (EndTime isn't null) and whose EndTime is at least 48 hours old (i.e.,
    -- we let 48 hours go by before deleting a job entry).
    --
    -- For maximum safety (okay, paranoia), we verify (via an ID match) that
    -- the deleted rows have been copied to the backup database.
    -- ---------------------------------------------------------------------------
    
    DELETE FROM CCPClusterService.dbo.JobQueue
    WHERE ID IN (SELECT ID FROM ClusterRecords.dbo.JobQueue)
    AND EndTime IS NOT NULL
    AND DATEDIFF(hour, EndTime, getdate()) > 48;
    
    -- ---------------------------------------------------------------------------
    
    COMMIT;
    I then create a view which contains records from both the cluster and historical databases, and run my accounting queries against that.
    • Edited by Luke Scharf Monday, August 17, 2009 3:25 PM Added a detail
    Monday, August 17, 2009 2:34 PM