Job Scheduler/Database Issue at 2am Every Day

  • Question

  • We have 2 HPC clusters, both with the same problem.  If a job needs to access the database to schedule tasks at 2am, I receive an event with ID 0:

    Microsoft.Hpc.Scheduler.Properties.SchedulerException: Database exception:Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.RunTransaction(StoreTransactionWrapper wrapper)
       at Microsoft.Hpc.Scheduler.ResourceController.TaskMonitor.Run()

    This leaves the nodes that were associated with the job in a hung state and the job stuck in Canceling.  I then have to manually remove the entries for the job from the dbo.resources table and reboot the nodes.  I am going to run a database trace tonight, but was curious whether anyone knows of an internal HPC process that runs at 2am and could cause this behavior, and whether I can reschedule it or turn it off.

    Tuesday, June 29, 2010 3:23 PM

All replies

  • Hi,

    The internal HPC process you are hitting is the database cleanup of old jobs, which by default removes jobs older than 5 days.

    This can be configured with cluster parameters ('cluscfg setparams' command):

    TtlCompletedJobs  : 5  (jobs older than 5 (default) days will be removed)
    JobCleanUpHour    : 2  (scheduler will check for old jobs at 2AM (default) every day)
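
To make the interaction of the two parameters concrete, here is a minimal Python sketch of the behavior described above. It is only a model of the description in this post, not the scheduler's actual implementation; the function names and the sample job data are hypothetical.

```python
from datetime import datetime, timedelta

def next_cleanup(now, cleanup_hour=2):
    """Return the next time the daily cleanup would start.

    cleanup_hour models JobCleanUpHour (default 2, i.e. 2 AM).
    """
    run = now.replace(hour=cleanup_hour, minute=0, second=0, microsecond=0)
    if run <= now:
        # Today's cleanup hour has already passed, so the next run is tomorrow.
        run += timedelta(days=1)
    return run

def jobs_to_purge(completed_at, run_time, ttl_days=5):
    """Return IDs of completed jobs older than ttl_days at run_time.

    ttl_days models TtlCompletedJobs (default 5 days).
    completed_at maps a hypothetical job ID to its completion time.
    """
    cutoff = run_time - timedelta(days=ttl_days)
    return [job_id for job_id, finished in completed_at.items() if finished < cutoff]

if __name__ == "__main__":
    now = datetime(2010, 6, 29, 15, 23)
    jobs = {101: datetime(2010, 6, 20, 12, 0), 102: datetime(2010, 6, 28, 9, 0)}
    run = next_cleanup(now)
    print("next cleanup:", run)
    print("purged with default TTL:", jobs_to_purge(jobs, run))
    print("purged with TTL 50000:", jobs_to_purge(jobs, run, ttl_days=50000))
```

With a very large TTL the cutoff moves so far into the past that no job qualifies, which is why raising TtlCompletedJobs effectively disables the purge even though the check itself still runs at the configured hour.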

    Regards,
    Łukasz

    Tuesday, June 29, 2010 4:10 PM
  • Is there a value that turns this off completely, or a way to set it up to run manually?  We are starting to have jobs that run for days, and they are failing because of this process.
    Tuesday, June 29, 2010 5:53 PM
  • You can set TtlCompletedJobs to its maximum of 50000 days to avoid this. To run the cleanup manually, set it back to the default and restart the HpcScheduler service.
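
The two procedures in this reply can be written down as command sequences. The Python sketch below only assembles the command lines so they can be reviewed before anyone runs them on a head node. The `name=value` form of `cluscfg setparams` and the use of `net stop` / `net start` with the service name `HpcScheduler` are assumptions based on this thread; check them against the built-in help on your cluster before use. The function names are hypothetical.

```python
def disable_cleanup_commands(max_ttl_days=50000):
    """Commands that push the purge threshold out to the maximum TTL.

    Assumes 'cluscfg setparams' accepts name=value pairs.
    """
    return [f"cluscfg setparams TtlCompletedJobs={max_ttl_days}"]

def manual_cleanup_commands(default_ttl_days=5):
    """Commands that restore the default TTL and restart the scheduler service.

    Per the reply above, restarting HpcScheduler after restoring the default
    is what triggers the cleanup; the service name is assumed.
    """
    return [
        f"cluscfg setparams TtlCompletedJobs={default_ttl_days}",
        "net stop HpcScheduler",
        "net start HpcScheduler",
    ]

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in disable_cleanup_commands() + manual_cleanup_commands():
        print(cmd)
```

Since restarting the scheduler is the step that triggers the cleanup, a manual run is best scheduled for a window when no long-running jobs need the database.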

    Which version of Windows HPC do you use?

    Tuesday, June 29, 2010 6:34 PM
  • HPC 2008 SP1 on one cluster and SP2 on the other.
    Tuesday, June 29, 2010 7:01 PM