Sometimes a job with all tasks finished does not move to the Finished state

  • Question

  • Hello,

    I am using HPC 2012.
    Lately we have experienced failures where a job has all of its tasks completed but gets stuck and does not transition to the Finished state. It stays in the Running state until it is cancelled manually.
    When I look at the HPC logs in the Event Viewer, it seems this may be related to a failure in the HPC SQL Server database. I see errors like:

     - Job 1563072 can't finish because the data is out of sync, please restart scheduler service to fix this issue
     - SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\Program Files\Microsoft HPC Pack 2012\SQLDB\HPCScheduler_log.ldf] in database id 6.  The OS file handle is 0x0000000000000934.  The offset of the latest long I/O is: 0x0000000b0f0400

     - The scheduler was unable to commit a transaction.
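
    For anyone triaging similar logs, here is a minimal sketch of pulling the key fields out of the long-I/O warning quoted above, so stalls can be counted per file. The regular expression is built only from the message text in this post, and the function name `parse_long_io` is hypothetical; other SQL Server versions may word the message differently.

```python
import re

# Pattern derived from the long-I/O warning quoted in this thread (an assumption:
# other SQL Server builds or locales may phrase the message differently).
LONG_IO = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking longer than "
    r"(?P<seconds>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database id (?P<db_id>\d+)"
)


def parse_long_io(line):
    """Return the fields of a long-I/O warning, or None if the line is not one."""
    m = LONG_IO.search(line)
    if m is None:
        return None
    return {
        "count": int(m["count"]),      # number of slow I/O requests reported
        "seconds": int(m["seconds"]),  # threshold the requests exceeded
        "file": m["file"],             # database file that was slow
        "db_id": int(m["db_id"]),
    }


# Sample line copied from the post above (truncated after the database id).
sample = (
    r"SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than "
    r"15 seconds to complete on file "
    r"[D:\Program Files\Microsoft HPC Pack 2012\SQLDB\HPCScheduler_log.ldf] "
    r"in database id 6."
)
info = parse_long_io(sample)
print(info)
```

    Here the slow file is the scheduler database's transaction log (`HPCScheduler_log.ldf`), which fits the "unable to commit a transaction" error that accompanies it.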

    Has anyone encountered this issue before? Is there a patch that fixes it?

    Thanks

    Friday, August 18, 2017 2:13 PM

Answers

  • Hi,

      Which exact HPC Pack version are you using? And how long do you keep your job history data? The default is 5 days.

      Judging by the errors, a SQL transaction failed. Usually it is retried, but if there are too many job/task entries in the DB, the transaction might never succeed. To fix the issue, you could try the approaches below:

    1. Upgrade your cluster to Update 3 and apply this QFE: https://www.microsoft.com/en-us/download/details.aspx?id=54772 , which includes "Improve performance (added a few SQL index) when there is huge historical data;"

    2. Upgrade your SQL Server hardware to make it more powerful.

    3. Reduce the length of the job history.


    Qiufang Shi

    • Marked as answer by cguevaramari Thursday, September 14, 2017 4:50 PM
    Monday, August 21, 2017 2:39 AM

All replies

  • Hi,

    Thanks for the feedback. To answer your questions:
     - I am using HPC 2012 R2 Update 2.
     - Our job history is currently 3 days long.

    As you mentioned, when checking the scheduler logs I see about 4 retries of the SQL query; unfortunately, all of them fail.
    Is there a way to configure the time the scheduler waits between SQL query retries? Right now the 4 retries are done within a few minutes.
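
    To illustrate why the spacing matters, here is a small sketch comparing how much wall-clock time 4 retries cover with a fixed interval versus an exponentially growing one. This is not HPC Pack's retry logic or a setting it exposes; the function `retry_schedule`, the 30-second base delay, and the growth factor are all hypothetical values chosen for the example.

```python
def retry_schedule(attempts, base_delay, factor=1.0):
    """Return the wait in seconds before each retry: base_delay * factor**i."""
    return [base_delay * factor**i for i in range(attempts)]


# Hypothetical numbers: 4 retries, starting 30 seconds apart.
fixed = retry_schedule(4, 30)               # same wait every time
backoff = retry_schedule(4, 30, factor=4)   # wait grows 4x per retry

# Total window each schedule covers; a storage stall that lasts longer
# than this window makes every retry fail.
print("fixed interval covers", sum(fixed), "seconds")
print("exponential backoff covers", sum(backoff), "seconds")
```

    With the same number of attempts, the growing schedule spans a much longer window, so it has a better chance of outlasting a temporary slowdown on the database disk.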

    Also, is there a way to connect to the scheduler database? I tried to connect remotely with SQL Server Management Studio, but the database does not appear to be accessible.

    Thanks for your help
    Monday, August 21, 2017 12:37 PM
  • Please use a SQL admin account to connect to your SQL database.


    Qiufang Shi

    Tuesday, August 22, 2017 1:41 AM
  • We finally moved our head node to a bigger machine, and the issues are gone.
    Thursday, September 14, 2017 4:51 PM