Job count not matching

    Question

  • Hello,
    While looking at some stats, we noticed that the count of jobs in the 'Job Throughput Report' is different from the number of jobs in 'Job Management > All Jobs'.

    For example, in May...
    Job Throughput Report: 395 failed jobs
    If I look at all the failed jobs in 'Job Management > All Jobs > Failed', I have 114.

    Where is the discrepancy?

    Wednesday, 6 June 2018 3:03 PM

Answers

  • Hi, I think your job has been retried four times and all of the attempts failed. Each retry gets its own requeue ID for the job, so internally we treat the attempts as separate entries. That is why the report shows a higher number of failed jobs.

    For example, if a job fails, the cluster admin requeues it, and it then passes, we still count one failed job in the history.
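
    To illustrate the counting difference, here is a minimal Python sketch. It assumes the first two columns of the "HpcReportingSp.GetJobHistory" output quoted later in this thread are the JobID and the requeue ID; the function names are hypothetical and this is not how HPC Pack itself computes either number.

```python
# Rows taken from the GetJobHistory output for job 54043 quoted in this thread.
# Assumed layout: (JobID, requeue ID, state).
history_rows = [
    (54043, 0, "Failed"),
    (54043, 1, "Failed"),
    (54043, 2, "Failed"),
    (54043, 3, "Failed"),
]


def count_failed_attempts(rows):
    """Count every failed history row, one per (JobID, requeue ID) pair."""
    return sum(1 for _job_id, _requeue_id, state in rows if state == "Failed")


def count_failed_jobs(rows):
    """Count each JobID once, however many of its attempts failed."""
    return len({job_id for job_id, _requeue_id, state in rows if state == "Failed"})


if __name__ == "__main__":
    print("failed attempts:", count_failed_attempts(history_rows))
    print("failed jobs:", count_failed_jobs(history_rows))
```

    Counting per attempt gives 4 for this job, while counting per JobID gives 1. Note that the per-JobID count is only an approximation of the 'Job Management' view: a job that failed once and then passed after a requeue would still appear in this history-based count.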


    Qiufang Shi

    Friday, 8 June 2018 3:31 AM

All replies

  • Hi, could you check the failed jobs in the reporting database to see what the difference is?

    Did you have failed admin jobs in May?


    Qiufang Shi

    Thursday, 7 June 2018 2:37 AM
  • I think I found part of the answer. The stored procedure "HpcReportingSp.GetJobHistory", which I'm guessing is one of the procedures that "Charts and Reports" uses, returns what appears to be the number of failed tasks in a job.

    Whereas looking at 'Job Management > All Jobs > Failed' only returns a single JobID.

    For example, we had a failed JobID 54043. Looking in Job Management, I have a single entry:

    54043 Cisl-cft-106.req74055 Failed [UserID] 100% 05/10/2018 10:03 1-100 Cores Task 54043.1 failed. Please check the failed task for more details on the failure.

    The Stored Procedure "HpcReportingSp.GetJobHistory" returns 4 entries. Same JobID but different RequestIDs:

    54043 0 Cisl-cft-106.req74055 Failed 02:41.1 02:03.0 02:03.2 [UserID]
    54043 1 Cisl-cft-106.req74055 Failed 03:19.6 02:41.8 02:41.9 [UserID]
    54043 2 Cisl-cft-106.req74055 Failed 03:57.4 03:19.8 03:19.9 [UserID]
    54043 3 Cisl-cft-106.req74055 Failed 04:35.5 03:58.0 03:58.1 [UserID]

    So is the Stored Procedures query returning the number of failed tasks within a job?





    Thursday, 7 June 2018 2:57 PM
  • UPDATE:
    While in Job Manager, I looked at the details of failed JobID 54043. Viewing the failed task, I see:

    [05/10/18 10:03:58.501][54043] Begin

    Trouble creating directory: D: try_attempts remaining: 1
    Exception in mkdirs catch(...)
    [05/10/18 10:04:04.548][54043] I copied the files
    Trouble creating directory: D: try_attempts remaining: 3
    Trouble creating directory: D: try_attempts remaining: 2
    Trouble creating directory: D: try_attempts remaining: 1

    I'm guessing the 4 entries from my previous post reflect the initial attempt, then the repeated failed retries?
    Just looking for confirmation that I'm correct (or not) in what I'm seeing.

    Thursday, 7 June 2018 3:11 PM
  • That's what I suspected. Thank you for verifying!
    Friday, 8 June 2018 1:36 PM