Unable to requeue a job: the job goes to the Failed state very quickly.

    Question

  • Hello everyone,

    we are running a Microsoft HPC 2012 R2 cluster.
    We have the problem that we are not able to requeue some jobs.
    When we requeue such a job, it goes into the Failed state very quickly.
    What is the reason for this, and what should we do to get such a job requeued?

    Here are some more details:

    When I look into the action log, we only see the following entries (the requeue was done at 09:22:54):

    06.03.2015 22:30:56  Job Canceled
    07.03.2015 09:22:54  Started
    07.03.2015 09:24:12  Job is being Preempted
    07.03.2015 09:24:13  Job Canceled
    07.03.2015 09:24:18  Started
    07.03.2015 09:24:47  Job is being Preempted
    07.03.2015 09:24:48  Job Canceled
    07.03.2015 09:24:53  Started
    07.03.2015 09:25:22  Job is being Preempted
    07.03.2015 09:25:23  Job Canceled
    07.03.2015 09:25:29  Started
    07.03.2015 09:25:57  Job is being Preempted
    07.03.2015 09:25:58  Job Canceled
    07.03.2015 09:26:07  Started
    07.03.2015 09:26:35  Job is being Preempted
    07.03.2015 09:26:36  Job Canceled
    07.03.2015 09:26:43  Started
    07.03.2015 09:27:11  Job is being Preempted
    07.03.2015 09:27:12  Job Canceled
    07.03.2015 09:27:19  Started
    07.03.2015 09:27:47  Job is being Preempted
    07.03.2015 09:27:47  Job Canceled
    07.03.2015 09:27:54  Started
    07.03.2015 09:28:23  Job is being Preempted
    07.03.2015 09:28:24  Job Canceled
    07.03.2015 09:28:30  Started
    07.03.2015 09:28:59  Job is being Preempted
    07.03.2015 09:29:00  Job Canceled
    07.03.2015 09:29:05  Started
    07.03.2015 09:29:33  Job is being Preempted
    07.03.2015 09:29:34  Job Canceled
    07.03.2015 09:29:43  Started
    07.03.2015 09:30:12  Job is being Preempted
    07.03.2015 09:30:13  Job Canceled
    07.03.2015 09:30:19  Started
    07.03.2015 09:30:49  Job is being Preempted
    07.03.2015 09:30:50  Job Canceled
    07.03.2015 09:30:56  Started
    07.03.2015 09:31:25  Job is being Preempted
    07.03.2015 09:31:27  Job Canceled
    07.03.2015 09:31:34  Started
    07.03.2015 09:32:06  Job is being Preempted
    07.03.2015 09:32:07  Job Canceled
    07.03.2015 09:32:14  Started
    07.03.2015 09:32:44  Job is being Preempted
    07.03.2015 09:32:45  Job Canceled
    07.03.2015 09:32:51  Started
    07.03.2015 09:33:19  Job is being Preempted
    07.03.2015 09:33:21  Job Canceled
    07.03.2015 09:33:28  Started
    07.03.2015 09:33:57  Job is being Preempted
    07.03.2015 09:33:58  Job Canceled
    07.03.2015 09:34:04  Started
    07.03.2015 09:34:33  Job is being Preempted
    07.03.2015 09:34:34  Job Canceled
    07.03.2015 09:34:39  Started
    07.03.2015 09:35:08  Job is being Preempted
    07.03.2015 09:35:09  Job Canceled
    07.03.2015 09:35:15  Started
    07.03.2015 09:35:43  Job is being Preempted
    07.03.2015 09:35:44  Job Canceled
    07.03.2015 09:35:51  Started
    07.03.2015 09:36:20  Job is being Preempted
    07.03.2015 09:36:21  Job Canceled
    07.03.2015 09:36:26  Started
    07.03.2015 09:36:55  Job is being Preempted
    07.03.2015 09:36:56  Job Canceled
    07.03.2015 09:37:05  Started
    07.03.2015 09:37:34  Job is being Preempted
    07.03.2015 09:37:34  Job Canceled
    07.03.2015 09:37:40  Started
    07.03.2015 09:38:09  Job is being Preempted
    07.03.2015 09:38:11  Job Canceled
    07.03.2015 09:38:19  Started
    07.03.2015 09:38:48  Job is being Preempted
    07.03.2015 09:38:49  Job Canceled
    07.03.2015 09:38:55  Started
    07.03.2015 09:39:25  Job is being Preempted
    07.03.2015 09:39:26  Job Canceled
    07.03.2015 09:39:31  Started
    07.03.2015 09:40:00  Job is being Preempted
    07.03.2015 09:40:01  Job Canceled
    07.03.2015 09:40:08  Started
    07.03.2015 09:40:38  Job is being Preempted
    07.03.2015 09:40:39  Job Canceled
    07.03.2015 09:40:46  Started
    07.03.2015 09:41:56  Job Failed
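
    To make the pattern in the log above easier to see, here is a minimal Python sketch that measures how long the job ran before each preemption. It is not part of HPC Pack; the function names parse and run_durations are made up for illustration, and it assumes the log is available as plain text in exactly the format shown above. Only the first three cycles are included as sample data.

```python
from datetime import datetime

# Sample lines copied from the action log above (first three cycles only).
LOG = """\
07.03.2015 09:22:54  Started
07.03.2015 09:24:12  Job is being Preempted
07.03.2015 09:24:13  Job Canceled
07.03.2015 09:24:18  Started
07.03.2015 09:24:47  Job is being Preempted
07.03.2015 09:24:48  Job Canceled
07.03.2015 09:24:53  Started
07.03.2015 09:25:22  Job is being Preempted
07.03.2015 09:25:23  Job Canceled
"""


def parse(log):
    """Turn 'dd.mm.yyyy hh:mm:ss  message' lines into (datetime, message) pairs."""
    events = []
    for line in log.splitlines():
        line = line.strip()
        if not line:
            continue
        date, time, message = line.split(None, 2)
        stamp = datetime.strptime(f"{date} {time}", "%d.%m.%Y %H:%M:%S")
        events.append((stamp, message))
    return events


def run_durations(events):
    """Seconds between each 'Started' entry and the preemption that follows it."""
    durations = []
    started = None
    for stamp, message in events:
        if message == "Started":
            started = stamp
        elif message == "Job is being Preempted" and started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


durations = run_durations(parse(LOG))
print(len(durations), "preemptions, run times in seconds:", durations)
```

    For the sample lines this reports one run of 78 seconds followed by runs of 29 seconds each, which matches what the timestamps show: after the requeue the job is preempted again roughly half a minute after every start.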

    When we look at the detailed history of the job, we find the following error message
    (from "job view <Id> /detailed ..."):

    Id                               : 21092
    Name                             : XXXXXXXXXXXXXXXXXXX
    SubmitTime                       : 06.03.2015 21:07:41
    CreateTime                       : 25.02.2015 14:50:34
    StartTime                        : 07.03.2015 09:40:46
    EndTime                          : 07.03.2015 09:41:56
    ChangeTime                       : 07.03.2015 09:41:56
    UnitType                         : Socket
    MinCores                         : 1
    MaxCores                         : 1
    MinSockets                       : 1
    MaxSockets                       : 1
    MinNodes                         : 1
    MaxNodes                         : 1
    RunUntilCanceled                 : False
    IsExclusive                      : False
    ErrorCode                        : -2147218980
    ErrorParams                      : 21092.3375.2576,21092.3375.2577
    State                            : Failed
    PreviousState                    : Finishing
    JobType                          : Batch
    Priority                         : Normal
    .....
    ErrorMessage                     : Task 21092.3375.2576,21092.3375.2577 failed. Please check the failed task for more details on the failure.

    In the job view of the Job Manager it looks as follows:


    If I click on the error messages marked in blue, they only point to errors that happened before the requeue,
    so they are old, and we currently don't consider them the reason for the current problem.

    If we search Google for the error code "-2147218980", or in hex "0x800409DC", we don't find any very useful information.
    The only hint we found points to the following page, but it gives no real explanation of what this code means:

    https://msdn.microsoft.com/en-us/library/microsoft.hpc.scheduler.properties.errorcode.execution_taskfailure%28v=vs.85%29.aspx
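
    For anyone who wants to check the decimal/hex conversion of the error code, here is a small Python sketch. It assumes the value is a signed 32-bit code laid out like a standard Windows HRESULT (severity in the top bit, facility in bits 16 to 26, code in the low 16 bits); whether HPC Pack intends it to be read that way is an assumption, and the helper names are made up for illustration.

```python
def to_hex32(code: int) -> str:
    """Show a signed 32-bit value as unsigned hex (two's complement)."""
    return f"0x{code & 0xFFFFFFFF:08X}"


def to_signed32(text: str) -> int:
    """Convert a hex string such as '0x800409DC' back to a signed 32-bit int."""
    value = int(text, 16)
    return value - 0x100000000 if value >= 0x80000000 else value


def split_hresult(code: int):
    """Split into (failure bit, facility, code), assuming the HRESULT layout."""
    value = code & 0xFFFFFFFF
    failure = bool(value >> 31)
    facility = (value >> 16) & 0x7FF
    low_code = value & 0xFFFF
    return failure, facility, low_code


print(to_hex32(-2147218980))       # 0x800409DC
print(to_signed32("0x800409DC"))   # -2147218980
print(split_hresult(-2147218980))  # (True, 4, 2524)
```

    Under that assumption, facility 4 is the generic interface-specific facility, so the low part (2524) only has a meaning defined by the component that raised it. That would be consistent with general searches for the number turning up little.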

    When we check the excluded nodes with the command "(Get-HpcJob -JobId <yourJobID>).ExcludedNodes"
     -> there are no excluded nodes.

    What can I do to get this job requeued?
    Any suggestion would be great!

    Thank you very much in advance.
    Best regards,

    Bobby



    • Edited by Bobby013 Saturday, March 07, 2015 11:42 AM Improved Readability
    Saturday, March 07, 2015 10:18 AM

All replies

  • Hi Bobby,

      We need more data to troubleshoot the issue. Could you contact us at hpcpack@Microsoft.com?

      In particular, we would like to see the full output of "job view <jobid> /detailed".

     Qiufang


    Qiufang Shi

    Thursday, March 12, 2015 2:19 AM