Shouldn't there be a distinction between a JobState of Failed and a non-zero exit code?

  • General discussion

  • I had expected a JobState of Failed to indicate an error with the job itself, meaning that in its current state it would never succeed because of some sort of environment/infrastructure issue (e.g. permissions, invalid stdout filename).

     

    I hadn't expected a JobState of Failed if the Job's task succeeded but returned a non-zero exit code.

     

    Andy

    Monday, April 14, 2008 12:17 PM

All replies

  •  

    Andy,

    We've had this issue brought up a few times before.  The current implementation is designed to bubble up any problems as a failure, even though the cause may be different in different cases.  The "failure message" field should provide details on what exactly caused the problem.

     

    We have taken the philosophy of trying to limit the number of states that we have in the system . . . so while I agree that "failed due to infrastructure" vs. "failed due to execution failure" might be interesting to differentiate, it would add some complexity to the system (especially since it's not always possible to tell the difference between those cases).

     

    I've got that issue filed in our issue tracking system, so we'll bring it up for discussion.  But if this were to change, I don't expect it would happen until after HPC2008.

     

    Thanks,
    Josh

    Monday, April 14, 2008 9:06 PM
    Moderator
  •  Barndawgie wrote:

    The current implementation is designed to bubble up any problems as a failure, even though the cause may be different in different cases. 

     

    IMHO there hasn't been a failure.

    The requested command ran successfully to completion; it just returned an ExitCode (which I can check for independently).

     

     Barndawgie wrote:

    The "failure message" field should provide details on what exactly caused the problem.

     

    If the command is 'EXIT 42', which is a valid command but returns an exit code of 42, I get an ErrorMessage of:

     

    Error message: Task failed during execution. Please check task's output for error details.

     

     

    If the command is 'no-such-command.exe' I get exactly the same error, which in a live environment hides the difference between a catastrophic failure (the app wasn't installed) and an exit code indicating some sort of state.
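To make the two cases concrete, here is a minimal sketch in plain Python. It does not use the scheduler at all, and the `classify` helper is a made-up name for illustration. It only shows that, at the process level, "could not be started" and "ran and returned 42" are distinguishable outcomes.

```python
import subprocess
import sys


def classify(argv):
    """Return ("launch_failed", None) or ("ran", exit_code) for a command."""
    try:
        completed = subprocess.run(argv)
    except OSError:
        # The process was never started (missing executable, bad permissions, ...).
        return ("launch_failed", None)
    # The process started and ran to completion; its exit code is just data.
    return ("ran", completed.returncode)


# A command that runs fine but exits with 42 (stand-in for 'EXIT 42').
print(classify([sys.executable, "-c", "import sys; sys.exit(42)"]))
# A command that does not exist.
print(classify(["no-such-command.exe"]))
```

The first call reports `('ran', 42)` and the second `('launch_failed', None)`, which is the distinction I would like to see reflected somewhere other than a shared error message.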

     

     Barndawgie wrote:

    We have taken the philosophy of trying to limit the number of states that we have in the system . . . so while I agree that "failed due to infrastructure" vs. "failed due to execution failure" might be interesting to differentiate, it would add some complexity to the system (especially since it's not always possible to tell the difference between those cases).

    I agree that if I were asking for a new state it would indeed add complexity ... however ... I don't think I am.

    I'm just asking that you perhaps ignore the exit code when deciding whether to indicate a failure.

     

     Barndawgie wrote:

    I've got that issue filed in our issue tracking system, so we'll bring it up for discussion.  But if this were to change, I don't expect it would happen until after HPC2008.

     

    I guess another workaround is in order
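One possible workaround, sketched here in Python as an assumption rather than anything the scheduler provides: submit a small wrapper as the task's command instead of the real command. The wrapper records the real exit code in a file and itself exits with 0 whenever the command actually ran, so only a genuine launch failure would surface as a non-zero exit. The names `run_and_record` and `LAUNCH_FAILURE` and the exit-code file are hypothetical.

```python
import subprocess
import sys

LAUNCH_FAILURE = 1  # hypothetical sentinel; any non-zero value would do


def run_and_record(argv, exit_code_path):
    """Run argv, write its exit code to exit_code_path, and return the
    exit code the wrapper itself should report."""
    try:
        completed = subprocess.run(argv)
    except OSError as exc:
        # The command never started, so let this show up as a real failure.
        print(f"could not launch {argv[0]}: {exc}", file=sys.stderr)
        return LAUNCH_FAILURE
    # The command ran; keep its exit code for the caller to inspect later.
    with open(exit_code_path, "w") as handle:
        handle.write(str(completed.returncode))
    return 0
```

A wrapper script would call `sys.exit(run_and_record(command, path))`, and the client would then read the recorded exit code from the file instead of relying on the task state.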

    Tuesday, April 15, 2008 4:54 PM