Job stuck in 'Finishing' state

    Question

  • We had a job running a parallel computation on 5 machines, with each machine running its own Task. The job failed partway through because of problems with our network and network storage. Three tasks are 'Failed', but two are stuck in 'Running'. However, the Error Message column shows "Process has exited", and if I try to cancel one of the stuck tasks individually, the message changes to "Canceled by user" but the task stays in the 'Running' state. Additionally, I had requested a 6th node, which was never actually used in the computation. The job fails with the message "Node NODE6 became unreachable. Ensure that all nodes are available and submit the job again."

    The job's overall state is listed as 'Finishing'.

    I've tried cancelling the job, but a job in the 'Finishing' state can't be cancelled. As mentioned above, I've also tried cancelling the individual stuck tasks.

    Furthermore, if I try to add another job in the meantime, I get the following error: "The value for 'Password' is too large for the database. The size should be smaller than 0."

    We're running server version 2.1.1703.0.

    Any ideas?

    TIA

    Thursday, August 12, 2010 8:47 PM

Answers