Requeue a task after failure

  • Question

  • We are running HPC R2 SP1. The application we run on the cluster creates a job and submits it to the cluster by generating and submitting an XML file. Occasionally some tasks fail, with no information or error handling from the job itself and nothing in the compute node or head node logs to explain the failure. The same node that failed a task may run other tasks in the same job successfully. So out of, say, 1000 tasks, a single failure can cause the entire job to fail. Is there any way, through the command line or job templates, to configure the cluster so that it automatically retries a task once or twice before marking it as failed?

    Thank you.

    Friday, July 1, 2011 3:23 PM