We have HPC R2 SP1 running. The application we run on the cluster creates and submits jobs by generating an XML job description file and submitting it to the scheduler. Occasionally some tasks fail with no error information from the job itself and nothing in the compute node or head node logs to indicate why. The same node that failed a task may run other tasks in the same job successfully, so out of, say, 1000 tasks a single failure can cause the entire job to fail.

Is there any way, through the command line or job templates, to configure the cluster so that it automatically retries a task once or twice before marking it as failed?
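For context, the submission step is essentially the standard HPC job CLI pointed at the generated job description file (the file path and scheduler name below are just placeholders for illustration):

    rem submit the job defined in the generated XML job description file
    job submit /jobfile:C:\jobs\myjob.xml /scheduler:HEADNODE
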
Thank you.