We are seeing intermittent job failures and are looking for troubleshooting strategies to identify the cause. The jobs are identical test jobs that do some fairly simple computations using MATLAB. During a 12-hour test run, more than 2000 will succeed but 6 to 10 will fail.
There are no errors in the Windows application or system logs. MATLAB indicates that it was unable to access or create the job directory on the head node, but there is plenty of free space. The task view just gives a failed status and an exit code of 128.
Is there a good way to chase this down? Can we increase logging somewhere to get more details?