Troubleshooting job failures

  • Question

  • We are seeing intermittent job failures and are looking for troubleshooting strategies to identify the cause.  The jobs are identical test jobs that run some fairly simple computations in MATLAB.  During a 12-hour test run, more than 2000 jobs succeed but 6 – 10 fail.

    There are no errors in the Windows application or system logs.  MATLAB indicates that it was unable to access or create the job directory on the head node, but there is plenty of free space.  The task view shows only a failed status and an exit code of 128.

    Is there a good way to chase this down?  Can we increase logging somewhere to get more details?

    Monday, August 25, 2008 10:45 PM

Answers

  • Basically, what is happening is that the scheduler is successfully starting the job, but the MATLAB executable is then returning exit code 128.

    You said MATLAB was unable to access or create a job directory.  Where are you seeing this error?  This could be due to a number of things, including intermittent network failures or file share/network connection limits on the head node (HN), among others.

    Because the failure is in MATLAB, you'll need to consult your MATLAB documentation for details on how to increase MATLAB logging; perhaps someone else on the forum has some experience with this?
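
    One way to gather more evidence independently of MATLAB's own logging is to launch each task through a small wrapper that records whether the job directory was reachable and what exit code the command returned. The Python sketch below is only an illustration under assumptions: `probe_directory` and `run_with_diagnostics` are hypothetical helper names, not part of the scheduler or of MATLAB, and the job directory, log path, and command are placeholders you would supply. The log should live on the compute node's local disk so that logging does not depend on the share being tested.

```python
import os
import socket
import subprocess
import sys
import time
import uuid


def probe_directory(path, attempts=3, delay=1.0):
    """Try to create and delete a small file in `path`.

    Returns (ok, errors), where `errors` lists the message from each
    failed attempt. Hypothetical helper, for illustration only.
    """
    errors = []
    for attempt in range(1, attempts + 1):
        probe = os.path.join(path, "probe_%s.tmp" % uuid.uuid4().hex)
        try:
            with open(probe, "w") as handle:
                handle.write("probe")
            os.remove(probe)
            return True, errors
        except OSError as exc:
            errors.append("attempt %d: %s" % (attempt, exc))
            if attempt < attempts:
                time.sleep(delay)
    return False, errors


def run_with_diagnostics(command, job_dir, log_path):
    """Run `command` and append diagnostics to `log_path`.

    Logs a directory probe before the run, the exit code, and, on a
    non-zero exit, a second probe plus the captured output.
    """
    host = socket.gethostname()

    def log(message):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as handle:
            handle.write("%s %s %s\n" % (stamp, host, message))

    ok, errors = probe_directory(job_dir)
    log("pre-run probe of %s: %s %s" % (job_dir, "ok" if ok else "FAILED", errors))

    # Merge stderr into stdout so the log keeps the output in order.
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    log("exit code %d" % result.returncode)

    if result.returncode != 0:
        ok, errors = probe_directory(job_dir)
        log("post-failure probe of %s: %s %s" % (job_dir, "ok" if ok else "FAILED", errors))
        log("captured output: %s" % result.stdout)

    # Hand the original exit code back so the scheduler still sees it.
    return result.returncode


if __name__ == "__main__":
    # Usage (placeholders): wrapper.py <job_dir> <local_log_path> <command> [args...]
    sys.exit(run_with_diagnostics(sys.argv[3:], sys.argv[1], sys.argv[2]))
```

    Comparing the timestamps and host names of failed probes across nodes may show whether the failures cluster on particular machines or at particular times, which would point toward a network or connection-limit problem rather than MATLAB itself.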

    Thanks,
    Josh
    Wednesday, August 27, 2008 5:24 PM
    Moderator