Immediate Job Failure with HPC Job Manager

  • Question

  • I'm using Windows HPC Pack 2008 with the cluster running HPC Server 2008.

    When I try to submit jobs using HPC Job Manager 2008, they fail immediately with no reason given.

    I have tried both a parametric non-MPI task and a normal MPI task, to no avail.  The problem I think I am having (but then again, I have no clue) is that I don't understand what the working directory represents.  I read that this directory needs to be a Universal Naming Convention (UNC) path, which makes me believe that the directory I give it needs to be shared with the cluster.  All I have been putting in the working directory so far is the path to where my executables are, which may or may not be correct.  Once again, I don't know if this is the trouble I am having; this is just my guess.

    Any help is appreciated,

    Thanks
    Monday, June 1, 2009 7:52 PM

Answers

  • AJ,
    If your jobs are failing, you should be able to see an error message.  Try double-clicking the failed job in the UI and looking at the "Results" page.

    Your Working Directory is the path where your job will be started.  One way to think of it is that on the compute node where your job runs, the system will do "cd <your working directory>" and then "cmd.exe /c <your command line>".  So the working directory path needs to be something that is accessible from any compute node where your job will run.

    Some examples would be:
    C:\Program Files\MyApp\ - This would use the local directory on each compute node, and thereby assumes that your command line, input, and output file paths are available relative to this path on all the machines (i.e., this would be a good choice if your application is locally installed on every machine in the cluster)
    \\someserver\someshare\ - This would connect to a share, and is an excellent choice if your applications or data files are stored on a file server

    Working Directory is always optional . . . by default the system will use %USERPROFILE% (C:\Users\Username).  This is great if your input and output files are given as full paths, or if your application is in the PATH on each machine.
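
To make the two example paths above concrete, here is a minimal Python sketch. It is not part of HPC Pack: the function names `classify_working_directory` and `describe_launch` are hypothetical, and the "cwd plus cmd.exe /c" model is only the way of thinking about it described above, not the scheduler's actual implementation. It uses the standard `ntpath` module, so it applies Windows path rules on any operating system.

```python
import ntpath  # Windows path rules, importable on any OS


def classify_working_directory(path):
    """Return 'unc', 'local', or 'relative' for a Windows-style path.

    'unc'      -> \\\\server\\share\\..., reachable from any node that can see the share
    'local'    -> has a drive letter, so it must exist on every compute node
    'relative' -> no drive or share; resolved against wherever the task starts
    """
    drive, _rest = ntpath.splitdrive(path)
    drive = drive.replace("/", "\\")  # treat forward slashes like backslashes
    if drive.startswith("\\\\"):
        return "unc"
    if drive:
        return "local"
    return "relative"


def describe_launch(working_directory, command_line):
    """Model the mental picture: start in the directory, then run cmd.exe /c."""
    return {
        "cwd": working_directory,
        "argv": ["cmd.exe", "/c", command_line],
    }


if __name__ == "__main__":
    for wd in ("C:\\Program Files\\MyApp\\", "\\\\someserver\\someshare\\"):
        print(wd, "->", classify_working_directory(wd))
```

A "local" result is a reminder to check that the same folder exists on every node, while a "unc" result means the share permissions need to allow the account the job runs under.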

    Thanks,
    Josh

    • Proposed as answer by Josh Barnard (Moderator) Monday, June 1, 2009 8:54 PM
    • Unproposed as answer by AJ Fret Tuesday, June 2, 2009 5:40 PM
    • Marked as answer by AJ Fret Tuesday, June 2, 2009 5:40 PM
    Monday, June 1, 2009 8:54 PM
    Moderator

All replies

  • Alright,

    You helped me fix my previous problem, but I have another that is probably along the same lines.  I'm currently getting the error...

     "Error (14001) The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log for more detail."

    After some digging, it seems this error can occur when I'm not using the release build of my code and the required DLLs are not tied to it. (Once again, I don't know if this is the problem.)

    If this is the problem, how do I go about fixing it? And if it is not, any other help is appreciated.

    Thanks again,

    AJ
    Tuesday, June 2, 2009 5:40 PM