Exit code 0

    Question

  • Hi there,

    we are trying to set up a testing environment with a Windows HPC 2008 cluster consisting of three compute nodes (for now) and a head node. All nodes have the evaluation licence key.
    All nodes are shown in HPC Manager/Node Management as Online with Health OK.
    As a test program we wanted to use batchpi.exe from the files section, so we created a new job.
    Job Details: Name = TEST1, Template = Default, Project = TEST, Priority = Normal, RunOptions = nothing checked,
    Resources = Node/Auto/Auto
    Task List: My Task, c:\batchpi.exe 1000000000, all folders are set to c:\
    Resource Selection: Run this job only on nodes 1, 2, 3, not the head-node
    <Submit>
    The job starts on the nodes and ends a second later with a failed job message: Task failed during execution with exit code 0. Please check task's output for error details.

    From the command line (cmd) it runs without problems. We also tried running the job with mpiexec -n * c:\batchpi.exe 1000000000 and copied batchpi.exe to the compute nodes, but nothing helped :(

    What are we doing wrong? Please help.

    Regards
    Volkie
    Monday, November 23, 2009 12:51

All replies

  • It's hard to tell from your description exactly what your job looks like, but let me see if I can help. You should have one task, and its command line should be "mpiexec C:\batchpi.exe 1000000000" (I'm assuming that you have copied batchpi.exe to the C:\ directory on all of the nodes). You should also make sure that your task's Min and Max are set to the number of cores that you want to use.

    Does anything show up in your output or error files?

    If none of that helps, please post the XML for your job up here as it may give us some insights into the problem.

    Thanks,
    Josh


    -Josh
    Monday, November 23, 2009 20:26
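
The points in the reply above (one task, an mpiexec command line, sensible Min and Max cores) can be checked against an exported job XML. The following is only a minimal sketch in Python: the function name `summarize_tasks` is hypothetical, and the namespace and the attribute names `CommandLine`, `MinCores`, `MaxCores`, and `Name` are assumed to match the job XML posted later in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace as it appears in the job XML posted in this thread.
NS = {"hpc": "http://schemas.microsoft.com/HPCS2008/scheduler/"}


def summarize_tasks(job_xml):
    """Return one dict per task with the settings worth double-checking."""
    root = ET.fromstring(job_xml)
    summary = []
    for task in root.iterfind(".//hpc:Task", NS):
        command = task.get("CommandLine", "")
        words = command.split()
        # Compare only the first word, case-insensitively.
        first_word = words[0].lower() if words else ""
        summary.append({
            "name": task.get("Name", ""),
            "command": command,
            "uses_mpiexec": first_word in ("mpiexec", "mpiexec.exe"),
            "min_cores": int(task.get("MinCores", "0")),
            "max_cores": int(task.get("MaxCores", "0")),
        })
    return summary
```

A job with exactly one entry in the returned list, `uses_mpiexec` set to true, and core counts matching the intended run would be consistent with the advice above; it does not prove the job will succeed.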
  • Hello again,

    it doesn't even work if I submit a job that should run only on the head node. Please have a look at the XML file:

    <?xml version="1.0" encoding="utf-8"?>
    <Job Version="2.000" Id="22" Name="TEST16" SubmitTime="24.11.2009 09:35:45" CreateTime="24.11.2009 09:34:53" StartTime="24.11.2009 09:35:45" EndTime="24.11.2009 09:35:46" ChangeTime="24.11.2009 09:34:53" UnitType="Core" MinCores="1" MaxCores="12" MinSockets="1" MaxSockets="1" MinNodes="1" MaxNodes="1" RunUntilCanceled="false" IsExclusive="true" ErrorCode="-2147218980" ErrorParams="22.1" State="Failed" PreviousState="Running" UserName="SCHWEINECLUSTER\Administrator" JobType="Batch" Priority="Normal" RequiredNodes="" IsBackfill="false" NextTaskNiceID="2" HasGrown="false" HasShrunk="false" OrderBy="" TaskLevelUpdateTime="24.11.2009 09:35:45" MinMaxUpdateTime="24.11.2009 09:35:45" ComputedMinCores="1" ComputedMaxCores="12" RequestCancel="None" RequeueCount="0" AutoRequeueCount="0" FailureReason="None" PendingReason="None" AutoCalculateMax="false" AutoCalculateMin="false" ParentJobId="0" ChildJobId="0" NumberOfCalls="0" NumberOfOutstandingCalls="0" CallDuration="0" CallsPerSecond="0" FailOnTaskFailure="false" Preemptable="true" ProjectId="14" JobTemplateId="1" OwnerId="3" ClientSourceId="3" Project="TEST" JobTemplate="Default" DefaultTaskGroupId="23" Owner="SCHWEINECLUSTER\Administrator" ClientSource="HpcClusterManager" xmlns="http://schemas.microsoft.com/HPCS2008/scheduler/">
        <Dependencies />
        <Tasks>
            <Task Version="2.000" Id="30" SubmitTime="24.11.2009 09:35:45" CreateTime="24.11.2009 09:35:45" StartTime="24.11.2009 09:35:46" EndTime="24.11.2009 09:35:46" ChangeTime="24.11.2009 09:35:45" ErrorCode="-2147218979" ErrorParams="0" State="Failed" PreviousState="Finishing" ParentJobId="22" RequestCancel="None" Closed="false" RequeueCount="3" AutoRequeueCount="3" FailureReason="ResourceFailure" PendingReason="None" InstanceId="0" Output="" RecordId="30" Name="MyTask" MinCores="1" MaxCores="12" MinSockets="1" MaxSockets="1" MinNodes="1" MaxNodes="1" NiceId="1" CommandLine="mpiexec.exe C:\batchpi.exe 1000000000 > batchpi.txt" WorkDirectory="C:\" StdOutFilePath="C:\" StdInFilePath="C:\" StdErrFilePath="C:\" HasCustomProps="false" IsParametric="false" GroupId="23" ParentJobState="Failed" UnitType="Core" ParametricRunningCount="0" ParametricCanceledCount="0" ParametricFailedCount="0" ParametricQueuedCount="0" />
        </Tasks>
    </Job>
    I inserted the > after the batchpi command so that I can see whether it is executed, and it isn't: it does not create the txt file, so batchpi is not executed after submitting the job. Could it be a problem if the head node is the domain controller? Something about permissions or user accounts?

    Thanks
    Volkie
    Tuesday, November 24, 2009 10:14
  • Hi Volkie,

    In the job XML you provided above, it specifies the following:

    StdOutFilePath="C:\" StdInFilePath="C:\" StdErrFilePath="C:\"
    However, the StdOutFilePath, StdInFilePath, and StdErrFilePath attributes should each specify the full path to a file, not the path to a directory. The following might work better for your job XML:
    StdOutFilePath="C:\batchpi.stdout.txt" StdInFilePath="" StdErrFilePath="C:\batchpi.stderr.txt"
    Regards,

    Patrick
    Wednesday, December 9, 2009 01:21
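
The mistake described in the reply above (a directory where a file path is expected) is easy to overlook in a long attribute list, so it can help to scan the exported job XML for it. The following is a minimal sketch in Python, not an official tool: the function name `find_directory_paths` is hypothetical, the namespace is copied from the XML in this thread, and "looks like a directory" is approximated as "has no final file-name component".

```python
import ntpath
import xml.etree.ElementTree as ET

# Namespace taken from the job XML posted in this thread.
NS = {"hpc": "http://schemas.microsoft.com/HPCS2008/scheduler/"}
STD_ATTRS = ("StdOutFilePath", "StdInFilePath", "StdErrFilePath")


def find_directory_paths(job_xml):
    """Return (task name, attribute, value) for std paths that name no file."""
    root = ET.fromstring(job_xml)
    problems = []
    for task in root.iterfind(".//hpc:Task", NS):
        for attr in STD_ATTRS:
            value = task.get(attr, "")
            # An empty value is treated as "not set". A non-empty value with
            # no final file-name component (e.g. "C:\") looks like a directory.
            if value and ntpath.basename(value) == "":
                problems.append((task.get("Name", ""), attr, value))
    return problems


if __name__ == "__main__":
    sample = (
        '<Job xmlns="http://schemas.microsoft.com/HPCS2008/scheduler/"><Tasks>'
        '<Task Name="MyTask" StdOutFilePath="C:\\" StdInFilePath="" '
        'StdErrFilePath="C:\\batchpi.stderr.txt" /></Tasks></Job>'
    )
    for name, attr, value in find_directory_paths(sample):
        print(f"{name}: {attr}={value!r} has no file name")
```

`ntpath` is used instead of `os.path` so that Windows path rules apply even when the check runs on another operating system. The check is purely textual: a path such as `C:\logs` without a trailing backslash would not be flagged, even if it is a directory on the node.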
  • Hi Patrick,

    that was the problem, thank you very much. Maybe I should start opening my eyes from time to time :)
    Of the three compute nodes, one just gave up and died: the system simply froze. The rest have been
    computing at 100% CPU usage for 4 days now. So it's working! WEEEEEEE :)

    Regards,

    Volkie
    Tuesday, December 15, 2009 11:56