Help running Linpack on WCCS 2003

  • Question

  • Hello,

    I followed the instructions on porting HPL to Windows Compute Cluster Server 2003 found on WindowsHPC.net. However, when I tried to execute the benchmark, it failed. I'm not sure what the problem was, since the Job Scheduler just shows "Failed" as the job status and gives no further information. I didn't find any log files in C:\Program Files\Microsoft Compute Cluster Pack\LogFiles, and there was no output in either the standard output or the standard error file.


    When I tried typing "task view #n" on the command line, I got a little more information, though nothing really helpful (pasted below):

    C:\scratch>task view 63
    Task ID              : 63.1
    Status               : Failed
    Name                 : hpl
    Command line         : mpiexec -n 6 xhpl.exe
    Allocated nodes      : NODE2 NODE3 NODE1
    Exit code            : 128
    Submit time          : 13/2/2008 14:26:05
    Start time           : 13/2/2008 14:26:05
    End time             : 13/2/2008 14:26:06
    Kernel time          : 0,078
    User time            : 0,015
    Working set          : 9364 KB
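
The output above can also be read mechanically. The following is a minimal Python sketch, assuming the "Key : Value" layout and the day/month/year timestamps shown above; `parse_task_view` is a hypothetical helper, not part of the Compute Cluster Pack. It extracts the exit code and the elapsed time. The task ended about one second after it started, which suggests the process died at startup rather than during the benchmark itself.

```python
from datetime import datetime

# Copied from the "task view 63" output in this post.
SAMPLE = """\
Task ID              : 63.1
Status               : Failed
Name                 : hpl
Command line         : mpiexec -n 6 xhpl.exe
Allocated nodes      : NODE2 NODE3 NODE1
Exit code            : 128
Submit time          : 13/2/2008 14:26:05
Start time           : 13/2/2008 14:26:05
End time             : 13/2/2008 14:26:06
"""


def parse_task_view(text):
    """Collect 'Key : Value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines that are not in 'Key : Value' form
            fields[key.strip()] = value.strip()
    return fields


info = parse_task_view(SAMPLE)
exit_code = int(info["Exit code"])
nodes = info["Allocated nodes"].split()

# Timestamp format is an assumption based on the sample (day/month/year).
fmt = "%d/%m/%Y %H:%M:%S"
elapsed = datetime.strptime(info["End time"], fmt) - datetime.strptime(
    info["Start time"], fmt
)

print(f"status={info['Status']} exit_code={exit_code} nodes={nodes}")
print(f"ran for {elapsed.total_seconds():.0f} s")
```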


    I searched the web for information about this error without any success.

    I tried submitting the job both from the command line and from the Compute Cluster Job Manager, but the Job Manager only reports that the job failed, without a single detail about the reason. I saved a template for the job with the Job Scheduler Console, and it looks like this:

     

    <?xml version="1.0" encoding="utf-8"?>
    <Job xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" SoftwareLicense="" MaximumNumberOfProcessors="6" MinimumNumberOfProcessors="6" Runtime="Infinite" IsExclusive="true" Priority="Normal" Name="HPL" Project="" RunUntilCanceled="false">
    <Tasks xmlns="http://www.microsoft.com/ComputeCluster/">
    <Task MaximumNumberOfProcessors="6" MinimumNumberOfProcessors="6" Depend="" WorkDirectory="\\headnode\scratch\" Stdout="hpl.out" Stderr="hpl.err" Name="hpl" CommandLine="mpiexec -np 6 xhpl.exe" IsExclusive="false" IsRerunnable="true" Runtime="Infinite">
    <EnvironmentVariables />
    </Task>
    </Tasks>
    </Job>
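
To know exactly which files to inspect after a failure, the template can be parsed for each task's command line and output files. This is a minimal Python sketch using a trimmed copy of the template above. It assumes that relative `Stdout` and `Stderr` names are resolved against `WorkDirectory`; `task_output_paths` is a hypothetical helper, not a Compute Cluster Pack API.

```python
import ntpath
import xml.etree.ElementTree as ET

# Namespace declared on the <Tasks> element of the saved template.
CCS_NS = {"ccs": "http://www.microsoft.com/ComputeCluster/"}

# Trimmed copy of the job template from this post.
JOB_XML = r'''<?xml version="1.0" encoding="utf-8"?>
<Job Name="HPL" MaximumNumberOfProcessors="6" MinimumNumberOfProcessors="6">
<Tasks xmlns="http://www.microsoft.com/ComputeCluster/">
<Task WorkDirectory="\\headnode\scratch\" Stdout="hpl.out" Stderr="hpl.err" Name="hpl" CommandLine="mpiexec -np 6 xhpl.exe">
<EnvironmentVariables />
</Task>
</Tasks>
</Job>'''


def task_output_paths(job_xml):
    """Map each task name to its command line and output file paths."""
    root = ET.fromstring(job_xml.encode("utf-8"))
    paths = {}
    for task in root.findall("ccs:Tasks/ccs:Task", CCS_NS):
        workdir = task.get("WorkDirectory", "")
        paths[task.get("Name")] = {
            "command": task.get("CommandLine"),
            # Assumption: relative names are relative to WorkDirectory.
            "stdout": ntpath.join(workdir, task.get("Stdout", "")),
            "stderr": ntpath.join(workdir, task.get("Stderr", "")),
        }
    return paths


paths = task_output_paths(JOB_XML)
for name, entry in paths.items():
    print(name, entry["command"], entry["stdout"], entry["stderr"])
```

`ntpath` is used instead of `os.path` so that the UNC paths are joined with Windows rules even when the script runs on another platform.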

    I know the cluster is running properly, since all the nodes show as ready on the monitor, I have already run the sample MPI program "BatchPI" on it (also found on windowshpc.net), and commands like clusrun all work.

    I'd really appreciate it if you could give me some advice. :-)

    Thanks,

    Danilo

    Wednesday, February 13, 2008 4:32 PM

Answers

  • You should run “Task view 63.1” to view the details for the task, which will include the error message for the task.  The error message will probably be “Task failed during execution” or the like, since it has a non-zero exit code.  What you really need to do is figure out why your application returned exit code 128 and what that means.  You should look at the StdOut/StdErr for your task as the most likely source of information.

     

    Feel free to post any of that output back up here and we'll try to help you debug.

     

    Thanks!
    Josh

    Friday, February 15, 2008 8:17 PM
    Moderator
  • Hi, Josh

     

    My biggest headache was finding no information about the error. There was nothing in the CCP logs, nor in the standard error or standard output files. However, after exchanging e-mails with PhilPenn, I was able to find the problem.

    It was a rather silly mistake: I had forgotten to install MKL on all compute nodes, so there was no way Linpack would run.

    I was a little misled because Linpack did run on the headnode when I ran it manually with "smpd -d" and mpiexec.

     

    Anyway, thanks for your attention.

     

    Monday, February 18, 2008 6:14 PM
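
A check like the one that was missing here (is a required runtime file present on every node?) can be scripted before submitting a job. The sketch below is a hypothetical illustration in Python: `nodes_missing_file` and the file name `runtime.dll` are made up for the example, and temporary directories stand in for the per-node locations. On a real cluster, the callable would return wherever the MKL runtime is expected on each node, which this thread does not specify.

```python
import os
import tempfile


def nodes_missing_file(nodes, path_for_node):
    """Return the nodes for which the path from path_for_node(node) is absent."""
    return [node for node in nodes if not os.path.exists(path_for_node(node))]


# Demo: temporary directories stand in for per-node install locations.
NODES = ["NODE1", "NODE2", "NODE3"]
with tempfile.TemporaryDirectory() as root:
    for node in NODES:
        os.makedirs(os.path.join(root, node))
    # Simulate the situation in this thread: the file exists on one machine only.
    open(os.path.join(root, "NODE1", "runtime.dll"), "w").close()

    demo_missing = nodes_missing_file(
        NODES, lambda node: os.path.join(root, node, "runtime.dll")
    )

print("missing on:", demo_missing)
```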

All replies

  • Do you think there is something we could have done differently in Compute Cluster Server to help you identify this problem? It's a little tricky, to be sure, but any feedback would be appreciated.

     

    Thank you,

    ryan

     

    Tuesday, February 19, 2008 9:44 PM
  • Yes, Ryan, thanks for your attention.

     

    First of all, I was a little lost because the only information I had was "The job submission failed".

    It made me think the error was in the job scheduling, so I lost a lot of time trying different kinds of submissions.

     

    Where are the error messages? Where are the log files? I searched for them, but the error file and output file were empty. Then I looked for the Compute Cluster Pack logs, but there weren't any either.

     

    I don't know if that can be changed easily, or if the problem was really that the job wasn't submitted (I think it was an execution problem, wasn't it?), but I found it very hard to debug. Worst of all, I'm not really used to Windows Server, so I didn't know where to look for the system logs.

     

    So, if there were a little more information about the errors, I think it would be a lot easier to solve this kind of problem.

     

    Thanks,

    Danilo

    Wednesday, February 20, 2008 5:50 PM
  • Good feedback, thank you. We'll look at what we can do to improve failure diagnosis in our next release.

    Wednesday, February 20, 2008 6:15 PM
  • Hi,

    I am having an issue running Linpack on Windows HPC Server 2008.

    I was able to compile Linpack using the instructions at the link below:

    http://code.msdn.microsoft.com/How2BuildLinpack/Release/ProjectReleases.aspx?ReleaseId=1439

    When we run Linpack on a development workstation using the HPC Server Job Scheduler (mpiexec -n 2 \\192.168.1.6\Dump\Input\xhpl.exe), the following error appears in the error file:

    Attempting to use an MPI routine before initializing MPICH

    Attempting to use an MPI routine before initializing MPICH

    Could you please direct us to the correct contact at Microsoft?



    Regards

    Deepak

    Tuesday, November 18, 2008 9:52 AM