distributing binaries onto nodes

  • Question

  •  

    Hello,

     

    mpiexec -machinefile machines.list -n 4 prog.exe

     

    doesn't work until I manually copy the binary onto each node and specify the working directory where it is located with the

     

    -gwdir

     

    parameter. The machine file contains the list of computer names I want to use. What am I doing wrong, so that the nodes and the master (where mpiexec is run) exchange no information apart from the MPI messages themselves? Surely I shouldn't have to copy all the files onto every node manually? Why isn't the binary distributed automatically? Should I set up some tmp directory, variable or path?

     

    Thank you for some help

     

    Martin

     

    ps: I am a Windows newbie, please be simple and "comprehensible".

    Monday, January 14, 2008 2:35 PM

Answers

  • Hi Martin,

     

    There is a "logfiles" folder within the Compute Cluster Pack install directory.  

     

    You might also specify a /stderr: parameter to the Job Submit command and perhaps collect additional status information.
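    For example, a stderr redirection added to the submit line might look like this (the error.txt file name is hypothetical, using the same share as the other redirections):

    ```
    /stderr:\\master\cluster_share\error.txt
    ```

    Anything the tasks write to standard error would then land in that file on the share.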

     

    Try inserting the escape character (^) within the embedded environment variables as follows:

     

    job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt "^%I_MPI_ROOT^%\em64t\bin\mpiexec.exe" -hosts ^%CCP_NODES^% \\master\cluster_share\prog.exe

     

    See the "Compute Cluster User's Guide" installed with the Compute Cluster Pack and the "Use Environment Variables" topic.

     

    Let us know if that doesn't work!

     

    Phil

    Friday, January 18, 2008 12:54 AM

All replies

  • Hi Martin,

     

    You'll want to use the "working directory" argument and specify a UNC path to the file share.  

     

    Assuming that you are submitting your MPI job via the Compute Cluster Server "Job Submission Console", the working directory environment variable is added automatically to the mpiexec command line.   Note the "working directory" entry within the task configuration dialog window.

     

    The resultant mpiexec command line becomes (only showing the added wdir argument):

     

    mpiexec -wdir \\MyHeadNode\c$\MySharedFolder

     

    More info here:  http://technet2.microsoft.com/windowsserver/en/library/7876c216-b704-473c-b97f-e8a15c67551b1033.mspx?mfr=true
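    Putting this together with the command from the original question, a full invocation might look like the following sketch (using the \\master\cluster_share share that appears later in this thread):

    ```
    mpiexec -wdir \\master\cluster_share -machinefile machines.list -n 4 \\master\cluster_share\prog.exe
    ```

    With both the working directory and the binary given as UNC paths, nothing has to be copied onto the individual nodes.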

     

    Hope that helps...

     

    Phil

     

     

    Wednesday, January 16, 2008 1:31 AM
  • Thanks a lot, Phil!

     

    It helped me to run the code on the cluster.

     

    Now I face another problem: running jobs with mpiexec alone works well, but when I try to use the WCCS "job" command, the job fails. With the Intel Cluster Toolkit mpiexec it fails immediately, and with WCCS's mpiexec it also fails. I've read the WCCS job help and ran it accordingly. Do I need to define or start some extra scheduler or job manager?

     

    prog.exe is compiled with the Intel C++ compiler,

    and the MPI library is from Intel. I_MPI_ROOT is set to c:\program files (x86)\intel\ictce\3.1\mpi\3.1

    C:\cluster_share is shared across the nodes.

     

    C:\cluster_share>job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %CCP_NODES% \\master\cluster_share\prog.exe


    Job has been submitted. ID: 731.

    C:\cluster_share>job view 731
    Job ID               : 731
    Status               : Failed
    Name                 : CLUSTER\Administrator:Jan 15 2008 11:20AM
    Submitted by         : CLUSTER\Administrator
    Number of processors : 2-2
    Allocated nodes      : MASTER
    Submit time          : 1/15/2008 11:20:48 AM
    Start time           : 1/15/2008 11:20:48 AM
    End time             : 1/15/2008 11:20:49 AM
    Number of tasks      : 1
        Notsubmitted     : 0
        Queued           : 0
        Running          : 0
        Finished         : 0
        Failed           : 1
        Cancelled        : 0

    Surprisingly, the allocated nodes don't include NODE1; only the first machine is ever listed. Omitting the /askednodes parameter and specifying the machines in mpiexec with e.g. -machinefile list.txt doesn't help either.

     

    Do you have any idea why? Is there a more detailed error log somewhere that explains why it fails?

     

    Best regards,

    Martin

     

    Wednesday, January 16, 2008 8:20 AM
  • Martin, do you have access to the cluster nodes, i.e. can you remote desktop into them?  If so, one of the first things I do when a job fails to run is remote desktop into one of the compute nodes, and run the job manually via a command window.  Type the command line as you specified it in the job, and see what happens.

     

    Once I know the command works, then any job problems are submission-oriented, and not related to the program itself.
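    For example, a quick sanity check from a command window on the compute node might be (paths taken from earlier in this thread; process count reduced to run on one machine):

    ```
    "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -n 2 \\master\cluster_share\prog.exe
    ```

    If that runs cleanly, the failure lies in the job submission (environment variable escaping, node allocation) rather than in prog.exe itself.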

     

    Cheers,

     

      - joe

     

    Tuesday, March 4, 2008 1:49 AM