14 January 2008 14:35
mpiexec -machinefile machines.list -n 4 prog.exe
doesn't work until I manually copy the binary onto each node and pass the working directory where it is located as a parameter. machines.list contains the list of computer names I want to use. What am I doing wrong? The nodes and the master (where mpiexec is run) don't exchange anything other than the MPI messages themselves. Surely I shouldn't have to copy all the files onto each node manually? Why isn't the binary distributed automatically? Should I set up some temp directory, environment variable, or path?
Thanks in advance for any help.
PS: I am a Windows newbie, so please keep it simple and "comprehensible".
16 January 2008 1:31 (Owner)
You'll want to use the "working directory" argument and specify a UNC path to the file share.
Assuming that you are submitting your MPI job via the Compute Cluster Server "Job Submission Console", the working directory environment variable is added automatically to the mpiexec command line. Note the "working directory" entry within the task configuration dialog window.
The resultant mpiexec command line becomes (only showing the added wdir argument):
mpiexec -wdir \\MyHeadNode\c$\MySharedFolder
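Putting that together with the command from the original question, the full invocation might look like this (a sketch only: \\MyHeadNode\c$\MySharedFolder is the example share from above, and machines.list and prog.exe are the files from the question):

```shell
:: sketch: -wdir points every rank at the shared UNC folder,
:: so the binary does not need to be copied to each node
mpiexec -wdir \\MyHeadNode\c$\MySharedFolder -machinefile machines.list -n 4 \\MyHeadNode\c$\MySharedFolder\prog.exe
```

The key idea is that both the working directory and the executable path are UNC paths visible from every node.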
Hope that helps...
16 January 2008 8:20
Thanks a lot, Phil!
It helped me to run the code on the cluster.
Now I face another problem: running jobs with plain mpiexec works well, but when I try the WCCS "job" utility, the job fails. It fails immediately whether I use the Intel Cluster Toolkit mpiexec or WCCS's own mpiexec. I've read the WCCS job help and ran it accordingly. Do I need to define or start some extra scheduler or job manager?
prog.exe is compiled with the Intel C++ compiler against the Intel MPI library. I_MPI_ROOT is set to c:\program files (x86)\intel\ictce\3.1\mpi\3.1
C:\cluster_share is shared over the nodes.
C:\cluster_share>job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt "%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -hosts %CCP_NODES% \\master\cluster_share\prog.exe
Job has been submitted. ID: 731.
C:\cluster_share>job view 731
Job ID : 731
Status : Failed
Name : CLUSTER\Administrator:Jan 15 2008 11:20AM
Submitted by : CLUSTER\Administrator
Number of processors : 2-2
Allocated nodes : MASTER
Submit time : 1/15/2008 11:20:48 AM
Start time : 1/15/2008 11:20:48 AM
End time : 1/15/2008 11:20:49 AM
Number of tasks : 1
Notsubmitted : 0
Queued : 0
Running : 0
Finished : 0
Failed : 1
Cancelled : 0
Surprisingly, the allocated nodes don't include NODE1; only the first machine is ever listed. Omitting the /askednodes parameter and specifying the machines in mpiexec (e.g. with -machinefile list.txt) doesn't help either.
Do you have any idea why? Is there a more detailed error log somewhere that explains why it fails?
18 January 2008 0:54 (Owner)
There is a "logfiles" folder within the Compute Cluster Pack install directory.
You might also specify a /stderr: parameter to the Job Submit command and perhaps collect additional status information.
Try inserting the escape character (^) within the embedded environment variables as follows:
job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt "^%I_MPI_ROOT^%\em64t\bin\mpiexec.exe" -hosts ^%CCP_NODES^% \\master\cluster_share\prog.exe
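Combining the escaped variables with the /stderr: suggestion, the submission could look like the following sketch (error.txt is just a hypothetical file name on the existing share):

```shell
:: sketch: same command as in the thread, with /stderr: added so the
:: failing task's error output lands on the shared folder
job submit /numprocessors:2 /askednodes:master,node1 /stdin:\\master\cluster_share\input.txt /stdout:\\master\cluster_share\output.txt /stderr:\\master\cluster_share\error.txt "^%I_MPI_ROOT^%\em64t\bin\mpiexec.exe" -hosts ^%CCP_NODES^% \\master\cluster_share\prog.exe
```

After the job fails, error.txt should contain whatever mpiexec or prog.exe wrote to standard error.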
See the "Compute Cluster User's Guide" installed with the Compute Cluster Pack and the "Use Environment Variables" topic.
Let us know if that doesn't work!
- Marked as answer by Don Pattee (Moderator), 25 June 2009 22:37
04 March 2008 1:49
Martin, do you have access to the cluster nodes, i.e. can you remote desktop into them? If so, one of the first things I do when a job fails to run is remote desktop into one of the compute nodes, and run the job manually via a command window. Type the command line as you specified it in the job, and see what happens.
Once I know the command works, then any job problems are submission-oriented, and not related to the program itself.
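For instance, reusing the paths already given in the thread, a manual test on a compute node might look like this sketch:

```shell
:: run the task's command line by hand in a command window on a compute node
"%I_MPI_ROOT%\em64t\bin\mpiexec.exe" -n 2 \\master\cluster_share\prog.exe < \\master\cluster_share\input.txt
:: check the exit code; a nonzero value points at the program or its
:: environment rather than at the job scheduler
echo %ERRORLEVEL%
```

If this runs cleanly but the scheduled job still fails, the problem is in the job submission (credentials, environment variables, node allocation) rather than in prog.exe itself.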