Hi,
I'm trying to run an MPI application across several compute nodes of an HPC cluster. I've copied the MPI application locally to each compute node, and I thought I should be able to submit it like this:
job submit /scheduler:my_head_node /jobtemplate:my_job_template /jobname:TransformationEngine-MPI /stdout:c:\temp\stdout.txt /stderr:c:\temp\stderr.txt /numnodes:2 mpiexec -n 1 c:\dev\mpi_job.exe
I would expect the job scheduler to start one instance of mpi_job.exe on each of the allocated compute nodes, or at least on two of them, and I would expect those instances to be able to communicate with each other. However, only one task is started, even though two compute nodes are allocated.
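One thing I haven't tried yet is dropping the -n 1 entirely, so that mpiexec picks up the node allocation from the scheduler on its own. Would something like this (an untested guess on my part) be the intended usage?
job submit /scheduler:my_head_node /jobtemplate:my_job_template /jobname:TransformationEngine-MPI /stdout:c:\temp\stdout.txt /stderr:c:\temp\stderr.txt /numnodes:2 mpiexec c:\dev\mpi_job.exe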
I've also tried to start the job like this:
job new /scheduler:my_head_node /jobtemplate:my_job_template /jobname:TransformationEngine-MPI /numnodes:2
job add my_job_id mpiexec -n 1 c:\dev\mpi_job.exe
job add my_job_id mpiexec -n 1 c:\dev\mpi_job.exe
job submit /id:my_job_id
In this case two MPI tasks are started, but both of them get rank 0 and they don't communicate with each other.
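To illustrate what I mean by the ranks, here is the kind of minimal check I have in mind (a hypothetical stand-in for mpi_job.exe, not my actual application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    /* Rank within MPI_COMM_WORLD and total number of ranks in this MPI job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}

If each of the two tasks prints "rank 0 of 1", my suspicion is that each mpiexec -n 1 task is starting its own independent one-process MPI job with its own MPI_COMM_WORLD, rather than two ranks of the same job, which would explain why they never see each other.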
Finally, I've also tried to run the application like this:
job submit /scheduler:my_head_node /jobtemplate:my_job_template /jobname:TransformationEngine-MPI /stdout:c:\temp\stdout.txt /stderr:c:\temp\stderr.txt /numnodes:2 mpiexec -hosts 2 hostname_1 hostname_2 c:\dev\mpi_job.exe
In this case the task failed with an error saying that the node allocation is managed by the job scheduler, so it can't be overridden through mpiexec's -hosts option (I don't have the exact error message at hand right now; someone else is hogging the cluster).
I don't know what else to try. Any ideas?
Thanks,
Alberto