Job Submit failed

Question
-
When I submit a job with HPC Cluster Manager, I get this error message:
"The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log for more detail."
But when I run the program from the command line, it succeeds.
Screenshot: http://img72.imageshack.us/img72/6343/errorh.jpg
How can I solve this problem?
Download the code: http://www.xun6.com/file/50ad30938/SARMPI.cpp.html
And where do I find the "application event log"?
Wednesday, February 24, 2010 3:03 PM
Answers
-
There is a deadlock in your code, so the job still hangs.
Look at the code in the if...else... block when myid == 0:
if (i%4 == 0) {
    phi = i - 180;
}
else {
    ......
    MPI_Send((void *)&phi, 1, MPI_INT, j, itag, MPI_COMM_WORLD);
}
Note that MPI_Recv is a blocking call. There is no MPI_Send() when i%4 == 0, but there is always a call to MPI_Recv() when myid != 0. So a deadlock happens:
1. No MPI_Send() from the root (myid == 0) when i%4 == 0;
2. Ranks other than the root still call MPI_Recv(), which waits for a message from the root.
To solve the problem, you need to redesign your code so that the root process calls MPI_Send() to match every MPI_Recv() posted by the other processes.
Another suggestion: don't hard-code the number of ranks in your cluster; use numprocs instead (in the send loop, you'd better use for (j = 1; j < numprocs; j++)).
Thanks,
James
Thank you!!
I will try your way and then report back to you.
The problem was solved by redesigning the code:
for (i = 0; i <= 360; i++)
- Marked as answer by YuJinSu Wednesday, March 3, 2010 12:32 PM
Monday, March 1, 2010 7:12 AM
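For illustration, a minimal sketch of the deadlock-free pattern described in the answer above: the root computes phi on every iteration and sends it to every other rank, so each worker's blocking MPI_Recv() is always matched by an MPI_Send(). The names phi, itag, and numprocs come from this thread; the loop bounds and the per-rank work are assumptions, not the actual SARMPI.cpp logic.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid, numprocs, i, j;
    int phi = 0;
    int itag = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    for (i = 0; i <= 360; i++) {
        if (myid == 0) {
            phi = i - 180;                      /* root computes phi...           */
            for (j = 1; j < numprocs; j++)      /* ...and sends it to every rank, */
                MPI_Send(&phi, 1, MPI_INT, j,   /* so the blocking Recv below is  */
                         itag, MPI_COMM_WORLD); /* always matched                 */
        } else {
            MPI_Recv(&phi, 1, MPI_INT, 0, itag, /* workers block here until the   */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* matching send arrives */
            /* ... per-rank work on phi would go here ... */
        }
    }

    MPI_Finalize();
    return 0;
}

Built against MS-MPI and launched with mpiexec -n 2, every iteration pairs one send per worker with one receive, so no rank waits forever; looping to numprocs instead of a hard-coded rank count also keeps the pattern correct on any cluster size.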
All replies
-
Hi Yujin,
The error comes from a missing VC CRT library on some or all of the compute nodes.
Suggestions:
1) Log on to each compute node and run your program from the command line directly: mpiexec -n 2 SARMPI.exe. If this passes, then go ahead and run it from the job scheduler.
2) Or you can use clusrun to check which compute nodes are OK: clusrun mpiexec -n 2 SARMPI.exe
Hope it helps,
Liwei
Thursday, February 25, 2010 3:50 AM
-
I remember that the first time you submit a job you are asked to enter the account password, but I never got that prompt.
Could that be the reason?
http://img444.imageshack.us/img444/5605/123yf.jpg
Thursday, February 25, 2010 7:50 AM
-
Installing Microsoft Visual Studio 2008 on my compute nodes solved that problem.
Logging on to each compute node and running the program from the command line directly (mpiexec -n 2 SARMPI.exe) is successful.
But when I use HPC Cluster Manager, the job state always stays at Running.
How can I solve this problem?
http://img40.imageshack.us/img40/6138/picturebc.jpg
Thursday, February 25, 2010 11:47 AM
-
The MPI job hangs, so the job state stays at Running. I was wondering how it could succeed with "mpiexec -n 2 SARMPI.exe".
I just took a quick look at your source code, and at least one point will cause a hang:
if (myid == 0)
{
    doSomething();
}
else
{
    doSomethingElse();
    MPI_Barrier(MPI_COMM_WORLD);
}
MPI_Barrier will not return until all the MPI processes have reached it. However, there is no way that rank 0 can reach this else block.
Remove it and try again.
Thanks,
James
Thursday, February 25, 2010 6:11 PM
-
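For illustration, a minimal, self-contained sketch (not taken from SARMPI.cpp) of the point above: MPI_Barrier is a collective call, so it only returns once every rank in the communicator has reached it. The root-only and worker-only work here is just placeholder printf output.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        printf("rank 0: root-only work\n");
    } else {
        printf("rank %d: worker work\n", myid);
    }

    /* Every rank calls the barrier, so nobody waits forever. Moving this
       call inside the else-branch above would hang the job, because rank 0
       never enters that branch. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
-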
The MPI job is hang so the job state keep as running. I was wondering how it can succeed with "mpiexec -n 2 SARMPI.exe "
.I just took a quick look of your source code, at least one point will cause hang:
if (myid == 0)
{
doSomething();
}
else
{
doSomethingElse();
MPI_Barrier(MPI_COMM_WORLD);
}
The MPI_Barrier will not return until all the MPI processes reached here. However, there is no way that rank 0 can reach this else block.
Remove it and try again.
Thanks,
James
Thanks!!
But I tried removing MPI_Barrier(MPI_COMM_WORLD);
The job state still stays at Running... >"<
Friday, February 26, 2010 2:43 AM