MPI_Recv - fatal error - received packet of unknown type (1234)

Question
-
Hello!
Windows 7, two machines (BIGBROTHER and LITTLESISTER), home local network, HPC Pack 2012 MS-MPI Redistributable Package with Service Pack 1.
When I try to use MPI_Send (on BIGBROTHER) and MPI_Recv (on LITTLESISTER), my program returns an error:
job aborted:
[ranks] message
[0] terminated
[1] fatal error
Fatal error in PMPI_Recv: Internal MPI error!, error stack:
MPI_Recv(buf=0x0042FBDC, count=64, MPI_CHAR, src=0, tag=0, MPI_COMM_WORLD, status=0x0042FB78) failed
[ch3:sock] received packet of unknown type (1234)
---------------------------------------------------------
My code here:
http://www.sourcepod.com/psznnl57-21098
-------------------------------------------------------------------------------
I tried to google the error message, but found nothing. It is a very strange error. And MPI_Send works fine, so why does MPI_Recv return a fatal error?
- Edited by ivanov.jaroslaw Monday, December 9, 2013 3:39 PM (can't find it on the web)
Sunday, December 8, 2013 10:32 AM
All replies
-
Hi Ivanov,
Does the code work locally on a single machine (i.e., on either BIGBROTHER or LITTLESISTER: mpiexec -n 2 mpi_hello.exe)? If it works locally but not when you use the two hosts, it's possible they're using mixed versions of MS-MPI (or a different MPI). Dependency Walker will be able to tell you which MPI mpi_hello is linked against.
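If you want a quick runtime sanity check as well, something like the following could help (just a sketch; MPI_Get_version only reports the MPI standard version, so Dependency Walker is still the more precise way to see which msmpi.dll the exe actually loads):
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[])
{
    int rank, major, minor, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_version(&major, &minor);          /* MPI standard version implemented */
    MPI_Get_processor_name(name, &name_len);  /* host this rank is running on */
    printf("rank %d on %s: MPI %d.%d\n", rank, name, major, minor);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}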
I just tried your code both locally and on two different hosts (mpiexec -hosts 2 host1 1 host2 1 mpi_hello.exe) and it works fine for me.
Can you tell me how you launched the job and the output of the following:
1) where mpiexec.exe
2) where msmpi.dll
Thanks
Anh
Monday, December 9, 2013 4:09 PM -
Yes, the code works locally on a single machine (on both BIGBROTHER and LITTLESISTER):
runas /user:BIGBROTHER\cluster "cmd /K mpiexec -n 2 \\BIGBROTHER\MPI\mpi_comm.exe"
runas /user:LITTLESISTER\cluster "cmd /K mpiexec -n 2 \\BIGBROTHER\MPI\mpi_comm.exe"
But when I try to launch mpi_comm.exe across both machines, I get the error: Fatal error in PMPI_Recv: Internal MPI error!, error stack: ...
on the BIGBROTHER machine (smpd is running on both):
runas /user:BIGBROTHER\cluster "cmd /K mpiexec -hosts 2 BIGBROTHER 1 LITTLESISTER 1 \\BIGBROTHER\MPI\mpi_comm.exe"
-------------------------------------------------
mpi_hello works fine:
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[])
{
    int procs_rank, procs_count;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &procs_count);
    MPI_Comm_rank(MPI_COMM_WORLD, &procs_rank);
    printf("\n Hello, World from process %d of %d", procs_rank, procs_count);
    MPI_Finalize();
    return 0;
}
but mpi_comm doesn't work (when I use MPI_Send and MPI_Recv):
#include <stdio.h>
#include "mpi.h"
int main(int argc, char *argv[])
{
    int rank, nproc, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    double start_time, end_time;
    char send_buf[64], recv_buf[64];
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Get_processor_name(processor_name, &name_len);
    start_time = MPI_Wtime();
    switch (rank)
    {
    case 0:
        /* rank 0 sends a 64-byte message to rank 1 */
        sprintf(send_buf, "Hello from process 0");
        MPI_Send(send_buf, 64, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        break;
    default:
        /* the other rank receives the message from rank 0 */
        MPI_Recv(recv_buf, 64, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
        printf("Process %d received %s\n", rank, recv_buf);
    }
    end_time = MPI_Wtime();
    printf("Work time %f sec\n", end_time - start_time);
    MPI_Finalize();
    return 0;
}
- Edited by ivanov.jaroslaw Monday, December 9, 2013 4:38 PM sample code
Monday, December 9, 2013 4:26 PM -
Hi Ivanov,
I did try mpi_comm (I just didn't know what you had named it, so I assumed it was still called mpi_hello) and it worked for me. Can you send me the output of the following:
runas /user:BIGBROTHER\cluster "cmd /K where mpiexec.exe && where msmpi.dll"
Monday, December 9, 2013 8:02 PM -
Can the BIGBROTHER user run programs on the LITTLESISTER machine? I doubt it.
/user:BIGBROTHER\cluster -> means local user credentials on the BIGBROTHER machine, but not on LITTLESISTER.
I suggest creating an account with the same name and password on both of your machines, or creating a domain environment and running mpiexec with domain user credentials.
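For example (just a sketch, substitute your own password), from an elevated command prompt on each machine:
net user cluster YourPasswordHere /add
As long as the name and password match exactly on both machines, workgroup pass-through authentication should let one machine accept the other's credentials.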
Daniel Drypczewski
Tuesday, December 10, 2013 3:50 AM -
And has the LITTLESISTER machine been granted access to the shared folder \\BIGBROTHER\MPI?
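If not, something like this on BIGBROTHER would do it (just a sketch; adjust the local path to wherever the MPI folder actually lives):
net share MPI=C:\MPI /grant:cluster,READ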
Daniel Drypczewski
Tuesday, December 10, 2013 3:53 AM -
Hi Anh, I decided that I should post screenshots of how my programs work:
1) http://postimg.org/gallery/18zlnl5vm/39a082ad/
So, the code for mpi_hello and mpi_comm is in my last message. mpi_hello works fine on BIGBROTHER, on LITTLESISTER, and on both together (whether I run it from BIGBROTHER or from LITTLESISTER). mpi_comm works fine on BIGBROTHER and on LITTLESISTER, but NOT on both together (MPI_Recv error).
In the pictures I see something strange about mpi_comm. Process 0 should send a message to process 1, and process 1 should print that it got the message. So process 0 should start earlier, yes? But in the screenshot I see that process 1's output appears first, and then process 0's. Very strange, I think.
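Or maybe it is only how mpiexec buffers and forwards the output of each process? Just in case, I could flush stdout after every printf, a small sketch of the change (the rest of mpi_comm stays the same):
printf("Process %d received %s\n", rank, recv_buf);
fflush(stdout); /* push the line out immediately instead of leaving it in the stdio buffer */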
Do you understand that I run two programs, mpi_hello and mpi_comm? Does mpi_comm work on your machines?
2) About mpiexec.exe and msmpi.dll.
C:\Program Files\Microsoft HPC Pack 2012\Bin\mpiexec.exe
C:\Windows\System32\msmpi.dll
But why do you ask me about that, if mpi_hello works fine?
Saturday, December 14, 2013 4:42 PM -
Hi Daniel,
the BIGBROTHER machine can run programs on the LITTLESISTER machine; you can see it in the screenshots (http://postimg.org/gallery/18zlnl5vm/39a082ad/).
I start mpiexec on the BIGBROTHER machine as cluster/cluster and I can make mpiexec run on the LITTLESISTER machine (and vice versa).
I created a user "cluster" with the password "cluster" on both machines. Maybe I shouldn't run mpiexec through RUNAS as BIGBROTHER\cluster or LITTLESISTER\cluster? When I start the mpi_hello program it works fine. The error appears when I start the mpi_comm program (the same mpi_comm code as in my post above).
Saturday, December 14, 2013 4:48 PM -
Yes, mpi_hello works fine on both machines:
http://postimg.org/image/rf7poxkap/ - start
http://postimg.org/image/6m0pms2dd/ - work fine
Saturday, December 14, 2013 4:50 PM -
Yes, mpi_comm works fine on my machine. I also tried setting up a different user and ran commands similar to yours. If you log in to both boxes as the user cluster, does it still work?
If not, can you try running this:
runas /user:BIGBROTHER\cluster "cmd /K mpiexec -env MSMPI_DUMP_MODE 4 -hosts 2 BIGBROTHER 1 LITTLESISTER 1 \\BIGBROTHER\MPI\mpi_comm.exe"
In the user profile directory (of the user cluster) you should see the dump files for each process. If you can upload the dump files, we can help investigate this.
Thanks
Anh
Tuesday, December 17, 2013 9:53 PM -
When I log in as the cluster user on both machines (in my last attempt I logged in as the administrator Yaroslaw on both machines), there is an error when I try to start mpi_hello or mpi_comm:
Aborting: Access denied by node 'LITTLESISTER'
The smpd daemon is running with user credentials which are different from the user running the job.
Maybe I should create the "cluster" user with special options? Or create a network user? I created the cluster users as ordinary local users on both machines separately (just with the same name and password). Is that correct? Maybe there are two different cluster users?
Oh... Now, when I try to start my program with the previous scheme, I get the error "Aborting: Access denied..."
Oh my god. What am I doing wrong?
I power on machine 1 and machine 2 in VMware. Machine 1 has a shared folder, and machine 2 can see it. I set up the firewall on both machines (mpi_hello.exe, smpd and mpiexec).
I run smpd on both machines:
runas /user:BIGBROTHER\cluster "C:\Program Files\Microsoft HPC Pack 2012\Bin\smpd.exe -d"
on machine 1
runas /user:LITTLESISTER\cluster "C:\Program Files\Microsoft HPC Pack 2012\Bin\smpd.exe -d"
on machine 2
On machine 1 I run:
runas /user:BIGBROTHER\cluster "cmd /K mpiexec -hosts 2 BIGBROTHER 1 LITTLESISTER 1 \\BIGBROTHER\MPI\mpi_comm.exe"
And it returns the access denied error.
Wednesday, December 18, 2013 9:10 AM