Mpi v7 smpd issue (error 1726 and 1722)

  • Question

  • Hi, I need your help.

    I am trying to run my program on several hosts with this command:

    mpiexec -hosts 2 host1 host2 program

    But SMPD fails to connect to the other host.

    Information about the system:

    1. All hosts use the same user account and password.

    2. All hosts have the same MPI version.

    3. The user is an administrator on all hosts.

    4. The firewall is disabled.

    5. All hosts are on the same network.

    Slave Host:

    [02:3972] ERROR: Failed to connect back to parent error 1722.

    Full debug information of slave host:

    [-1:5824] Authentication completed. Successfully obtained Context for Client.
    [-1:5824] version check complete, using PMP version 2.
    [-1:5824] create manager process (using smpd daemon credentials)
    [-1:5824] smpd reading the port string from the manager
    [-1:4284] Launching smpd manager instance.
    [-1:4284] created set for manager listener, 20
    [-1:4284] smpd manager listening on port 49593
    [-1:4284] manager writing port back to smpd.
    [-1:5824] closing the pipe to the manager
    [-1:4284] Authentication completed. Successfully obtained Context for Client.
    [-1:4284] Authorization completed.
    [-1:4284] version check complete, using PMP version 2.
    [-1:4284] Received session header from parent id=2, parent=1, level=1
    [02:4284] Connecting back to parent using host localhost and endpoint 49451
    [02:4284] ERROR: Failed to connect back to parent error 1722.
    [02:4284] smpd manager successfully stopped listening.
    [02:4284] SMPD exiting with error code 4294967293.

    Master Host:

    Aborting: smpd on HOST1 is unable to connect to the smpd manager on host2:49463 error 1726

    Full debug information of master host:

    [-1:5528] Authentication completed. Successfully obtained Context for Client.
    [-1:5528] version check complete, using PMP version 2.
    [-1:5528] create manager process (using smpd daemon credentials)
    [-1:5528] smpd reading the port string from the manager
    [-1:4084] Launching smpd manager instance.
    [-1:4084] created set for manager listener, 232
    [-1:4084] smpd manager listening on port 49641
    [-1:4084] manager writing port back to smpd.
    [-1:5528] closing the pipe to the manager
    [-1:4084] Authentication completed. Successfully obtained Context for Client.
    [-1:4084] Authorization completed.
    [-1:4084] version check complete, using PMP version 2.
    [-1:4084] Received session header from parent id=1, parent=0, level=0
    [01:4084] Connecting back to parent using host localhost and endpoint 49635
    [01:4084] Authentication completed. Successfully obtained Context for Client.
    [01:4084] Authorization completed.
    [01:4084] handling command SMPD_CONNECT src=0
    [01:4084] now connecting to host2
    [01:4084] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
    [01:4084] HOST1 posting a re-connect to host2:49463 in left child context.
    [01:4084] sending abort command to parent context.
    [01:4084] posting command SMPD_ABORT to parent, src=1, dest=0.
    [01:4084] ERROR: Failed to connect to SMPD Manager Instance error 1726
    [01:4084] ERROR: smpd running on HOST1 is unable to connect to smpd service on c
    host2:8677
    [01:4084] Handling cmd=SMPD_ABORT result
    [01:4084] cmd=SMPD_ABORT result will be handled locally
    [01:4084] parent terminated unexpectedly - initiating cleaning up.
    [01:4084] no child processes to kill - exiting with error code -1

    I have no idea what the cause is, because I have already run the program on other hosts without any problem.
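For reference, the numeric codes in the logs above are standard Windows error codes: 1722 is RPC_S_SERVER_UNAVAILABLE and 1726 is RPC_S_CALL_FAILED, and the exit code 4294967293 is -3 printed as an unsigned 32-bit value. A minimal Python sketch of that lookup follows. The helper names `describe` and `to_signed32` are made up for illustration and are not part of MS-MPI.

```python
# Windows error codes that appear in the smpd logs above.
KNOWN_CODES = {
    1722: "RPC_S_SERVER_UNAVAILABLE (The RPC server is unavailable)",
    1726: "RPC_S_CALL_FAILED (The remote procedure call failed)",
}


def describe(code):
    """Return a readable name for a code from the logs, if it is known."""
    return KNOWN_CODES.get(code, "unknown code")


def to_signed32(value):
    """Reinterpret an unsigned 32-bit exit code as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    for code in (1722, 1726):
        print(code, describe(code))
    # "SMPD exiting with error code 4294967293" is -3 when read as signed.
    print(to_signed32(4294967293))
```

Both codes point at the connection between the smpd processes rather than at the MPI program itself.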



    • Edited by AwK55 Friday, April 29, 2016 5:40 PM additional info
    Friday, April 29, 2016 5:33 PM

All replies

  • A couple questions:

    1) Did you install MSMPISetup.exe on both the hosts or did you manually copy the MSMPI binaries?

    2) Which host are you running mpiexec from, is it host1 or host2?

    Can you please run the following experiments and send us the results?

    1) On host1, start a console with smpd -d 3. Then from host1, run mpiexec -d 3 -host host1 -n 1 hostname

    2) On host2, start a console with smpd -d 3. Then from host1 (note that you want to run this from host1, NOT host2), run mpiexec -d 3 -host host2 -n 1 hostname

    3) From host2, run mpiexec -d 3 -host host1 -n 1 hostname

    Thanks

    Anh

    Friday, April 29, 2016 6:17 PM
  • 1. I installed MSMPISetup.exe.

    2. I ran it from both hosts.

    Experiments

    1. (from smpd window)

    [-1:6024] Authentication completed. Successfully obtained Context for Client.
    [-1:6024] version check complete, using PMP version 2.
    [-1:6024] create manager process (using smpd daemon credentials)
    [-1:6024] smpd reading the port string from the manager
    [-1:1936] Launching smpd manager instance.
    [-1:1936] created set for manager listener, 20
    [-1:1936] smpd manager listening on port 51569
    [-1:1936] manager writing port back to smpd.
    [-1:6024] closing the pipe to the manager
    [-1:1936] Authentication completed. Successfully obtained Context for Client.
    [-1:1936] Authorization completed.
    [-1:1936] version check complete, using PMP version 2.
    [-1:1936] Received session header from parent id=1, parent=0, level=0
    [01:1936] Connecting back to parent using host localhost and endpoint 51567
    [01:1936] Authentication completed. Successfully obtained Context for Client.
    [01:1936] Authorization completed.
    [01:1936] handling command SMPD_COLLECT src=0
    [01:1936] handling command SMPD_STARTDBS src=0
    [01:1936] sending start_dbs result command kvs = a2f02c16-dac4-4dd7-a257-29de78b
    3f3a1.
    [01:1936] handling command SMPD_LAUNCH src=0
    [01:1936] Successfully handled bcast nodeids command.
    [01:1936] setting environment variable: <MPIEXEC_HOSTNAME> = <HOST1>
    [01:1936] env: PMI_SIZE=1
    [01:1936] env: PMI_KVS=a2f02c16-dac4-4dd7-a257-29de78b3f3a1
    [01:1936] env: PMI_DOMAIN=9d313d54-71f0-44df-ab39-4adc18d8f2e4
    [01:1936] env: PMI_HOST=localhost
    [01:1936] env: PMI_PORT=0dd901d4-4cef-4fa5-8333-ceb0b1e72e61
    [01:1936] env: PMI_SMPD_ID=1
    [01:1936] env: PMI_APPNUM=0
    [01:1936] env: PMI_NODE_IDS=s
    [01:1936] env: PMI_RANK_AFFINITIES=a
    [01:1936] searching for 'hostname' in workdir 'D:\kiselev\crowdEnsemble\pSPEA2'
    [01:1936] searching for 'hostname' in path ''
    [01:1936] searching for 'hostname' in system path
    [01:1936] D:\kiselev\crowdEnsemble\pSPEA2>CreateProcess(C:\Windows\SYSTEM32\host
    name.exe hostname)
    [01:1936] env: PMI_RANK=0
    [01:1936] env: PMI_SMPD_KEY=0
    [01:1936] read 5 bytes from stdout
    [01:1936] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:1936] read 2 bytes from stdout
    [01:1936] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:1936] reading failed, assuming stdout is closed. error 0xc000014b
    [01:1936] process_id=0 process refcount == 1, stdout closed.
    [01:1936] reading failed, assuming stderr is closed. error 0xc000014b
    [01:1936] process_id=0 process refcount == 0, stderr closed.
    [01:1936] process_id=0 process refcount == 0, waiting for the process to finish
    exiting.
    [01:1936] creating an exit command for rank 0, pid 5924, exit code 0.
    [01:1936] posting command SMPD_EXIT to parent, src=1, dest=0.
    [01:1936] Handling cmd=SMPD_STDOUT result
    [01:1936] cmd=SMPD_STDOUT result will be handled locally
    [01:1936] Handling cmd=SMPD_STDOUT result
    [01:1936] cmd=SMPD_STDOUT result will be handled locally
    [01:1936] handling command SMPD_CLOSE src=0
    [01:1936] sending 'closed' command to parent context
    [01:1936] posting command SMPD_CLOSED to parent, src=1, dest=0.
    [01:1936] Handling cmd=SMPD_EXIT result
    [01:1936] cmd=SMPD_EXIT result will be handled locally
    [01:1936] Handling cmd=SMPD_CLOSED result
    [01:1936] cmd=SMPD_CLOSED result will be handled locally
    [01:1936] smpd manager successfully stopped listening.
    [01:1936] SMPD exiting with error code 0.

    2. 

    c:\users\name>mpiexec -d 3 -host host2 -n 1 hostname
    [00:3588] host tree:
    [00:3588]  host: host2, parent: 0, id: 1
    [00:3588] mpiexec started smpd manager listening on port 51586
    [00:3588] HOST1 posting a re-connect to host2:49836 in left child context.

    Aborting: mpiexec on HOST1 is unable to connect to the smpd manager on host2:49
    836 error 1726
    [00:3588] ERROR: Failed to connect to SMPD Manager Instance error 1726
    [00:3588] smpd manager successfully stopped listening.

    3. The same as the previous one:

    c:\user\name>mpiexec -d 3 -host host1 -n 1 hostname
    [00:5924] host tree:
    [00:5924] host: host1, parent: 0, id: 1
    [00:5924] mpiexec started smpd manager listening on port 49850
    [00:5924] HOST2 posting a re-connect to host1:51611 in left child context.

    Aborting: mpiexec on HOST2 is unable to connect to the smpd manager on host1:51
    611 error 1726
    [00:5924] ERROR: Failed to connect to SMPD Manager Instance error 1726
    [00:5924] smpd manager successfully stopped listening.

    Thanks

    Saturday, April 30, 2016 5:11 AM
  • Hi there,

    In experiment 2), can you provide the output of the smpd -d 3 console on host2 (while running mpiexec -d 3 -host host2 -n 1 hostname from host1)?

    If you use the IP address instead of the name host1/host2, does it work?

    Thanks

    Anh

    Saturday, April 30, 2016 5:18 AM
  • 1. Host2 (smpd -d 3)

    C:\Users\name>smpd -d 3
    [-1:5748] Launching SMPD service.
    [-1:5748] smpd listening on port 8677
    [-1:5748] Authentication completed. Successfully obtained Context for Client.
    [-1:5748] version check complete, using PMP version 2.
    [-1:5748] create manager process (using smpd daemon credentials)
    [-1:5748] smpd reading the port string from the manager
    [-1:1892] Launching smpd manager instance.
    [-1:1892] created set for manager listener, 232
    [-1:1892] smpd manager listening on port 49927
    [-1:1892] manager writing port back to smpd.
    [-1:5748] closing the pipe to the manager
    [-1:1892] Authentication completed. Successfully obtained Context for Client.
    [-1:1892] Authorization completed.
    [-1:1892] version check complete, using PMP version 2.
    [-1:1892] Received session header from parent id=1, parent=0, level=0
    [01:1892] Connecting back to parent using host localhost and endpoint 51859
    [01:1892] ERROR: Failed to connect back to parent error 1722.
    [01:1892] smpd manager successfully stopped listening.
    [01:1892] SMPD exiting with error code 4294967293.

    2. with IP address (Not work)

    host1

    c:\user\name>mpiexec -d 3 -host 168.192.149.37 -n 1 hostname
    [00:864] host tree:
    [00:864]  host: 168.192.149.37, parent: 0, id: 1
    [00:864] mpiexec started smpd manager listening on port 51848
    [00:864] HOST1 posting a re-connect to 168.192.149.37:49922 in left child context
    .
    [00:864] Authentication completed. Successfully obtained Context for Client.
    [00:864] Authorization completed.
    [00:864] version check complete, using PMP version 2.
    [00:864] posting command SMPD_COLLECT to left child, src=0, dest=1.
    [00:864] Handling cmd=SMPD_COLLECT result
    [00:864] cmd=SMPD_COLLECT result will be handled locally
    [00:864] Finished collecting hardware summary.
    [00:864] posting command SMPD_STARTDBS to left child, src=0, dest=1.
    [00:864] Handling cmd=SMPD_STARTDBS result
    [00:864] cmd=SMPD_STARTDBS result will be handled locally
    [00:864] start_dbs succeeded, kvs_name: '21128e07-8fb1-403b-ba91-c1385dc42f9e',
    domain_name: '1dedfd29-45e2-4eab-be5c-ccba06625586'
    [00:864] creating a process group of size 1 on node 0 called 21128e07-8fb1-403b-
    ba91-c1385dc42f9e
    [00:864] launching the processes.
    [00:864] posting command SMPD_LAUNCH to left child, src=0, dest=1.
    [00:864] Handling cmd=SMPD_LAUNCH result
    [00:864] cmd=SMPD_LAUNCH result will be handled locally
    [00:864] successfully launched process 0
    [00:864] root process launched, starting stdin redirection.
    [00:864] Authentication completed. Successfully obtained Context for Client.
    [00:864] Authorization completed.
    [00:864] handling command SMPD_STDOUT src=1
    [00:864] Handling SMPD_STDOUT
    [00:864] Decoding stdout/stderr buffer 636F6D703130
    host2

    [00:864] handling command SMPD_STDOUT src=1
    [00:864] Handling SMPD_STDOUT
    [00:864] Decoding stdout/stderr buffer 0D0A

    [00:864] handling command SMPD_EXIT src=1
    [00:864] saving exit code: rank 0, exitcode 0, pg <21128e07-8fb1-403b-ba91-c1385
    dc42f9e>
    [00:864] process exited without calling init.
    [00:864] process exited before anyone has called init.
    [00:864] last process exited, tearing down the job tree.
    [00:864] posting command SMPD_CLOSE to left child, src=0, dest=1.
    [00:864] Handling cmd=SMPD_CLOSE result
    [00:864] cmd=SMPD_CLOSE result will be handled locally
    [00:864] handling command SMPD_CLOSED src=1
    [00:864] closed command received from left child.
    [00:864] smpd manager successfully stopped listening.
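The "Decoding stdout/stderr buffer" lines in the log above carry the child process output as hexadecimal text. As a small illustration (the helper name `decode_buffer` is hypothetical, not an MS-MPI function), the buffers can be turned back into bytes with Python's built-in `bytes.fromhex`:

```python
def decode_buffer(hex_text):
    """Convert a hex string from a 'Decoding stdout/stderr buffer' line to bytes."""
    return bytes.fromhex(hex_text)


if __name__ == "__main__":
    # The second buffer in the log is just the line ending written by hostname.exe.
    print(repr(decode_buffer("0D0A")))          # b'\r\n'
    # The first buffer is six bytes long, the host name itself.
    print(len(decode_buffer("636F6D703130")))   # 6
```

The sizes line up with the "read 6 bytes from stdout" and "read 2 bytes from stdout" entries in the smpd log from host2 in this same post.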

    host2 (smpd -d 3)

    C:\Users\name>smpd -d 3
    [-1:1896] Launching SMPD service.
    [-1:1896] smpd listening on port 8677
    [-1:1896] Authentication completed. Successfully obtained Context for Client.
    [-1:1896] version check complete, using PMP version 2.
    [-1:1896] create manager process (using smpd daemon credentials)
    [-1:1896] smpd reading the port string from the manager
    [-1:5800] Launching smpd manager instance.
    [-1:5800] created set for manager listener, 232
    [-1:5800] smpd manager listening on port 49958
    [-1:5800] manager writing port back to smpd.
    [-1:1896] closing the pipe to the manager
    [-1:5800] Authentication completed. Successfully obtained Context for Client.
    [-1:5800] Authorization completed.
    [-1:5800] version check complete, using PMP version 2.
    [-1:5800] Received session header from parent id=1, parent=0, level=0
    [01:5800] Connecting back to parent using host 192.168.149.36 and endpoint 51889
    [01:5800] Authentication completed. Successfully obtained Context for Client.
    [01:5800] Authorization completed.
    [01:5800] handling command SMPD_COLLECT src=0
    [01:5800] handling command SMPD_STARTDBS src=0
    [01:5800] sending start_dbs result command kvs = a6565433-2603-41b5-ba96-c250363
    dbaeb.
    [01:5800] handling command SMPD_LAUNCH src=0
    [01:5800] Successfully handled bcast nodeids command.
    [01:5800] setting environment variable: <MPIEXEC_HOSTNAME> = <HOST1>
    [01:5800] env: PMI_SIZE=1
    [01:5800] env: PMI_KVS=a6565433-2603-41b5-ba96-c250363dbaeb
    [01:5800] env: PMI_DOMAIN=f7e5f858-17a6-4752-97e3-6cd9abf6fc7d
    [01:5800] env: PMI_HOST=localhost
    [01:5800] env: PMI_PORT=ca74f44a-f723-43d8-a5de-1a421b762d96
    [01:5800] env: PMI_SMPD_ID=1
    [01:5800] env: PMI_APPNUM=0
    [01:5800] env: PMI_NODE_IDS=s
    [01:5800] env: PMI_RANK_AFFINITIES=a
    [01:5800] searching for 'hostname' in workdir 'c:\user\'
    [01:5800] searching for 'hostname' in path ''
    [01:5800] searching for 'hostname' in system path
    [01:5800] c:\user\>CreateProcess(C:\Windows\SYSTEM32\host
    name.exe hostname)
    [01:5800] env: PMI_RANK=0
    [01:5800] env: PMI_SMPD_KEY=0
    [01:5800] read 6 bytes from stdout
    [01:5800] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:5800] read 2 bytes from stdout
    [01:5800] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:5800] ERROR: unable to post a read on stdout context, error 109.
    [01:5800] process_id=0 process refcount == 1, stdout closed.
    [01:5800] reading failed, assuming stderr is closed. error 0xc000014b
    [01:5800] process_id=0 process refcount == 0, stderr closed.
    [01:5800] process_id=0 process refcount == 0, waiting for the process to finish
    exiting.
    [01:5800] creating an exit command for rank 0, pid 4972, exit code 0.
    [01:5800] posting command SMPD_EXIT to parent, src=1, dest=0.
    [01:5800] Handling cmd=SMPD_STDOUT result
    [01:5800] cmd=SMPD_STDOUT result will be handled locally
    [01:5800] Handling cmd=SMPD_STDOUT result
    [01:5800] cmd=SMPD_STDOUT result will be handled locally
    [01:5800] handling command SMPD_CLOSE src=0
    [01:5800] sending 'closed' command to parent context
    [01:5800] posting command SMPD_CLOSED to parent, src=1, dest=0.
    [01:5800] Handling cmd=SMPD_EXIT result
    [01:5800] cmd=SMPD_EXIT result will be handled locally
    [01:5800] Handling cmd=SMPD_CLOSED result
    [01:5800] cmd=SMPD_CLOSED result will be handled locally
    [01:5800] smpd manager successfully stopped listening.
    [01:5800] SMPD exiting with error code 0.


    I should add that MPICH2 was previously installed on these computers, but I uninstalled it and cleaned up the registry.
    • Edited by AwK55 Saturday, April 30, 2016 6:51 AM add some info
    Saturday, April 30, 2016 6:22 AM
  • Hi there,

    I think this is an issue with MS-MPI v7 not handling link-local IPv6 addresses properly. That is why, on the remote host, the smpd manager tried to connect back to localhost instead of to the other host.
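To check whether a host name resolves to a link-local address (fe80::/10 for IPv6, 169.254.0.0/16 for IPv4), something like the following Python sketch could be run on each machine. It uses only the standard `socket` and `ipaddress` modules. The function names are illustrative, and the script only reports what the local resolver returns, which may or may not match what smpd itself sees.

```python
import ipaddress
import socket


def is_link_local(address):
    """Return True for link-local addresses such as fe80::1 or 169.254.1.1."""
    # getaddrinfo may append an IPv6 zone index ("%12"); drop it before parsing.
    bare = address.split("%", 1)[0]
    return ipaddress.ip_address(bare).is_link_local


def classify_host(host):
    """Resolve a host name and return sorted (address, is_link_local) pairs."""
    infos = socket.getaddrinfo(host, None)
    addresses = sorted({info[4][0] for info in infos})
    return [(addr, is_link_local(addr)) for addr in addresses]
```

For example, calling `classify_host("host2")` on host1 would list every address the name resolves to and flag the link-local ones.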

    This issue is fixed in MS-MPI v7.1, which will be released in a couple of weeks. In the meantime, the workaround is to use the IP addresses. I'll post a reply here when v7.1 has been released.
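One way to apply the IP address workaround without typing addresses by hand is to resolve each host name to an IPv4 address and build the mpiexec command from the results. The Python sketch below assumes the same `mpiexec -hosts <count> <host...> <program>` form used in the original question. The function names and the injectable `resolver` parameter are illustrative only.

```python
import socket


def ipv4_of(host, resolver=socket.getaddrinfo):
    """Return the first IPv4 address the resolver reports for host."""
    infos = resolver(host, None, socket.AF_INET)
    if not infos:
        raise LookupError("no IPv4 address found for " + host)
    return infos[0][4][0]


def mpiexec_with_ips(hosts, program, resolver=socket.getaddrinfo):
    """Build an mpiexec argument list that names hosts by IPv4 address."""
    ips = [ipv4_of(host, resolver) for host in hosts]
    return ["mpiexec", "-hosts", str(len(ips))] + ips + [program]
```

The returned list can be passed to `subprocess.run`, for example `subprocess.run(mpiexec_with_ips(["host1", "host2"], "program"))`.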

    Thanks

    Anh

    Monday, May 2, 2016 10:01 PM
  • Hi,

    The pre-release of MS-MPI v7.1 is now available for download:

    https://www.microsoft.com/en-us/download/details.aspx?id=52042

    We would appreciate it if you could give it a try and let us know whether the reported issue is gone with the new version.

    Thanks

    Anh

    • Marked as answer by AwK55 Tuesday, May 10, 2016 10:49 AM
    Wednesday, May 4, 2016 8:12 PM
  • Hi, 

    First of all, I appreciate your support.

    MS-MPI v7.1 works for me, and all computers connect to each other. However, this error still appears sometimes, and the connection between the computers is very slow. I suspect it may be a network problem now.

    Thanks again.


    • Edited by AwK55 Tuesday, May 10, 2016 1:04 PM
    Tuesday, May 10, 2016 10:49 AM