MS-MPI Launch service error running on remote host

  • Question

  • I have successfully run a job on the local host (mpiexec and the Launch service on the same machine) using the Launch service that is new in MS-MPI v7. However, every attempt to run the same job on a remote host fails with the following error:

    ERROR: Failed RpcCliCreateContext error 1722
    
    Aborting: mpiexec on localhost is unable to connect to the smpd service on remotehost:8677
    Other MPI error, error stack:
    connect failed - The RPC server is unavailable. (errno 1722)
    

    Before starting the Launch service, I have to disable the HPC MPI service first. I have also provided the remotehost with the user password. I assume that HPC Pack has taken care of the firewall settings.

    Is there anything else I should do?

    Regards,

    Costas

    Friday, February 12, 2016 11:27 AM

Answers

  • Hi Costas,

    A couple of things we can try:

    1) Double-check the Windows Firewall settings and make sure the appropriate rules for the launch service/smpd/mpiexec are set

    2) If you log in to the remote host and run the following command, does it work?
    mpiexec -host localhost -n 1 hostname

    Anh

    Friday, February 12, 2016 12:03 PM
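
The error above says mpiexec could not reach the smpd service on port 8677, so a quick reachability check can complement the firewall review. The following is a minimal sketch, not part of MS-MPI: `can_connect` is a hypothetical helper name, and the check only shows whether a TCP connection can be opened. It says nothing about RPC authentication or credentials.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("remotehost", 8677)` from the machine that runs mpiexec, with `remotehost` replaced by the real node name, would indicate whether the port named in the error message is reachable at all.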

All replies

  • Hi Anh,

    1) Firewall inbound/outbound rules for the launch service/smpd/mpiexec are in place on both nodes I am currently testing on.

    2) I entered the command from the remote host and after a while I got the exact same error.

    Costas

    Friday, February 12, 2016 12:18 PM
  • Hi Costas,

    On the local AND remote host, what is the output of the following command? (Please copy/paste the output for both the local and remote host, even though they might look identical.)

    sc query MSMPILaunchSvc

    Thanks

    Anh

    Tuesday, February 16, 2016 6:50 PM
  • Hello Anh,

    Sorry for the delay, I had a big queue of computations waiting and could not stop the smpd service. Here are the outputs:

    Local host:

    SERVICE_NAME: MSMPILaunchSvc 
            TYPE               : 10  WIN32_OWN_PROCESS  
            STATE              : 4  RUNNING 
                                    (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
            WIN32_EXIT_CODE    : 0  (0x0)
            SERVICE_EXIT_CODE  : 0  (0x0)
            CHECKPOINT         : 0x0
            WAIT_HINT          : 0x0
    

    Remote host:

    SERVICE_NAME: MSMPILaunchsvc 
            TYPE               : 10  WIN32_OWN_PROCESS  
            STATE              : 4  RUNNING 
                                    (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
            WIN32_EXIT_CODE    : 0  (0x0)
            SERVICE_EXIT_CODE  : 0  (0x0)
            CHECKPOINT         : 0x0
            WAIT_HINT          : 0x0
    

    Costas

    Friday, March 18, 2016 9:34 AM
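
When several nodes have to be compared, the `sc query` output can be checked programmatically instead of by eye. This is a rough sketch under the assumption that the output has the layout shown in the post above; `parse_sc_query` is a hypothetical helper, and the sample text is abbreviated from that post.

```python
import re


def parse_sc_query(text: str) -> dict:
    """Extract the service name and state from `sc query` output."""
    name = re.search(r"SERVICE_NAME:\s*(\S+)", text)
    state = re.search(r"STATE\s*:\s*(\d+)\s+(\w+)", text)
    return {
        "name": name.group(1) if name else None,
        "state_code": int(state.group(1)) if state else None,
        "state": state.group(2) if state else None,
    }


# Abbreviated copy of the local-host output quoted in the thread.
SAMPLE = """\
SERVICE_NAME: MSMPILaunchSvc
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
"""

info = parse_sc_query(SAMPLE)
print(info["name"], info["state"])
```

A state of `RUNNING` on both hosts, as reported here, only shows that the service process is up; it does not show that a remote client can authenticate against it.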
  • Hi Costas,

    Can you try the following steps and provide us with the output:

    1) stop the msmpilaunchsvc on the remote host (net stop msmpilaunchsvc)

    2) open a console on the remote host and type "smpd -d 3" - if the console output has ERROR you don't need step 3).

    3) on the localhost, run the command mpiexec -d 3 -host remotehost -n 1 hostname

    Thanks

    Anh

    Monday, March 21, 2016 10:26 PM
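
Step 2 asks whether the `smpd -d 3` console output contains ERROR. For a long trace, that check could be done with a small filter such as the sketch below. `find_error_lines` is a hypothetical helper, and the sample lines are copied from the remote-host trace posted later in this thread.

```python
def find_error_lines(log_text: str) -> list[str]:
    """Return every line of a debug trace that contains the marker 'ERROR'."""
    return [line for line in log_text.splitlines() if "ERROR" in line]


# A few lines from the remote-host (atlas) trace in this thread.
SAMPLE_TRACE = """\
[01:5004] read 2 bytes from stdout
[01:5004] posting command SMPD_STDOUT to parent, src=1, dest=0.
[01:5004] ERROR: unable to post a read on stdout context, error 109.
[01:5004] process_id=0 process refcount == 1, stdout closed.
"""

for line in find_error_lines(SAMPLE_TRACE):
    print(line)
```

Not every match is necessarily fatal. Win32 error 109 is ERROR_BROKEN_PIPE, which is what a reader gets once the process on the other end of a pipe has closed it, so a line like the one above may simply reflect the launched process exiting and closing its stdout.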
  • Hi Anh,

    Here is the output from the localhost (thor):

    C:\Users\cyamin>mpiexec -d 3 -host atlas -n 1 hostname
    [00:9756] host tree:
    [00:9756]  host: atlas, parent: 0, id: 1
    [00:9756] mpiexec started smpd manager listening on port 61568
    [00:9756] THOR posting a re-connect to atlas:51424 in left child context.
    [00:9756] Authentication completed. Successfully obtained Context for Client.
    [00:9756] Authorization completed.
    [00:9756] version check complete, using PMP version 2.
    [00:9756] posting command SMPD_COLLECT to left child, src=0, dest=1.
    [00:9756] Handling cmd=SMPD_COLLECT result
    [00:9756] cmd=SMPD_COLLECT result will be handled locally
    [00:9756] Finished collecting hardware summary.
    [00:9756] posting command SMPD_STARTDBS to left child, src=0, dest=1.
    [00:9756] Handling cmd=SMPD_STARTDBS result
    [00:9756] cmd=SMPD_STARTDBS result will be handled locally
    [00:9756] start_dbs succeeded, kvs_name: '3c70836e-cd31-46cf-9755-ce3388c0d0cc', domain_name: '07263852-3f5e-4d68-a1d7-ed526fe2ac7c'
    [00:9756] creating a process group of size 1 on node 0 called 3c70836e-cd31-46cf-9755-ce3388c0d0cc
    [00:9756] launching the processes.
    [00:9756] posting command SMPD_LAUNCH to left child, src=0, dest=1.
    [00:9756] Handling cmd=SMPD_LAUNCH result
    [00:9756] cmd=SMPD_LAUNCH result will be handled locally
    [00:9756] successfully launched process 0
    [00:9756] root process launched, starting stdin redirection.
    [00:9756] Authentication completed. Successfully obtained Context for Client.
    [00:9756] Authorization completed.
    [00:9756] handling command SMPD_STDOUT src=1
    [00:9756] Handling SMPD_STDOUT
    [00:9756] Decoding stdout/stderr buffer 41544C4153
    ATLAS[00:9756] handling command SMPD_STDOUT src=1
    [00:9756] Handling SMPD_STDOUT
    [00:9756] Decoding stdout/stderr buffer 0D0A
    
    [00:9756] handling command SMPD_EXIT src=1
    [00:9756] saving exit code: rank 0, exitcode 0, pg <3c70836e-cd31-46cf-9755-ce3388c0d0cc>
    [00:9756] process exited without calling init.
    [00:9756] process exited before anyone has called init.
    [00:9756] last process exited, tearing down the job tree.
    [00:9756] posting command SMPD_CLOSE to left child, src=0, dest=1.
    [00:9756] Handling cmd=SMPD_CLOSE result
    [00:9756] cmd=SMPD_CLOSE result will be handled locally
    [00:9756] handling command SMPD_CLOSED src=1
    [00:9756] closed command received from left child.
    [00:9756] smpd manager successfully stopped listening.

    Here is the output from the remote host (atlas):

    [-1:1560] Launching SMPD service.
    [-1:1560] smpd listening on port 8677
    [-1:1560] Authentication completed. Successfully obtained Context for Client.
    [-1:1560] version check complete, using PMP version 2.
    [-1:1560] create manager process (using smpd daemon credentials)
    [-1:1560] smpd reading the port string from the manager
    [-1:5004] Launching smpd manager instance.
    [-1:5004] created set for manager listener, 200
    [-1:5004] smpd manager listening on port 51530
    [-1:5004] manager writing port back to smpd.
    [-1:1560] closing the pipe to the manager
    [-1:5004] Authentication completed. Successfully obtained Context for Client.
    [-1:5004] Authorization completed.
    [-1:5004] version check complete, using PMP version 2.
    [-1:5004] Received session header from parent id=1, parent=0, level=0
    [01:5004] Connecting back to parent using host 192.168.0.4 and endpoint 61604
    [01:5004] Authentication completed. Successfully obtained Context for Client.
    [01:5004] Authorization completed.
    [01:5004] handling command SMPD_COLLECT src=0
    [01:5004] handling command SMPD_STARTDBS src=0
    [01:5004] sending start_dbs result command kvs = bac843f7-c7bc-46dc-8bcf-fe59210de120.
    [01:5004] handling command SMPD_LAUNCH src=0
    [01:5004] Successfully handled bcast nodeids command.
    [01:5004] setting environment variable: <MPIEXEC_HOSTNAME> = <THOR>
    [01:5004] env: PMI_SIZE=1
    [01:5004] env: PMI_KVS=bac843f7-c7bc-46dc-8bcf-fe59210de120
    [01:5004] env: PMI_DOMAIN=52809501-672f-4b79-8ba9-ac9c7cbe44e2
    [01:5004] env: PMI_HOST=localhost
    [01:5004] env: PMI_PORT=60297d8f-7a3c-415c-bd3f-b46d5b53f765
    [01:5004] env: PMI_SMPD_ID=1
    [01:5004] env: PMI_APPNUM=0
    [01:5004] env: PMI_NODE_IDS=s
    [01:5004] env: PMI_RANK_AFFINITIES=a
    [01:5004] searching for 'hostname' in workdir 'C:\Users\cyamin'
    [01:5004] searching for 'hostname' in path ''
    [01:5004] searching for 'hostname' in system path
    [01:5004] C:\Users\cyamin>CreateProcess(C:\Windows\SYSTEM32\hostname.exe hostname)
    [01:5004] env: PMI_RANK=0
    [01:5004] env: PMI_SMPD_KEY=0
    [01:5004] read 5 bytes from stdout
    [01:5004] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:5004] read 2 bytes from stdout
    [01:5004] posting command SMPD_STDOUT to parent, src=1, dest=0.
    [01:5004] ERROR: unable to post a read on stdout context, error 109.
    [01:5004] process_id=0 process refcount == 1, stdout closed.
    [01:5004] reading failed, assuming stderr is closed. error 0xc000014b
    [01:5004] process_id=0 process refcount == 0, stderr closed.
    [01:5004] process_id=0 process refcount == 0, waiting for the process to finish exiting.
    [01:5004] creating an exit command for rank 0, pid 16352, exit code 0.
    [01:5004] posting command SMPD_EXIT to parent, src=1, dest=0.
    [01:5004] Handling cmd=SMPD_STDOUT result
    [01:5004] cmd=SMPD_STDOUT result will be handled locally
    [01:5004] Handling cmd=SMPD_STDOUT result
    [01:5004] cmd=SMPD_STDOUT result will be handled locally
    [01:5004] Handling cmd=SMPD_EXIT result
    [01:5004] cmd=SMPD_EXIT result will be handled locally
    [01:5004] handling command SMPD_CLOSE src=0
    [01:5004] sending 'closed' command to parent context
    [01:5004] posting command SMPD_CLOSED to parent, src=1, dest=0.
    [01:5004] Handling cmd=SMPD_CLOSED result
    [01:5004] cmd=SMPD_CLOSED result will be handled locally
    [01:5004] smpd manager successfully stopped listening.
    [01:5004] SMPD exiting with error code 0.
    

    Regards,

    Costas

    Wednesday, March 23, 2016 9:19 AM
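
The "Decoding stdout/stderr buffer" lines in the mpiexec trace carry the remote process output as hexadecimal text. Assuming the payload is plain ASCII encoded as hex, which matches the `ATLAS` that mpiexec prints right after decoding, it can be read back with a sketch like this (`decode_smpd_buffer` is a hypothetical helper name):

```python
def decode_smpd_buffer(hex_payload: str) -> str:
    """Decode a hex-encoded stdout/stderr buffer from an mpiexec -d 3 trace."""
    # Assumption: the buffer holds ASCII text, as it does for `hostname` output.
    return bytes.fromhex(hex_payload).decode("ascii")


print(repr(decode_smpd_buffer("41544C4153")))  # 'ATLAS'
print(repr(decode_smpd_buffer("0D0A")))        # '\r\n'
```

The two buffers together are the 5 bytes and 2 bytes that the remote trace reports reading from stdout: the host name followed by a carriage return and line feed.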
  • Hi Costas,

    If you now stop the "smpd -d 3" daemon on the remote host and restart the msmpilaunchsvc with "sc start msmpilaunchsvc" and rerun the same mpiexec command, does it work?

    Wednesday, March 23, 2016 8:11 PM
  • Hi Anh,

    I did as you said and it seems to work. Here is the output:

    Microsoft Windows [Version 6.3.9600]
    (c) 2013 Microsoft Corporation. All rights reserved.
    
    C:\Users\cyamin>mpiexec -d 3 -host atlas -n 1 hostname
    [00:21912] host tree:
    [00:21912]  host: atlas, parent: 0, id: 1
    [00:21912] mpiexec started smpd manager listening on port 65483
    [00:21912] Needs user password to start SMPD manager.
    
    Enter Password for CNWAY\cyamin:
    Save Credentials[y|n]? y
    [00:21912] THOR posting a re-connect to atlas:63955 in left child context.
    [00:21912] Authentication completed. Successfully obtained Context for Client.
    [00:21912] Authorization completed.
    [00:21912] version check complete, using PMP version 2.
    [00:21912] posting command SMPD_COLLECT to left child, src=0, dest=1.
    [00:21912] Handling cmd=SMPD_COLLECT result
    [00:21912] cmd=SMPD_COLLECT result will be handled locally
    [00:21912] Finished collecting hardware summary.
    [00:21912] posting command SMPD_STARTDBS to left child, src=0, dest=1.
    [00:21912] Handling cmd=SMPD_STARTDBS result
    [00:21912] cmd=SMPD_STARTDBS result will be handled locally
    [00:21912] start_dbs succeeded, kvs_name: '35a69e75-e53e-411f-b93f-a71549695481', domain_name: '5936c8bc-8677-43ff-bcd1-5d20b34791d0'
    [00:21912] creating a process group of size 1 on node 0 called 35a69e75-e53e-411f-b93f-a71549695481
    [00:21912] launching the processes.
    [00:21912] posting command SMPD_LAUNCH to left child, src=0, dest=1.
    [00:21912] Handling cmd=SMPD_LAUNCH result
    [00:21912] cmd=SMPD_LAUNCH result will be handled locally
    [00:21912] successfully launched process 0
    [00:21912] root process launched, starting stdin redirection.
    [00:21912] posting command SMPD_STDIN to left child, src=0, dest=1.
    [00:21912] Handling cmd=SMPD_STDIN result
    [00:21912] cmd=SMPD_STDIN result will be handled locally
    [00:21912] Authentication completed. Successfully obtained Context for Client.
    [00:21912] Authorization completed.
    [00:21912] handling command SMPD_STDOUT src=1
    [00:21912] Handling SMPD_STDOUT
    [00:21912] Decoding stdout/stderr buffer 41544C4153
    ATLAS[00:21912] handling command SMPD_STDOUT src=1
    [00:21912] Handling SMPD_STDOUT
    [00:21912] Decoding stdout/stderr buffer 0D0A
    
    [00:21912] handling command SMPD_EXIT src=1
    [00:21912] saving exit code: rank 0, exitcode 0, pg <35a69e75-e53e-411f-b93f-a71549695481>
    [00:21912] process exited without calling init.
    [00:21912] process exited before anyone has called init.
    [00:21912] last process exited, tearing down the job tree.
    [00:21912] posting command SMPD_CLOSE to left child, src=0, dest=1.
    [00:21912] Handling cmd=SMPD_CLOSE result
    [00:21912] cmd=SMPD_CLOSE result will be handled locally
    [00:21912] handling command SMPD_CLOSED src=1
    [00:21912] closed command received from left child.
    [00:21912] smpd manager successfully stopped listening.
    
    C:\Users\cyamin>

    I tried to run a computation right after that and it worked. Maybe I had not used the "-pwd" and "-savecreds" options correctly in my launch script. Until now I had never passed the credentials manually, so authentication could be the reason for the failure.

    Thanks for the help,

    Regards,

    Costas


    Thursday, March 24, 2016 2:16 PM
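
For a launch script, one way to keep the credential options from being misplaced is to assemble the mpiexec argument list in one place. The sketch below is an illustration only: `build_mpiexec_command` is a hypothetical helper, the `-pwd` and `-savecreds` flag names are the ones quoted in the post above, and their exact placement rules should be checked against `mpiexec -help` for the installed MS-MPI version.

```python
def build_mpiexec_command(host, nprocs, program, password=None, save_creds=False):
    """Build an mpiexec argument list; credential flags are added only if a password is given."""
    cmd = ["mpiexec", "-host", host, "-n", str(nprocs)]
    if password is not None:
        # Flag names as quoted in the thread; treated here as an assumption.
        cmd += ["-pwd", password]
        if save_creds:
            cmd.append("-savecreds")
    # The program to launch goes last, after all mpiexec options.
    cmd.append(program)
    return cmd


# Same test command as in the thread, with a placeholder password.
print(build_mpiexec_command("atlas", 1, "hostname", password="<password>", save_creds=True))
```

The resulting list could be handed to `subprocess.run`. Note that a password passed on the command line may be visible to other users in process listings, so the interactive prompt shown in the trace above can be the safer way to save credentials once.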
  • Hi Costas,

    I'm glad it worked out. I think if you check under Event Viewer you will probably see some authentication errors for the services from when you were running under the launch script and did not provide the password.

    Anh

    Thursday, March 24, 2016 5:46 PM