mpiexec.exe MPI Error in Azure Batch - Aborting: smpd on RD0003FF98EE42 failed to communicate with child smpd manager

  • Question

  • Hi,

    I am getting an error while running an MPI program using mpiexec.exe in an Azure Batch multi-instance task (Batch MPI).

    Below is the complete debug output. Any help would be appreciated.

    [00:1952] creating connect command to '100.104.58.44'
    [00:1952] posting command SMPD_CONNECT to left child, src=0, dest=1.
    [00:1952] host 100.104.42.32 is not connected yet
    [00:1952] Handling cmd=SMPD_CONNECT result
    [00:1952] cmd=SMPD_CONNECT result will be handled locally
    [00:1952] successful connect to 100.104.42.32.
    [00:1952] creating connect command for left node
    [00:1952] creating connect command to '100.104.36.14'
    [00:1952] posting command SMPD_CONNECT to left child, src=0, dest=2.
    [00:1952] host 100.104.58.44 is not connected yet
    [00:1952] Handling cmd=SMPD_CONNECT result
    [00:1952] cmd=SMPD_CONNECT result will be handled locally
    [00:1952] successful connect to 100.104.58.44.
    [00:1952] creating connect command for left node
    [00:1952] creating connect command to '100.104.40.33'
    [00:1952] posting command SMPD_CONNECT to left child, src=0, dest=3.
    [00:1952] host 100.104.36.14 is not connected yet
    [00:1952] Handling cmd=SMPD_CONNECT result
    [00:1952] cmd=SMPD_CONNECT result will be handled locally
    [00:1952] successful connect to 100.104.36.14.
    [00:1952] host 100.104.40.33 is not connected yet
    [00:1952] Handling cmd=SMPD_CONNECT result
    [00:1952] cmd=SMPD_CONNECT result will be handled locally
    [00:1952] successful connect to 100.104.40.33.
    [00:1952] posting command SMPD_COLLECT to left child, src=0, dest=1.
    [00:1952] posting command SMPD_COLLECT to left child, src=0, dest=2.
    [00:1952] posting command SMPD_COLLECT to left child, src=0, dest=3.
    [00:1952] posting command SMPD_COLLECT to left child, src=0, dest=4.
    [00:1952] posting command SMPD_COLLECT to left child, src=0, dest=5.
    [00:1952] Handling cmd=SMPD_COLLECT result
    [00:1952] cmd=SMPD_COLLECT result will be handled locally
    [00:1952] Handling cmd=SMPD_COLLECT result
    [00:1952] cmd=SMPD_COLLECT result will be handled locally
    [00:1952] Handling cmd=SMPD_COLLECT result
    [00:1952] cmd=SMPD_COLLECT result will be handled locally
    [00:1952] Handling cmd=SMPD_COLLECT result
    [00:1952] cmd=SMPD_COLLECT result will be handled locally
    [00:1952] Handling cmd=SMPD_COLLECT result
    [00:1952] cmd=SMPD_COLLECT result will be handled locally
    [00:1952] Finished collecting hardware summary.
    [00:1952] posting command SMPD_STARTDBS to left child, src=0, dest=1.
    [00:1952] Handling cmd=SMPD_STARTDBS result
    [00:1952] cmd=SMPD_STARTDBS result will be handled locally
    [00:1952] start_dbs succeeded, kvs_name: '24867516-e62e-4edb-b682-594de53e15c5', domain_name: '030367df-c5fd-4ccd-bcb6-0a4a56730821'
    [00:1952] creating a process group of size 5 on node 0 called 24867516-e62e-4edb-b682-594de53e15c5
    [00:1952] launching the processes.
    [00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=1.
    [00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=2.
    [00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=3.
    [00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=4.
    [00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=5.
    [00:1952] Handling cmd=SMPD_LAUNCH result
    [00:1952] cmd=SMPD_LAUNCH result will be handled locally
    [00:1952] successfully launched process 0
    [00:1952] root process launched, starting stdin redirection.
    [00:1952] Handling cmd=SMPD_LAUNCH result
    [00:1952] cmd=SMPD_LAUNCH result will be handled locally
    [00:1952] successfully launched process 1
    [00:1952] Handling cmd=SMPD_LAUNCH result
    [00:1952] cmd=SMPD_LAUNCH result will be handled locally
    [00:1952] successfully launched process 2
    [00:1952] Unable to get the stdin handle.
    [00:1952] stdin to mpiexec closed. sending stdin_close command.
    [00:1952] posting command SMPD_STDIN_CLOSE to left child, src=0, dest=1.
    [00:1952] Handling cmd=SMPD_STDIN_CLOSE result
    [00:1952] cmd=SMPD_STDIN_CLOSE result will be handled locally
    [00:1952] Handling cmd=SMPD_LAUNCH result
    [00:1952] cmd=SMPD_LAUNCH result will be handled locally
    [00:1952] successfully launched process 3
    [00:1952] Handling cmd=SMPD_LAUNCH result
    [00:1952] cmd=SMPD_LAUNCH result will be handled locally
    [00:1952] successfully launched process 4
    [00:1952] Authentication completed. Successfully obtained Context for Client.
    [00:1952] Authorization completed.
    [00:1952] handling command SMPD_ABORT src=1

    Aborting: smpd on RD0003FF98EE42 failed to communicate with child smpd manager
    [00:1952] smpd manager successfully stopped listening.

    Thanks,

    Kiran.

    Monday, April 30, 2018 6:37 AM

All replies

  • Is one of the MPI processes failing and exiting somehow? 
    Log:
    [00:1952] handling command SMPD_ABORT src=1

    Also, do you see any errors in the smpd logs (smpd -d 3)?

    Thursday, May 3, 2018 11:21 PM
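
To follow up on the question in the reply, the mpiexec debug output can be scanned mechanically for which smpd node sent SMPD_ABORT and whether every posted SMPD_LAUNCH was followed by a "successfully launched" line. The sketch below is a minimal Python helper written for this thread, not part of MS-MPI. The function name `summarize_mpiexec_log` and the regular expressions are assumptions based only on the line formats visible in the output above.

```python
import re

# Patterns inferred from the debug output in this thread (assumed formats,
# not an official MS-MPI log specification).
ABORT_RE = re.compile(r"handling command SMPD_ABORT src=(\d+)")
POSTED_RE = re.compile(
    r"posting command (SMPD_\w+) to left child, src=\d+, dest=(\d+)"
)
LAUNCHED_RE = re.compile(r"successfully launched process (\d+)")


def summarize_mpiexec_log(log_text):
    """Return abort sources and launch bookkeeping found in the log text."""
    abort_sources = [int(m.group(1)) for m in ABORT_RE.finditer(log_text)]
    launch_dests = [
        int(dest)
        for cmd, dest in POSTED_RE.findall(log_text)
        if cmd == "SMPD_LAUNCH"
    ]
    launched = [int(m.group(1)) for m in LAUNCHED_RE.finditer(log_text)]
    return {
        "abort_sources": abort_sources,   # src ids that sent SMPD_ABORT
        "launch_dests": launch_dests,     # dest ids that were sent SMPD_LAUNCH
        "launched": launched,             # process ranks reported as launched
    }


# A few lines copied from the output above, used as a small sample.
SAMPLE = """\
[00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=1.
[00:1952] posting command SMPD_LAUNCH to left child, src=0, dest=2.
[00:1952] successfully launched process 0
[00:1952] successfully launched process 1
[00:1952] handling command SMPD_ABORT src=1
"""

print(summarize_mpiexec_log(SAMPLE))
```

Run against the full output posted in the question, this should report one abort from src=1, five SMPD_LAUNCH commands posted, and five processes launched. That suggests the failure happened after the launch step rather than during it, which fits the question of whether one of the MPI processes exited on its own.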