Dispatching tasks takes a very long time

    Question

  • We recently upgraded our HPC Pack from 2008 R2 to 2016, and since then it takes unusually long until jobs actually run. Our application scenario is a bit unusual, because we use the cluster to drive a tiled display, i.e. there is normally exactly one job running on demand.

    The behaviour we see is that the cluster manager detects basically instantly that the resources are available and that the job can run. The job itself therefore changes to the Running state in a matter of seconds. However, the only task (an mpiexec) enters the Dispatching state and it takes several minutes until it is finally running. This is a major annoyance, because the users are actively waiting for the interactive job.

    Another unusual effect we observe is that the two cmd prompts (MPI and the application) do not open at the same time on all nodes; some take significantly longer than others. So the problem could be related to the network, although diagnostics such as the MPI ping test look good.

    The questions are:

    1. How would I diagnose this issue, i.e. are there log files that would let me see what the scheduler is actually doing and what takes so long?

    2. What is the expected time it should take to actually start the executable on an empty cluster?

    3. How can I make it faster (30 seconds would be OK, but not more than three minutes)?

    Thanks in advance,
    Christoph

    Monday, 25 June 2018 4:38 PM

All replies

  • It shouldn't take that long. Could you first examine the logs yourself?

    The scheduler-related logs are %CCP_Data%LogFiles\Scheduler\HpcScheduler_*.bin (the one with the largest number is an empty placeholder).

    The compute-node-side logs are under %CCP_Data%LogFiles\Scheduler\HpcNodeManager_*.bin.

    You can use "LogParser.exe" to convert the .bin files to plain text, or you can use this tool: https://hpconlineservice.blob.core.windows.net/logviewer/LogViewer.UI.application (you may need to install vcredist to use it).
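
    For example, to see which log files are currently being written, you can list them sorted by modification time (the paths assume the default %CCP_Data% layout; adjust them if your installation differs):

        rem Head node: scheduler logs, oldest first, newest last
        dir /od "%CCP_Data%LogFiles\Scheduler\HpcScheduler_*.bin"

        rem Compute node: node manager logs
        dir /od "%CCP_Data%LogFiles\Scheduler\HpcNodeManager_*.bin"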

    If you can't find any clue, please share the logs with us through hpcpack@microsoft.com.


    Qiufang Shi


    Tuesday, 26 June 2018 3:27 AM
  • That is helpful to know. However, I have two problems at the moment:

    1. It seems that the log is not written continuously, so I have to wait for hours until my test jobs become visible in the log viewer. Is it possible to speed up the persistence process?

    2. I am not completely sure what to look for. I think the logs confirm that the resource allocation happens practically instantly; I can see that the scheduler finds out within a second that the nodes are free.

    Furthermore, I was able to narrow down the issue a bit. The problem only seems to occur if two conditions are met: (i) the job uses mpiexec and (ii) the job uses an executable on a file share. Using this information, I was able to reduce the startup time from about 2 min to around 30 s with the following hack: if I add a node prep task that does a net use \\sharethatiuse, the smpd seems to start much, much faster. The time from the smpd's cmd window showing up to my executable starting does not change, though. How could the use of a network share affect the startup time of the smpd?
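
    For reference, here is roughly how such a node prep task can be added from the command line; the share and executable paths as well as the job ID are placeholders, and I am assuming the /type:NodePrep and /numnodes options of the job CLI (the same can of course be done in the cluster manager GUI):

        rem Create the job; assume "job new" reports ID 42
        job new /numnodes:4-4

        rem Node prep task that touches the share before mpiexec starts
        job add 42 /type:NodePrep net use \\sharethatiuse

        rem The actual MPI task, started from the share
        job add 42 mpiexec \\sharethatiuse\bin\app.exe

        job submit /id:42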

    Best regards,
    Christoph

    Thursday, 28 June 2018 6:43 PM
  • Hi Christoph,

      One question: how long does the task stay in the "Dispatching" state? If it is in Dispatching for minutes, then it is a scheduling issue; otherwise, it is an MS-MPI related issue.

      Second question: to narrow down the issue, could you try mpipingpong to check whether there is a problem? This small test will help isolate whether it is an infrastructure issue or something related to your application.

      And is your application started from a network share or from the local machine?

      The SMPD service in HPC Pack should be a long-running service, so I didn't understand your question about the "startup time of the smpd".


    Qiufang Shi

    Friday, 29 June 2018 3:18 AM
  • Hi Qiufang,

    Yes, the task is in "Dispatching" for minutes (the job already shows "Running"). Once the task is also "Running", it takes a few seconds for my executable to show up, which I think is normal.

    I have already run mpipingpong and all other network tests. The results look OK in my opinion:

    Lower Bound (usecs)   Upper Bound (usecs)   Number of links measured within this interval
    13,429                19,434                195
    19,434                25,438                  3
    25,438                31,443                  1
    31,443                37,447                  0
    37,447                43,452                  2
    43,452                49,456                  4
    49,456                55,461                  2
    55,461                61,465                  1
    61,465                67,470                  0
    67,470                73,474                  2

    Also, the other connectivity tests did not show any problem.

    The application is started from the network share itself, which is also the working directory. I have not yet tried to deploy it to the local disk, because it is a bit nasty to set up, but I will try when I find the time. And I just want to repeat: a net use node prep task finishes nearly instantly and speeds up the actual application a lot.

    Sorry for the confusion about the smpd. I meant the extra cmd window that shows up before my own application starts. I just looked at it again, and it says something about msmpisvc in the title bar. It takes very long until this window pops up; once it is there, my own application shows up after a few seconds.

    Best regards,
    Christoph

    Friday, 29 June 2018 8:50 AM
  • I'm checking with the MS-MPI people to find out what is wrong with smpd and where it gets stuck. I will update this thread once I have any clue.

    Qiufang Shi

    Monday, 2 July 2018 12:57 AM
  • Christoph,

    You may want to use a simple executable to see whether the long startup time is caused by network share access time: run the same executable once from the network share and once copied to every compute node. Do you see any difference in the "Dispatching" time?

    -thanks, Anna

    Tuesday, 3 July 2018 7:18 PM
  • Hi all,

    I had to wait some days to get an allocation to run the tests. I think I found something important using an extremely simple MPI programme that only prints its rank. The takeaway is that the location of the executable (share or local SSD) and the location of the output files (stream redirection) are irrelevant: it always takes 2-3 min to dispatch.

    I then accidentally found out that the real problem is the HPC_ATTACHTOCONSOLE environment variable. We normally set it for all jobs, because we are using the GPU for rendering, but I forgot it at some point during testing. If I do not set HPC_ATTACHTOCONSOLE, the job starts almost instantly under any condition. Can you make any sense of this?

    The really weird thing is the following: if I add a node prep task to the job that invokes net use \\sharethatisnotused, the job is fast, even with HPC_ATTACHTOCONSOLE set. This effect is reproducible, i.e. it is not a one-time effect. So the situation now is (a command-line sketch follows the list):

    HPC_ATTACHTOCONSOLE set: immediate allocation of nodes, task is in the "Dispatching" state for 2-3 minutes.

    HPC_ATTACHTOCONSOLE set + net use \\someshare as node prep: immediate allocation of nodes, task is in "Dispatching" for up to 30 s.

    HPC_ATTACHTOCONSOLE not set: the task starts almost instantly.
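
    To make the difference concrete, here is a sketch of the slow and the fast variant as they could be submitted from the command line; the executable path and the job IDs are placeholders, and I am assuming the /env and /numnodes options of the job CLI:

        rem Slow variant: the task sets HPC_ATTACHTOCONSOLE, no node prep task (assume job ID 50)
        job new /numnodes:4-4
        job add 50 /env:HPC_ATTACHTOCONSOLE=TRUE mpiexec \\fileserver\display\app.exe
        job submit /id:50

        rem Fast variant: identical job, but without the environment variable (assume job ID 51)
        job new /numnodes:4-4
        job add 51 mpiexec \\fileserver\display\app.exe
        job submit /id:51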

    I can somehow live with the net use hack, but the software can obviously be much faster, and I would therefore be very happy if we could be similarly fast with our interactive GPU jobs. Do you have any idea what I could try?

    Best regards,
    Christoph 

    Friday, 6 July 2018 2:42 PM
  • Just one additional observation I have made: the net use is not really required; the only important thing is that there is a node prep task that does anything at all. An echo "Test" has the same effect.
    Friday, 6 July 2018 2:58 PM
  • Hi Christoph,

      Do you need your job to use both the GPU and MPI? If not, you shouldn't set HPC_ATTACHTOCONSOLE for an MPI job.

      We will take a look at the issue you described and report back to this thread.

      


    Qiufang Shi

    Wednesday, 11 July 2018 3:19 AM
  • Hi Qiufang,

    Unfortunately, this is exactly the scenario we need. As I wrote, we are running interactive applications on a large, tiled display: I need the GPU for rendering and I need MPI for the network communication. If you have any suggestions for improving this scenario, that would be greatly helpful; the scheduler itself is extremely fast, only this very combination that we need is slow ...

    Best regards,
    Christoph

    Wednesday, 11 July 2018 1:04 PM
  • Hi Christoph,

      I understand now. We will have a local repro first.

      And from your description, HPC Pack 2008 R2 wasn't a problem, right? Only HPC Pack 2016 is?


    Qiufang Shi

    Thursday, 12 July 2018 4:09 AM
  • Hi Christoph,

      Today I tested with two nodes using "mpiexec mpipingpong" with "HPC_ATTACHTOCONSOLE=TRUE" and the job finished within seconds. All looks good to me. I also tested with "HPC_CREATECONSOLE=TRUE" and the job finished within seconds as well.

      Could you therefore reach me through hpcpack@microsoft.com and do the following:

    1. Tell me the HPC Pack version

    2. Create a job with job unit type "Node" and a resource request of 2-2 nodes, and add a task with 2-2 nodes, the environment variable "HPC_ATTACHTOCONSOLE=TRUE" and the command line: mpiexec mpipingpong.exe (a CLI sketch follows this list)

    3. Check whether the job takes 2-3 minutes because the task is stuck at "Dispatching"

    4. If yes, share the NodeManager logs from those two nodes (%CCP_DATA%LOGFILES\Scheduler\HpcNodeManager_*.bin)
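
    A rough CLI equivalent of steps 2 and 3; the job ID 100 stands for whatever "job new" returns, and the /env and /numnodes options of "job add" are assumed to be available in your version (please check with "job add /?"):

        rem 2-2 node job with a single 2-2 node task running mpipingpong with the console variable set
        job new /numnodes:2-2
        job add 100 /numnodes:2-2 /env:HPC_ATTACHTOCONSOLE=TRUE mpiexec mpipingpong.exe
        job submit /id:100

        rem Then watch whether the task stays in "Dispatching" for 2-3 minutes
        task view 100.1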


    Qiufang Shi

    Friday, 13 July 2018 3:59 AM
  • Yes, only HPC Pack 2016 is affected. 2008 R2 was not as fast as 2016 is without the interactive session, but it was clearly below 30 s.
    Friday, 13 July 2018 11:42 AM
  • Hi Qiufang,

    I will try that as soon as possible, but one question in advance: how long do I have to wait until the NodeManager logs have been written to disk? This seems to take some time.

    Best regards,
    Christoph

    Friday, 13 July 2018 11:43 AM
  • I cannot reproduce the issue with mpipingpong. I am now investigating what the differences between this test and our actual applications are. Once I have any meaningful information, I will inform you.

    Best regards,
    Christoph

    Friday, 13 July 2018 3:39 PM
    The logs are written to disk almost in real time.

    Qiufang Shi

    Monday, 16 July 2018 4:22 AM