Local machine 10x faster than HPC cluster!!

  • Question

  • Hello, I have set up a small cluster with 1 head node and 3 compute nodes. My client machine runs Windows Server 2016, and I use it to submit Workbook offloading jobs. My problem: the HPC cluster is extremely slow. If I run the job on my local machine, it runs about 10 times faster than on the HPC cluster! The configuration of my nodes is as follows:

    Headnode: 2vCPU and RAM 8GB

    Compute nodes: 1vCPU and RAM 4GB each

    I suspect the issue could be with the communication between the nodes and the network, or it could be something entirely different. Can someone please help?

    Thanks in advance!

    Friday, August 25, 2017 2:32 PM

Answers

  • Looking at the symptoms:

    - 1 hr 30 mins on the client and 2 hr 30 mins on a compute node when started manually

    - The job had not finished after 1 hr 30 mins on the HPC cluster

    To identify the bottleneck when running on the cluster, you need to do more testing around the following:

    1. Whether you can finish the job in 2 hr 30 mins on the cluster with only one compute node. This will tell you whether the HPC infrastructure adds overhead. If it takes much longer than 2 hr 30 mins, the possible bottlenecks are:

    - Reading from and writing to the share is a big overhead. When a workbook runs in the HPC cluster, the compute node launches the spreadsheet from the share and writes back to the share.

    - The network between the share and the compute node is poor, so the compute node spends a lot of time reading and writing.

    2. "With time it slowed down to a crawl, just 2-3 iterations per second."

    - Understanding why this happens may be the core of the problem. Are there any dependencies in your execution, for example an iteration that requires the output of another calculation? That would dramatically slow down your calculation.

    - If possible, you can share your workbook with us and we can take a look (hpcpack@microsoft.com; you can of course remove sensitive data before sending it to us).
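
One rough way to check the share read/write overhead mentioned above is to time a copy of the workbook from the share to a local folder and back. This is only a sketch under assumptions: `time_round_trip` is a hypothetical helper, not part of HPC Pack, and a plain file copy only approximates what Excel does when it opens and saves a workbook from a share.

```python
import shutil
import time
from pathlib import Path


def time_round_trip(source: Path, work_dir: Path) -> dict:
    """Time copying `source` into `work_dir`, then copying it back next to `source`.

    Hypothetical helper: `source` would be the workbook on the share and
    `work_dir` a local scratch folder on the compute node.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    local_copy = work_dir / source.name
    returned_copy = source.with_name(f"{source.stem}.roundtrip{source.suffix}")

    start = time.perf_counter()
    shutil.copyfile(source, local_copy)  # read from the share
    read_seconds = time.perf_counter() - start

    start = time.perf_counter()
    shutil.copyfile(local_copy, returned_copy)  # write back to the share
    write_seconds = time.perf_counter() - start

    returned_copy.unlink()  # remove the temporary copy from the share
    return {
        "bytes": local_copy.stat().st_size,
        "read_seconds": read_seconds,
        "write_seconds": write_seconds,
    }
```

Running it on a compute node against the real share path, and then again against a purely local folder, gives two sets of numbers to compare. If the share timings are far larger, the share or the network is a likely suspect.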


    Qiufang Shi

    • Marked as answer by KMLN Sunday, September 3, 2017 9:23 AM
    Tuesday, August 29, 2017 2:25 AM

All replies

  • Hi KMLN,

    What is the hardware configuration of your client? For Workbook offloading, the network shouldn't be a big bottleneck. Looking at your compute node configuration, 1 vCPU is really small.

    You should have a similar hardware configuration on the compute nodes and your client machine to make a fair comparison. Please also take into account the time it takes to launch Excel.

    You should run this test: log on to one of your compute nodes, start Excel, and kick off the same workbook calculation to see how long it takes. Then compare that with the time it takes when it is run through the scheduler.
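
To keep the comparison fair, it can help to record the Excel launch time separately from the calculation time. The following is a minimal sketch, assuming you time the phases yourself (by hand or from a script); `PhaseTimer` and the phase names are hypothetical and not part of HPC Pack or Excel.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase (e.g. launch, calculate)."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            # Add to any time already recorded under the same phase name.
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def share_of_total(self, name):
        """Fraction of the total recorded time spent in `name`."""
        total = sum(self.seconds.values())
        return self.seconds[name] / total if total else 0.0


# Usage sketch (the bodies are placeholders for the real steps):
# timer = PhaseTimer()
# with timer.phase("launch"):
#     ...  # start Excel and open the workbook
# with timer.phase("calculate"):
#     ...  # run the workbook calculation
# print(timer.seconds, timer.share_of_total("launch"))
```

If the launch phase is a large share of the total on the compute node but not on the client, part of the gap is startup cost rather than calculation speed.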


    Qiufang Shi

    Monday, August 28, 2017 2:33 AM
  • Thanks Qiufang - My client machine is 2vCPU, 7.5 GB RAM.

    I ran my spreadsheet on one of the compute nodes locally. It is a bit slower than on my client machine, but not slow enough to explain the slowdown of the entire cluster.

    I agree the configuration of my compute nodes is low, but I thought that would be enough to run a simple spreadsheet.

    The calculations on HPC started out well, around 60 iterations per second, but with time it slowed down to a crawl, just 2-3 iterations per second.

    In summary, the job on my client machine took 1 hr 30 mins; it took around 2 hr 30 mins on a compute node and the job was still not complete after 1 hr and 30 mins on the HPC cluster when I had to cancel it.
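
To put the timings above side by side, here is a small back-of-the-envelope calculation. The timings and iteration rates are the ones reported in this thread. The "ideal" cluster time is an assumption: it supposes the iterations are independent, split evenly across the 3 compute nodes, and that the scheduler and share add no overhead, none of which is confirmed here.

```python
CLIENT_MINUTES = 90    # 1 hr 30 mins on the client
NODE_MINUTES = 150     # 2 hr 30 mins on one compute node, started manually
COMPUTE_NODES = 3

# How much slower a single compute node is than the client.
node_slowdown = NODE_MINUTES / CLIENT_MINUTES

# Best case for the cluster, ASSUMING perfectly parallel work and zero overhead.
ideal_cluster_minutes = NODE_MINUTES / COMPUTE_NODES

# Throughput drop observed on the cluster: 60 iterations/s down to 2-3.
drop_factor_low = 60 / 3
drop_factor_high = 60 / 2

print(f"compute node vs client: {node_slowdown:.2f}x slower")
print(f"ideal 3-node time: {ideal_cluster_minutes:.0f} mins")
print(f"throughput drop: {drop_factor_low:.0f}x to {drop_factor_high:.0f}x")
```

A 1.67x gap between client and compute node is in line with the smaller hardware, but a 20x to 30x drop in iterations per second is not, which suggests something other than raw CPU speed is slowing the cluster run.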

    Monday, August 28, 2017 12:18 PM
  • Hi Qiufang - Thanks for your reply. I made certain changes to my spreadsheet and, most importantly, saved it in the binary format (.xlsb), and that seems to have resolved all the issues. I still don't know why the HPC cluster was slowing down, but I am glad I am no longer facing the issues. Many thanks for your help!
    Sunday, September 3, 2017 9:23 AM