HPC cluster very slow

    Question

  • Hello, I have set up a small cluster with 1 head node and 3 compute nodes. I use a Windows Server 2016 machine to submit Workbook offloading jobs. My problem: the HPC cluster is extremely slow. If I run the job on my local machine, it runs faster than on the cluster.

    I think it could be because my nodes are not powerful enough (the head node has 1 vCPU and 2 GB RAM; the compute nodes have 1 GB RAM each). My client machine has 2 vCPUs and 4 GB RAM. Does anyone here think I need to increase the RAM of my nodes?

    I also suspect that the HPC cluster is not using all the nodes to run the jobs. I submit the job without any job template, assuming that by default it will use all the nodes. Is there any way to check that?

    There could well be other reasons that I haven't thought about. Any help to make the cluster run faster will be greatly appreciated! Thanks in advance!


    • Edited by KMLN Thursday, August 17, 2017 11:00 AM Removed unnecessary code that automatically appears
    Thursday, August 17, 2017 10:58 AM
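One way to make "slower than my local machine" concrete is to measure throughput (calculations completed per second) for both a local run and a cluster run. The sketch below assumes you can record one completion timestamp, in seconds, each time a calculation finishes; how you capture those timestamps is an assumption and is not something described in this thread. The function name throughput_per_window is hypothetical.

```python
from typing import List, Sequence


def throughput_per_window(timestamps: Sequence[float], window: float) -> List[float]:
    """Return completed calculations per second for consecutive time windows.

    timestamps: completion times in seconds (any order), e.g. time.time()
        values recorded whenever one calculation finished (assumed input).
    window: window length in seconds.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    # Number of windows needed to cover the run from first to last completion.
    n_windows = int((ordered[-1] - start) // window) + 1
    counts = [0] * n_windows
    for t in ordered:
        # Assign each completion to the window it falls into.
        counts[int((t - start) // window)] += 1
    # Convert counts to a per-second rate.
    return [c / window for c in counts]


if __name__ == "__main__":
    # Synthetic data: 10 calculations in the first second, then 2 per second.
    ts = [i * 0.1 for i in range(10)] + [1.0 + i * 0.5 for i in range(6)]
    print(throughput_per_window(ts, window=1.0))  # [10.0, 2.0, 2.0, 2.0]
```

Comparing the lists from a local run and a cluster run shows which one is actually faster, and a falling rate within a single run shows a slowdown like the one reported later in the thread. The last window is usually only partly filled, so its rate tends to read low.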

Answers

  • hi KMLN,

      First, 1 vCPU and 2 GB RAM is too small for production; it may only be feasible for dev/test.

      By default, the offload job uses *-* for resources, so the scheduler automatically calculates how many resources the job needs. While the job is running, you can check its Current Allocation to see how many resources are allocated to it.


    Qiufang Shi

    • Marked as answer by KMLN Friday, August 18, 2017 10:58 AM
    Friday, August 18, 2017 9:19 AM

All replies

  • Any thoughts anyone?
    Friday, August 18, 2017 8:41 AM
  • OK, great. I increased the head node to 2 vCPUs and 8 GB RAM and the compute nodes to 2 vCPUs and 4 GB RAM each. The performance seems to have improved significantly. Thanks, Qiufang!
    Friday, August 18, 2017 10:58 AM
  • Hi Qiufang, increasing the vCPUs and RAM did resolve the problem, but now I am facing another one. The job starts off well, but over time it slows to a tenth of its starting speed. Can you share some thoughts on what the issue could be? I have checked CPU utilisation: the head node is at 25% and the compute nodes are at 10%. Even the client machine is only around 30% utilised.

    Thanks in advance!

    Wednesday, August 23, 2017 1:11 PM
  • "But with time it slows to a 10th of the starting speed": how did you measure this?

    And while the job is executing, how many instances of Excel do you see, and how much CPU are they consuming?

    And could you share what the client side is doing while the job is running?


    Qiufang Shi

    Thursday, August 24, 2017 2:04 AM
  • I just noticed it: initially my model was doing approximately 10 calculations per second, but later only 1 or 2 per second.

    Can you please explain how to find out "how many instances of Excel you are seeing and how much CPU they consume"?

    Similarly, I didn't understand what you mean by "what it is doing at the client side when the job is running".

    Friday, August 25, 2017 12:04 AM
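On the "how many instances of Excel" question: on a Windows compute node, the built-in tasklist command can list them, for example tasklist /FI "IMAGENAME eq EXCEL.EXE" /FO CSV /NH, which prints one CSV row per matching process. Task Manager (Details tab) shows the same processes together with their CPU usage. Below is a minimal sketch, assuming Python 3.7+ is installed on the node; the helper names are hypothetical. tasklist in this form reports memory but not CPU percentage, so CPU load still has to be read from Task Manager or Performance Monitor.

```python
import csv
import io
import subprocess
import sys
from typing import List, Tuple


def parse_tasklist_csv(output: str, image_name: str = "EXCEL.EXE") -> List[Tuple[int, str]]:
    """Return (pid, mem_usage) for each `tasklist /FO CSV /NH` row whose
    image name matches image_name (case-insensitive).

    Lines that are not process rows, such as tasklist's "INFO: No tasks are
    running..." message or blank lines, are skipped.
    """
    matches = []
    for row in csv.reader(io.StringIO(output)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 5 and row[0].strip().lower() == image_name.lower():
            matches.append((int(row[1]), row[4]))
    return matches


def excel_processes_on_this_machine() -> List[Tuple[int, str]]:
    """Run tasklist (Windows only) and return the EXCEL.EXE processes it lists."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq EXCEL.EXE", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_csv(result.stdout)


if __name__ == "__main__" and sys.platform == "win32":
    procs = excel_processes_on_this_machine()
    print(f"{len(procs)} Excel instance(s) running")
    for pid, mem in procs:
        print(f"  PID {pid}: {mem}")
```

Running it on each compute node while the job executes gives a quick per-node count; the node name to run it on, and whether Excel is the process doing the work on your nodes, are assumptions to confirm against your own setup.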