HPL performance degradation when run over all cores in a single node

  • Question

  • Hi All
I've been running some Linpack jobs to compare single-node performance between 2003 CCS and HPC Server 2008, and have obtained some odd results. Output is much as expected when running on fewer than the maximum number of cores, but when run over all cores in a node the results from HPC Server 2008 are very poor indeed. I've included result data below for reference.
For info, I'm running HPL v2 compiled against Intel MKL and the HPC Server 2008 MPI libraries.
    If anyone has seen this type of behaviour and has hints or tips I'd appreciate some assistance.
    Thanks in advance
    Dan

Results are in GFlops; the 2003 CCS numbers are for reference (hardware is identical: dual-socket, 8-core server, 16 GB RAM).
6 cores (P=2 Q=3)
N        2003 CCS   HPC 2008
280      0.28       0.35
2800     8.78       26.46
5600     14.05      34.62
11200    19.88      42.28
22400    32.64      47.95
42344    47.56      51.37

7 cores (P=1 Q=7)
N        2003 CCS   HPC 2008
280      0.41       0.49
2800     8.71       31.73
5600     14.30      41.49
11200    19.49      50.20
22400    29.89      56.48
42344    42.25      60.32

8 cores (P=2 Q=4)
N        2003 CCS   HPC 2008
280      0.06       0.003094
2800     7.92       0.22
5600     13.01      0.74
11200    18.17      1.09
22400    28.60      3.08
42344    41.46      6.43
    • Moved by parmita mehtaModerator Thursday, June 25, 2009 5:37 PM job scheduler related (From:Windows HPC Server Deployment, Management, and Administration)
    Thursday, June 18, 2009 9:26 AM
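For context, a run like the ones above is typically launched with mpiexec, with the process grid coming from HPL.dat; a minimal sketch, assuming a binary named xhpl.exe (the actual file names aren't given in the thread):

```shell
rem Grid section of HPL.dat for the 8-core case (P=2, Q=4) - illustrative:
rem   1        # of process grids (P x Q)
rem   2        Ps
rem   4        Qs
rem Launch one MPI rank per core on the node:
mpiexec -n 8 xhpl.exe
```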

Answers

  • Doh, got it! Using the -affinity mpiexec flag makes everything run a whole lot sweeter.

    8 cores (P=2 Q=4)
    N                    HPC 2008
    280                   0.48
    2800                 30.43
    5600                 41.01
    11200               52.26
    22400               60.73
    42344               66.11
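In command-line terms, the fix amounts to adding -affinity to the launch line; a sketch, assuming a binary named xhpl.exe:

```shell
rem -affinity makes MS-MPI pin each rank to a core, so ranks stop migrating
rem between cores and contending under the Windows scheduler
mpiexec -n 8 -affinity xhpl.exe
```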

    Note to self - the Windows HPC Team blog contains all sorts of useful information :)

    Thanks for the help anyway
    Dan
    Thursday, June 25, 2009 11:18 AM

All replies

  • Hi Dan

That is indeed really odd. The first thing I would look at is whether something else untoward is going on with the machine whilst that test is running (the system paging to disk for some reason, maybe). You could probably get a reasonable feel for this through Task Manager / Resource Monitor, but if you want a more detailed view you could run the xperf tools on the machine whilst the test is running:

Download the Windows Performance Toolkit from http://msdn.microsoft.com/en-us/performance/default.aspx and install it (to c:\xperf, say).
From an admin command prompt:
xperf -on latency
<run your scenario>
xperf -stop -d trace.etl
xperf trace.etl

This should bring up a UI with a fairly detailed breakdown of what the machine was doing. What you are looking for is 100% CPU in the HPL processes and no other significant on-box activity (disk I/O, hard faults, etc.).

    cheers
    jeff

    Friday, June 19, 2009 2:46 PM
  • Hi Jeff 
    I've not had a chance to carry out your suggestions yet, but I thought I'd say thanks for the response. I'll hopefully be able to move this forward this week so I'll let you know how it goes.
    Cheers
    Dan
    Tuesday, June 23, 2009 9:00 AM
  • Hmmm, all looks normal.
Certainly no processes are eating processor time other than HPL, disk I/O is minimal, and there are no hard faults. Pretty much as I'd expect to see.
    I'm going to try some other tests to see how the node behaves when running alternative codes, and run HPL against some different hardware. If it's not a problem when real calculations are being carried out I'm not too bothered, but it would be nice to get to the bottom of what's going on...
    thanks
    Dan
    Thursday, June 25, 2009 9:31 AM
• I'm glad you figured it out, Dan!

As a side note, we actually expect significantly better performance on Windows HPC Server 2008, since 2008's MS-MPI stack includes some very significant optimizations to the shared-memory path (which has a huge impact when running multiple MPI ranks on a single node).

    Thanks!
Josh
    Thursday, June 25, 2009 6:43 PM
    Moderator
  • Hi Dan

Yes, I'd like to get to the bottom of it too. What type of machines are these? Is there anything special about the setup?

The next thing I would do is run with ETW tracing on during an 8-core run and see if something jumps out there. To do this:

1) Download the Windows Performance Toolkit from http://msdn.microsoft.com/en-us/performance/cc752957.aspx
2) Install it into some easily accessible folder on your node (I use c:\xperf)
3) Launch an admin command prompt and run xperf -on latency+drivers before your run starts
4) Repro the issue for a while (a couple of minutes, say)
5) Run xperf -stop -d trace.etl to dump the ETW log
6) Run xperf trace.etl to view what was happening during that time

What might be interesting is to see whether there were any obvious driver delays that might account for the slow perf, or whether anything shows up when you select an area of the CPU graph and right-click for the summary. If nothing obvious jumps out, you could upload the traces to us (at MS) and we could take a look at them too, if that would help.

    cheers
    jeff
    Friday, June 26, 2009 6:34 AM
  • Sorry, missed the fact you'd answered your own question, and that affinity fixed it...

It's actually interesting that affinity made that much of a difference, as it shouldn't have anything like the effect of improving things 10x. What might be interesting, if you have time for another experiment, is to run without affinity and monitor context switches per second. If this is absolutely astronomical (say > 1 million, or something silly), then that might be the root cause of your problem. I have seen this before when building Linpack against the wrong version of the Intel maths libraries (the multi-threaded version), for example.
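One way to watch that counter from the command line is typeperf, which ships with Windows (the thread doesn't name a tool; this is just one option):

```shell
rem Sample the system-wide context-switch rate once a second, 30 samples
typeperf "\System\Context Switches/sec" -si 1 -sc 30
```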

    cheers
    jeff
    Friday, June 26, 2009 7:25 AM
  • Hi Jeff, thanks for the follow up.
I've run a couple of additional tests and have taken advantage of the Lizard download. What's clear is that my Linpack build is the likely cause. I'm seeing an average of >1 million context switches/sec when running my original HPL build. When I repeat the job with the xhplmkl.exe build deployed as part of Lizard, context switches are much more reasonable (around 500) even when running without the affinity switch, and results are as expected. I don't think, however, that I've compiled against the multi-threaded version of Intel MKL.
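If a threaded MKL were the cause, the usual remedy is to link HPL against MKL's sequential threading layer; an illustrative Windows link line (library and object names are assumptions, not taken from the thread):

```shell
rem mkl_sequential.lib replaces mkl_intel_thread.lib, so MKL spawns no
rem OpenMP threads to compete with the MPI ranks for cores
link xhpl.obj mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib msmpi.lib
```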
At least the culprit has come to the fore, and unfortunately it's my compiler skills that seem to be lacking :)
    Thanks again for your assistance, it's much appreciated.
    Dan 

    Friday, June 26, 2009 2:43 PM