HPL performance degradation when run over all cores in a single node

Question
-
Hi all,

I've been running some Linpack jobs to compare single-node performance between 2003 CCS and HPC Server 2008, and have obtained some odd results. Output is much as expected when running over fewer than the maximum number of cores, but when run over all cores in a node the results from HPC Server 2008 are very poor indeed. I've included result data below for reference.

For info, I'm running HPL V2 compiled against Intel MKL and the HPC Server 2008 MPI libraries.

If anyone has seen this type of behaviour and has hints or tips, I'd appreciate some assistance.

Thanks in advance
Dan

Results are in GFlops; the 2003 CCS numbers are for reference (hardware is identical: dual-socket, 8-core server, 16GB RAM).

6 cores (P=2 Q=3)
N        2003 CCS    HPC 2008
280      0.28        0.35
2800     8.78        26.46
5600     14.05       34.62
11200    19.88       42.28
22400    32.64       47.95
42344    47.56       51.37

7 cores (P=1 Q=7)
N        2003 CCS    HPC 2008
280      0.41        0.49
2800     8.71        31.73
5600     14.30       41.49
11200    19.49       50.20
22400    29.89       56.48
42344    42.25       60.32

8 cores (P=2 Q=4)
N        2003 CCS    HPC 2008
280      0.06        0.003094
2800     7.92        0.22
5600     13.01       0.74
11200    18.17       1.09
22400    28.60       3.08
42344    41.46       6.43
- Moved by parmita mehta (Moderator) Thursday, June 25, 2009 5:37 PM, job scheduler related (From: Windows HPC Server Deployment, Management, and Administration)
Thursday, June 18, 2009 9:26 AM
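For anyone reproducing these numbers: HPL derives its GFlops figure from the standard LU-factorization operation count, (2/3)N^3 + 2N^2, divided by wall time. A quick Python sketch (the timing below is hypothetical, chosen only to land in the range reported above):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Rate as HPL reports it: (2/3*N^3 + 2*N^2) flops over the wall time."""
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return flops / seconds / 1e9

# A hypothetical wall time of ~985 s for the largest problem size above
# lands in the ~51 GFlops range seen on the 6-core HPC 2008 run.
print(round(hpl_gflops(42344, 985.0), 2))
```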
Answers
-
Doh, got it! Using the -affinity mpiexec flag makes everything run a whole lot sweeter.

8 cores (P=2 Q=4)
N        HPC 2008
280      0.48
2800     30.43
5600     41.01
11200    52.26
22400    60.73
42344    66.11

Note to self: the Windows HPC Team blog contains all sorts of useful information :)

Thanks for the help anyway
Dan
- Marked as answer by Josh Barnard (Moderator) Thursday, June 25, 2009 6:41 PM
Thursday, June 25, 2009 11:18 AM
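For reference, the working launch would look something like `mpiexec -affinity -n 8 xhpl.exe` (the binary name is illustrative; -affinity is the MS-MPI flag named in the answer). Conceptually, one-rank-per-core pinning amounts to giving each rank its own single-bit affinity mask; a minimal Python sketch of that idea:

```python
def one_per_core_masks(n_ranks: int) -> list[int]:
    """One affinity bitmask per rank: rank i is allowed only on logical core i."""
    return [1 << i for i in range(n_ranks)]

# Print the mask each of the 8 ranks would be pinned to.
for rank, mask in enumerate(one_per_core_masks(8)):
    print(f"rank {rank}: mask {mask:#04x}")
```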
All replies
-
Hi Dan
That is indeed really odd. The first thing I would look at is whether something else untoward is going on with the machine whilst that test is running (the system paging to disk for some reason, maybe). You could probably get a reasonable feel for this through Task Manager / Resource Monitor, but if you want a more detailed view you could run the xperf tools on the machine whilst the test is running:
download the Windows Performance Toolkit from http://msdn.microsoft.com/en-us/performance/default.aspx, and install it (to c:\xperf, say)
from an admin command prompt:
xperf -on latency
< run your scenario>
xperf -stop -d trace.etl
xperf trace.etl
This should bring up a UI with a fairly detailed breakdown of what the machine was doing. What you are looking for is 100% CPU in the HPL processes and no other significant on-box activity (disk I/O, hard faults, etc.).
cheers
jeff

Friday, June 19, 2009 2:46 PM
-
Hi Jeff

I've not had a chance to carry out your suggestions yet, but I thought I'd say thanks for the response. I'll hopefully be able to move this forward this week, so I'll let you know how it goes.

Cheers
Dan

Tuesday, June 23, 2009 9:00 AM
-
Hmmm, all looks normal. Certainly no processes eating processor time other than HPL, disk I/O is minimal, no hard faults. Pretty much as I'd expect to see.

I'm going to try some other tests to see how the node behaves when running alternative codes, and run HPL against some different hardware. If it's not a problem when real calculations are being carried out I'm not too bothered, but it would be nice to get to the bottom of what's going on...

thanks
Dan

Thursday, June 25, 2009 9:31 AM
-
Doh, got it! Using the -affinity mpiexec flag makes everything run a whole lot sweeter.

8 cores (P=2 Q=4)
N        HPC 2008
280      0.48
2800     30.43
5600     41.01
11200    52.26
22400    60.73
42344    66.11

Note to self: the Windows HPC Team blog contains all sorts of useful information :)

Thanks for the help anyway
Dan
- Marked as answer by Josh Barnard (Moderator) Thursday, June 25, 2009 6:41 PM
Thursday, June 25, 2009 11:18 AM
-
I'm glad you figured it out, Dan!
As a side note, we actually expect significantly better performance on Windows HPC Server 2008, since 2008's MS-MPI stack includes some very significant optimizations to the shared memory stack (which has a huge impact when running multiple MPI ranks on a particular node).
Thanks!
Josh
-Josh

Thursday, June 25, 2009 6:43 PM Moderator
-
Hi Dan
Yes, I'd like to get to the bottom of it too. What type of machines are these? Anything special about the setup?
The next thing I would do is run with ETW tracing on during an 8-core run, and see if something jumps out there. To do this:
1) download the Windows Performance Toolkit from here: http://msdn.microsoft.com/en-us/performance/cc752957.aspx
2) install it into some easily accessible folder on your node (I use c:\xperf)
3) launch an admin command prompt and run xperf -on latency+drivers before your run starts
4) repro the issue for a while (a couple of minutes, say)
5) run xperf -stop -d trace.etl to dump the ETW log
6) run xperf trace.etl to view what was happening during that time
What might be interesting is to see whether there were any obvious driver delays that might account for the slow perf, or whether anything shows up when you select an area of the CPU graph and bring up the summary. If nothing obvious jumps out, you could upload the traces here (to MS) and we could take a look at them as well, if that would help.
cheers
jeff

Friday, June 26, 2009 6:34 AM
-
Sorry, missed the fact you'd answered your own question, and that affinity fixed it...
It's actually interesting that affinity made that much of a difference, as it shouldn't have anything like a 10x effect on performance. What might be interesting, if you have time for another experiment, is to run without affinity and monitor context switches per second. If that figure is absolutely astronomical (say > 1 million, or something silly), then that might be the root cause of your problem. I have seen this before when building Linpack against the wrong version of the Intel maths libraries (the multi-threaded version), for example.
cheers
jeff

Friday, June 26, 2009 7:25 AM
-
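Jeff's theory can be made concrete with some illustrative arithmetic (not a measurement): a threaded BLAS typically defaults to one worker thread per core, so 8 MPI ranks each spawning 8 workers leaves 64 compute-bound threads contending for 8 cores, and the scheduler thrashes.

```python
cores = 8
ranks = 8

# Threaded MKL typically defaults to one worker per core; the sequential
# library runs a single thread per rank (typical defaults, assumed here).
threads_per_rank_threaded = cores
threads_per_rank_sequential = 1

oversubscribed = ranks * threads_per_rank_threaded    # 64 threads on 8 cores
well_matched = ranks * threads_per_rank_sequential    # 8 threads on 8 cores
print(f"threaded BLAS: {oversubscribed} threads, sequential BLAS: {well_matched}")
```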
Hi Jeff, thanks for the follow-up.

I've run a couple of additional tests, and have taken advantage of the Lizard download. What's clear is that my Linpack build is the likely cause. I'm seeing an average of >1 million context switches/sec when running my original HPL build. When I repeat the job with the xhplmkl.exe build deployed as part of Lizard, context switches are much more reasonable (around 500) even when running without the affinity switch, and results are as expected. I don't think, however, that I've compiled against the multi-threaded version of Intel MKL.

At least the culprit has come to the fore, and unfortunately it's my compiler skills which seem to be lacking :)

Thanks again for your assistance, it's much appreciated.

Dan

Friday, June 26, 2009 2:43 PM