HPL performance degradation when run over all cores in a single node
- Hi All
I've been running some Linpack jobs to compare single node performance between 2003 CCS and HPC Server 2008, and have obtained some odd results. Output is much as expected when running over fewer than the maximum number of cores, but when run over all cores in a node the results from HPC Server 2008 are very poor indeed. I've included result data below for reference.
For info, I'm running HPL V2 compiled against Intel MKL / HPC Server 2008 MPI libraries.
If anyone has seen this type of behaviour and has hints or tips I'd appreciate some assistance.
Thanks in advance
Dan
Results are in GFlops; 2003 CCS numbers are for reference (hardware is identical: dual socket, 8 core server, 16GB RAM).
6 cores (P=2 Q=3)
N        2003 CCS   HPC 2008
280      0.28       0.35
2800     8.78       26.46
5600     14.05      34.62
11200    19.88      42.28
22400    32.64      47.95
42344    47.56      51.37
7 cores (P=1 Q=7)
N        2003 CCS   HPC 2008
280      0.41       0.49
2800     8.71       31.73
5600     14.30      41.49
11200    19.49      50.20
22400    29.89      56.48
42344    42.25      60.32
8 cores (P=2 Q=4)
N        2003 CCS   HPC 2008
280      0.06       0.003094
2800     7.92       0.22
5600     13.01      0.74
11200    18.17      1.09
22400    28.60      3.08
42344    41.46      6.43
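To put the 8-core drop in perspective, here is a minimal sketch that computes slowdown ratios from the N=42344 rows of the tables above. The dictionary keys and the `slowdown` helper are illustrative names only; the numbers are copied from the post.

```python
# GFlops at N=42344, copied from the tables in the post above.
results = {
    6: {"ccs2003": 47.56, "hpc2008": 51.37},
    7: {"ccs2003": 42.25, "hpc2008": 60.32},
    8: {"ccs2003": 41.46, "hpc2008": 6.43},
}


def slowdown(reference_gflops, observed_gflops):
    """Return how many times slower the observed run is than the reference."""
    return reference_gflops / observed_gflops


# 8 cores on HPC Server 2008 compared with 7 cores on the same system.
vs_seven_cores = slowdown(results[7]["hpc2008"], results[8]["hpc2008"])
# 8 cores on HPC Server 2008 compared with 8 cores on 2003 CCS.
vs_ccs_2003 = slowdown(results[8]["ccs2003"], results[8]["hpc2008"])

print(f"8 cores vs 7 cores on HPC 2008: {vs_seven_cores:.1f}x slower")
print(f"8 cores on HPC 2008 vs 2003 CCS: {vs_ccs_2003:.1f}x slower")
```

Adding the eighth rank costs roughly an order of magnitude here, where one would normally expect a modest gain.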
- Moved by parmita mehta (Moderator), Thursday, June 25, 2009 17:37: job scheduler related (From: Windows HPC Server Deployment, Management, and Administration)
All replies
- Hi Dan
That is indeed really odd. The first thing I would look at is whether something else untoward is going on with the machine whilst that test is running (the system paging to disk for some reason, maybe). You could probably get a reasonable feel for this through Task Manager / Resource Monitor, but if you want a more detailed view you could run the xperf tools on the machine whilst the test is running:
Download the Windows Performance Toolkit from http://msdn.microsoft.com/en-us/performance/default.aspx, and install it (to c:\xperf, say).
From an admin command prompt:
xperf -on latency
<run your scenario>
xperf -stop -d trace.etl
xperf trace.etl
This should bring up a UI with a fairly detailed breakdown of what the machine was doing. What you are looking for is 100% CPU in the HPL processes, and no other significant on-box activity (disk I/O, hard faults etc.).
cheers
jeff
- Hi Jeff
I've not had a chance to carry out your suggestions yet, but I thought I'd say thanks for the response. I'll hopefully be able to move this forward this week so I'll let you know how it goes.
Cheers
Dan
- Hmmm, all looks normal.
Certainly no processes eating processor time other than HPL, disk I/O is minimal, no hard faults. Pretty much as I'd expect to see.
I'm going to try some other tests to see how the node behaves when running alternative codes, and run HPL against some different hardware. If it's not a problem when real calculations are being carried out I'm not too bothered, but it would be nice to get to the bottom of what's going on...
thanks
Dan
- Doh, got it! Using the -affinity mpiexec flag makes everything run a whole lot sweeter.
8 cores (P=2 Q=4)
N        HPC 2008
280      0.48
2800     30.43
5600     41.01
11200    52.26
22400    60.73
42344    66.11
Note to self: the Windows HPC Team blog contains all sorts of useful information :)
Thanks for the help anyway
Dan
- Marked as answer by Josh Barnard (MSFT, Owner), Thursday, June 25, 2009 18:41
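For anyone scripting their runs, the fix above amounts to adding one flag to the launch line. A rough sketch of building that command follows; `-affinity` is the flag named in the post and `-n` sets the rank count, while `xhpl.exe` and the helper name are assumptions, so substitute your own HPL binary.

```python
def build_mpiexec_command(ranks, executable, use_affinity=True):
    """Build an mpiexec argument list, optionally with the -affinity flag."""
    command = ["mpiexec"]
    if use_affinity:
        # Pins each rank to a core, which is what fixed the 8-core run above.
        command.append("-affinity")
    command += ["-n", str(ranks), executable]
    return command


# xhpl.exe is a hypothetical name for the HPL binary.
command = build_mpiexec_command(8, "xhpl.exe")
print(" ".join(command))
```

The list form can be passed to `subprocess.run` on a node where mpiexec is on the path; here it is only printed.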
- I'm glad you figured it out, Dan!
As a side note, we actually expect significantly better performance on Windows HPC Server 2008, since the MS-MPI implementation in 2008 includes some major optimizations to the shared memory stack (which has a huge impact when running multiple MPI ranks on a single node).
Thanks!
Josh
- Hi Dan
Yes, I'd like to get to the bottom of it too. What type of machines are these? Is there anything special about the setup?
The next thing I would do is run with ETW tracing on during an 8 core run, and see if something jumps out there. To do this:
1) Download the Windows Performance Toolkit from http://msdn.microsoft.com/en-us/performance/cc752957.aspx
2) Install it into some easily accessible folder on your node (I use c:\xperf)
3) Launch an admin command prompt and run xperf -on latency+drivers before your run starts
4) Repro the issue for a while (a couple of minutes, say)
5) Run xperf -stop -d trace.etl to dump the ETW log
6) Run xperf trace.etl to view what was happening during that time
It would be interesting to see whether there were any obvious driver delays that might account for the slow perf, or whether anything shows up when you select an area of the CPU graph, right-click and open the summary. If nothing obvious jumps out, you could upload the traces to us at MS and we could take a look at them as well, if that would help?
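Steps 3 to 5 above can be wrapped in a small script so the trace is always stopped, even if the run fails. This is only a sketch: the xperf arguments are the ones quoted in the steps, while the function name, the injectable `run` parameter, and the `xhpl.exe` launch line are assumptions for illustration.

```python
import subprocess


def trace_workload(workload, trace_file="trace.etl", run=subprocess.run):
    """Start kernel tracing, run the workload, then stop and write the trace."""
    run(["xperf", "-on", "latency+drivers"], check=True)
    try:
        run(workload, check=True)
    finally:
        # Always stop the session so the kernel logger is not left running.
        run(["xperf", "-stop", "-d", trace_file], check=True)
    return trace_file


def dry_run(command, check=True):
    """Print the command instead of executing it."""
    print(" ".join(command))


# Hypothetical 8-rank HPL launch, shown as a dry run only.
trace_workload(["mpiexec", "-n", "8", "xhpl.exe"], run=dry_run)
```

Dropping the `run=dry_run` argument executes the commands for real, which needs an admin prompt with xperf on the path.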
cheers
jeff
- Sorry, missed the fact you'd answered your own question, and that affinity fixed it...
It's actually interesting that affinity made that much of a difference, as it shouldn't have anything like a 10x effect. If you have time for another experiment, it might be interesting to run without affinity and monitor context switches per second. If this is absolutely astronomical (say > 1 million or something silly), then that might be the root cause of your problem. I have seen this before when building Linpack against the wrong version of the Intel maths libraries (the multi-threaded version), for example.
cheers
jeff
- Hi Jeff, thanks for the follow up.
I've run a couple of additional tests, and have taken advantage of the Lizard download. What's clear is that my Linpack build is the likely cause. I'm seeing an average of >1 million context switches/sec when running my original HPL build. When I repeat the job with the xhplmkl.exe build deployed as part of Lizard, context switches are much more reasonable (around 500) even when running without the affinity switch, and results are as expected. I don't think, however, that I've compiled against the multi-threaded version of Intel MKL.
At least the culprit has come to the fore, and unfortunately it's my compiler skills which seem to be lacking :)
Thanks again for your assistance, it's much appreciated.
Dan
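The context-switch check discussed above could be automated along these lines. The sketch assumes the counter was captured as CSV text, for example with `typeperf "\System\Context Switches/sec" -sc 60`; the function name, node name, and sample values are made up for illustration and are not measurements from this thread. The 1 million threshold is the figure suggested earlier.

```python
import csv
import io

THRESHOLD = 1_000_000  # the "astronomical" level suggested in the thread


def average_counter(csv_text):
    """Average the first counter column of typeperf-style CSV output."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        try:
            values.append(float(row[1]))
        except (IndexError, ValueError):
            continue  # skip the header, blank lines and status messages
    if not values:
        raise ValueError("no samples found")
    return sum(values) / len(values)


# Made-up sample in typeperf's CSV layout, for illustration only.
sample = r'''"(PDH-CSV 4.0)","\\NODE\System\Context Switches/sec"
"06/25/2009 18:00:01.000","1250000.0"
"06/25/2009 18:00:02.000","1150000.0"
'''

average = average_counter(sample)
verdict = "suspicious" if average > THRESHOLD else "reasonable"
print(f"average context switches/sec: {average:.0f} ({verdict})")
```

A figure near the threshold with the original build, against a few hundred with a known-good build, points at the build and not at the scheduler or hardware.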

