<div>HPL performance degradation when run over all cores in a single node</div> <div>Hi All</div> <div>I've been running some Linpack jobs to compare single-node performance between 2003 CCS and HPC Server 2008, and have obtained some odd results. Output is much as expected when running over fewer than the maximum number of cores, but when run over all cores in a node the results from HPC Server 2008 are very poor indeed. I've included result data below for reference.</div> <div>For info, I'm running HPL V2 compiled against Intel MKL / HPC Server 2008 MPI libraries.</div> <div>If anyone has seen this type of behaviour and has hints or tips I'd appreciate some assistance.</div> <div>Thanks in advance</div> <div>Dan</div> <div><br/></div> <div>Results are in GFlops; 2003 CCS numbers are for reference (hardware is identical: dual socket, 8 core server, 16GB RAM).</div>
<table> <caption>6 cores (P=2 Q=3)</caption> <tr><th>N</th><th>2003 CCS</th><th>HPC 2008</th></tr> <tr><td>280</td><td>0.28</td><td>0.35</td></tr> <tr><td>2800</td><td>8.78</td><td>26.46</td></tr> <tr><td>5600</td><td>14.05</td><td>34.62</td></tr> <tr><td>11200</td><td>19.88</td><td>42.28</td></tr> <tr><td>22400</td><td>32.64</td><td>47.95</td></tr> <tr><td>42344</td><td>47.56</td><td>51.37</td></tr> </table>
<div><br/></div>
<table> <caption>7 cores (P=1 Q=7)</caption> <tr><th>N</th><th>2003 CCS</th><th>HPC 2008</th></tr> <tr><td>280</td><td>0.41</td><td>0.49</td></tr> <tr><td>2800</td><td>8.71</td><td>31.73</td></tr> <tr><td>5600</td><td>14.30</td><td>41.49</td></tr> <tr><td>11200</td><td>19.49</td><td>50.20</td></tr> <tr><td>22400</td><td>29.89</td><td>56.48</td></tr> <tr><td>42344</td><td>42.25</td><td>60.32</td></tr> </table>
<div><br/></div>
<table> <caption>8 cores (P=2 Q=4)</caption> <tr><th>N</th><th>2003 CCS</th><th>HPC 2008</th></tr> <tr><td>280</td><td>0.06</td><td>0.003094</td></tr> <tr><td>2800</td><td>7.92</td><td>0.22</td></tr> <tr><td>5600</td><td>13.01</td><td>0.74</td></tr> <tr><td>11200</td><td>18.17</td><td>1.09</td></tr> <tr><td>22400</td><td>28.60</td><td>3.08</td></tr> <tr><td>42344</td><td>41.46</td><td>6.43</td></tr> </table>
© 2009 Microsoft Corporation.
All rights reserved.
<div>Thread: <a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#50ce4822-0a10-44d5-bddf-02ee71e4261d">http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#50ce4822-0a10-44d5-bddf-02ee71e4261d</a> (Fri, 26 Jun 2009 14:43:32 Z)</div>
<div>Original post by <a href="http://social.microsoft.com/Profile/en-US/?user=DanAdams">DanAdams</a>, Thu, 18 Jun 2009 09:26:31 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=Jeff%20Baxter">Jeff Baxter</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#20bf35bc-8edd-4137-b96b-7a4e4d74a9d2">link</a>):</div>
Hi Dan<br/><br/>That is indeed really odd. The first thing I would look at is whether something else untoward is going on with the machine whilst that test is running (the system paging to disk for some reason, maybe). You could probably get a reasonable feel for this through Task Manager / Resource Monitor, but if you want a more detailed view you could run the xperf tools on the machine whilst the test is running:<br/><br/>Download the Windows Performance Toolkit from <a href="http://msdn.microsoft.com/en-us/performance/default.aspx">http://msdn.microsoft.com/en-us/performance/default.aspx</a> and install it (to c:\xperf, say).<br/>From an admin command prompt:<br/>xperf -on latency<br/>&lt;run your scenario&gt;<br/>xperf -stop -d trace.etl<br/>xperf trace.etl<br/><br/>This should bring up a UI with a fairly detailed breakdown of what the machine was doing. What you are looking for is 100% CPU in the HPL processes, and no other significant on-box activity (disk I/O, hard faults, etc.).
<br/><br/>cheers<br/>jeff<br/><br/>
<div>Fri, 19 Jun 2009 14:46:37 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=DanAdams">DanAdams</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#59a475cd-4e31-4c7a-8126-ed0f0b0cf646">link</a>):</div>
Hi Jeff <div>I've not had a chance to carry out your suggestions yet, but I thought I'd say thanks for the response. I'll hopefully be able to move this forward this week, so I'll let you know how it goes.</div> <div>Cheers</div> <div>Dan</div>
<div>Tue, 23 Jun 2009 09:00:36 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=DanAdams">DanAdams</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#a06e1934-3267-4cef-9ba0-06e86b73eb95">link</a>):</div>
Hmmm, all looks normal. <div>Certainly no processes eating processor time other than HPL, disk I/O is minimal, no hard faults. Pretty much as I'd expect to see.</div> <div>I'm going to try some other tests to see how the node behaves when running alternative codes, and run HPL against some different hardware. If it's not a problem when real calculations are being carried out I'm not too bothered, but it would be nice to get to the bottom of what's going on...</div> <div>thanks</div> <div>Dan</div>
<div>Thu, 25 Jun 2009 09:31:16 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=DanAdams">DanAdams</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#ccede9e1-25ef-45d9-ac4a-935128e8f589">link</a>):</div>
Doh, got it! Using the -affinity mpiexec flag makes everything run a whole lot sweeter. <div><br/></div> <div> <table> <caption>8 cores (P=2 Q=4), with -affinity</caption> <tr><th>N</th><th>HPC 2008</th></tr> <tr><td>280</td><td>0.48</td></tr> <tr><td>2800</td><td>30.43</td></tr> <tr><td>5600</td><td>41.01</td></tr> <tr><td>11200</td><td>52.26</td></tr> <tr><td>22400</td><td>60.73</td></tr> <tr><td>42344</td><td>66.11</td></tr> </table> <div><br/></div> <div>Note to self - the Windows HPC Team blog contains all sorts of useful information :)</div> <div><br/></div> <div>Thanks for the help anyway</div> <div>Dan</div> </div>
<div>Thu, 25 Jun 2009 11:18:39 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=Josh%20Barnard">Josh Barnard</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#8604bc35-d4d9-4cc2-958e-06458988be12">link</a>):</div>
I'm glad you figured it out, Dan!<br/><br/>As a side note, we actually expect significantly better performance on Windows HPC Server 2008, since 2008's MS-MPI stack includes some very significant optimizations to the shared memory stack (which has a huge impact when running multiple MPI ranks on
a particular node).<br/><br/>Thanks!<br/>Josh<hr class="sig">-Josh
<div>Thu, 25 Jun 2009 18:43:15 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=jeff%20baxter%20.">jeff baxter .</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#ac917666-add3-4af6-9632-9b25a6816bc8">link</a>):</div>
Hi Dan<br/><br/>Yes, I'd like to get to the bottom of it too. What type of machines are these; anything special about the setup?<br/><br/>The next thing I would do is run with ETW tracing on during an 8-core run, and see if something jumps out there. To do this:<br/><br/>1) Download the Windows Performance Toolkit from here: <a href="http://msdn.microsoft.com/en-us/performance/cc752957.aspx">http://msdn.microsoft.com/en-us/performance/cc752957.aspx</a><br/>2) Install it into some easily accessible folder on your node (I use c:\xperf)<br/>3) Launch an admin command prompt and run xperf -on latency+drivers before your run starts<br/>4) Repro the issue for a while (a couple of minutes, say)<br/>5) Run xperf -stop -d trace.etl to dump the ETW log<br/>6) Run xperf trace.etl to view what was happening during that time<br/><br/>What might be interesting is to see if there were any obvious driver delays that might account for the slow perf, or if anything shows up when you select an area of the CPU graph, right-click, and open the summary.
If nothing obvious jumps out, the traces could be uploaded here (to MS) and we could take a look at them as well, if it would help?<br/><br/>cheers<br/>jeff
<div>Fri, 26 Jun 2009 06:34:13 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=jeff%20baxter%20.">jeff baxter .</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#4b2c3ceb-536b-43f9-af19-07c0a244bd11">link</a>):</div>
Sorry, missed the fact you'd answered your own question, and that affinity fixed it...<br/><br/>It's actually interesting that affinity made that much of a difference, as it shouldn't have anything like the effect of improving things 10x. What might be interesting, if you have time for another experiment, is to run without affinity and monitor context switches per second. If this is absolutely astronomical (say &gt; 1 million or something silly), then that might be the root cause of your problem. I have seen this before when building Linpack against the wrong version of the Intel maths libraries (the multi-threaded version), for example.<br/><br/>cheers<br/>jeff
<div>Fri, 26 Jun 2009 07:25:22 Z</div>
<div><br/></div>
<div><a href="http://social.microsoft.com/Profile/en-US/?user=DanAdams">DanAdams</a> (<a href="http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/50ce4822-0a10-44d5-bddf-02ee71e4261d#11b11362-f6cc-4955-8d7d-f47195dff89d">link</a>):</div>
Hi Jeff, thanks for the follow-up. <div>I've run a couple of additional tests, and have taken advantage of the Lizard download. What's clear is that my Linpack build is the likely cause. I'm seeing an average of &gt;1 million context switches/sec when running my original HPL build. When I repeat the job, but with the xhplmkl.exe build deployed as part of Lizard, context switches are much more reasonable (around 500) even when running without the affinity switch, and results are as expected. I don't think, however, that I've compiled against the multi-threaded version of Intel MKL.</div> <div>At least the culprit has come to the fore, and unfortunately it's my compiler skills which seem to be lacking :)</div> <div>Thanks again for your assistance, it's much appreciated.</div> <div>Dan</div> <div><br/></div>
<div>Fri, 26 Jun 2009 14:43:32 Z</div>
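<div>For anyone reading this thread later: a minimal sketch, in Python, that puts numbers on how much the -affinity flag changed the 8-core (P=2 Q=4) results. The GFlops values are copied from the two 8-core tables Dan posted above (HPC 2008 column, before and after adding -affinity); the dictionary and function names are hypothetical, chosen only for this illustration.</div>

```python
# GFlops figures copied from the 8-core (P=2 Q=4) tables posted in this thread.
# Keys are the HPL problem size N; values are the HPC Server 2008 results.
WITHOUT_AFFINITY = {
    280: 0.003094, 2800: 0.22, 5600: 0.74,
    11200: 1.09, 22400: 3.08, 42344: 6.43,
}
WITH_AFFINITY = {
    280: 0.48, 2800: 30.43, 5600: 41.01,
    11200: 52.26, 22400: 60.73, 42344: 66.11,
}


def speedups(before, after):
    """Return {N: after/before} for every problem size present in both runs."""
    return {n: after[n] / before[n] for n in sorted(before) if n in after}


if __name__ == "__main__":
    # Print one line per problem size, smallest N first.
    for n, ratio in speedups(WITHOUT_AFFINITY, WITH_AFFINITY).items():
        print(f"N={n:>6}: {ratio:7.1f}x faster with -affinity")
```

<div>At the largest problem size (N=42344) the ratio comes out at roughly 10x, which lines up with the figure Jeff mentions; at the smaller problem sizes the gap is considerably wider.</div>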