Some newbie HPC performance questions

    Question

  • Hi all,

    I've recently configured a cluster on Windows HPC Server 2008 (Windows Server 2008 SP2), at the moment just a head node and a single compute node. The cluster validation run from HPC Cluster Manager went well. I then installed MATLAB on it, but it did not pass the MATLAB cluster validation testing (it failed on parallel jobs), which led me to look for potential configuration issues. To this end I've also run Lizard testing, which reported less than desired consistency (72.3%). A couple of real newbie questions:

    1. What can potentially cause consistency issues, and what is considered a reasonable number?

    2. It seems Lizard counts 32 cores in my cluster, though I have 2 servers with two quad-core processors in each, i.e., 16 cores. Is that a configuration error, or does this have to do with Hyperthreading?

    3. I am using my head node as a compute node as well. Should I disable HT, given that I will mostly work with MATLAB? Are there general rules on when this is advisable, performance-wise?

    4. To get Lizard's efficiency numbers right, I need to enter the number of floating-point operations per clock cycle per core. This seems to be an elusive number. I have Intel X5560 Nehalem processors (2.8GHz), and I've seen the figure 90 GFlops per processor quoted somewhere, which would translate to 8 ops per cycle per core, but other places cite 4 as the default number for Nehalem. Assuming I resolve the consistency issue and actually get through the performance testing, what is considered a reasonable efficiency for a small cluster like mine?

    5. I was looking at the "Charts and Reports" tab in the cluster manager and noticed that the head node availability has been at 0 for the past several days. What does that mean? In the Node Management tab, both nodes appear to be online with node health marked as OK.

    Thanks for being patient, any insight is greatly appreciated, Guy 

    Friday, May 21, 2010 10:06 PM

All replies

  • Hello Guy,

    I'll take a punt at your questions as listed, some may be answered with another question though ;)

    1. Is your head node and compute node hardware of the same type? Are you running any other services on your head node which may affect performance results? What does this look like with Hyperthreading turned off?

    2. If you have Hyperthreading turned on then HPC Server will see all of the HT cores. From experience I suggest turning off Hyperthreading on all head nodes & compute nodes unless you are sure that it benefits your workloads. Leaving it off also helps when making sense of things like Lizard results.

    3. Following on from 2. It's difficult to say with certainty that you will or will not see benefit from Hyperthreading. There are some interesting HT results documented here: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/HPC_Dell_11G_BIOS_Options.pdf and if you take anything away from the document it's that HT helps sometimes, hinders others. Heck, results are different even between different solves using the same application. I think the only way to be sure is to test your own cases and decide.

    4. Nehalem processors do indeed perform 4 FLOPs per clock cycle per core. In your case this works out to something like 4 flops/cycle * 8 cores (two quad-core sockets per node) * 2.8 GHz = 89.6 GFlops per node (there's a quick whole-cluster figure below). Your cluster efficiency across a couple of nodes should be pretty decent; efficiency tends to drop off as the number of nodes increases (obviously how much it drops off depends on your interconnect specs). I'm thinking you should be seeing efficiency numbers up in the 80-90% range at least.

    5. Have you restarted your headnode recently? If not it might be worth doing so as this may be simply a reporting issue.
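
    To put some numbers on 4. for your whole cluster (counting physical cores only, since the HT logical cores don't add any floating point units), the theoretical peak works out to roughly:

    2 nodes * 2 sockets * 4 cores * 4 flops/cycle * 2.8 GHz = 179.2 GFlops

    which is the figure I'd measure Lizard's best result against.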

    Regards

    Dan

    PS Did you manage to resolve the IPV6 issue?

    Monday, May 24, 2010 3:33 PM
  • Hi Dan,

    Sorry for taking so long to get back. I do appreciate the time you took to post your comments.

    1. My head and compute nodes are slightly different servers but have the same processors and memory (2 x X5560 and 24 GB RAM in each). They do not run any services; this is a private cluster that only I have access to.

    2-3. I haven't turned off HT yet, but I came across the Dell paper you mentioned and figured it's probably best.

    4. Just to make sure, when you say 8 cores you mean quad times 2 for HT right?

    5. I have restarted the headnode several times in the last week.

    At the moment I am still struggling with configuring the cluster properly. For some reason I am unable to get NAT routing working for my compute node (it runs on a private network with the head node), so I have no internet connection on that node. At the same time Windows on my compute node is asking for activation (apparently it was not activated with the correct product key), but I cannot access the internet to resolve it with Microsoft.

    Both the Lizard and MATLAB validation tests are now failing; it seems to take forever for submitted jobs to get started. The strange thing is that everything was working fine after the initial cluster setup. Here are the things I did after the initial (presumably correct) configuration that may have resulted in the current mess:

    1. I was trying to set up a DHCP service on the head node to simplify future node deployment. Since I was unable to configure the cluster that way, I uninstalled the DHCP role and restored all the DNS addresses manually. Cluster validation through Windows HPC Cluster Manager seemed to be OK, so I assumed I had restored the original configuration.

    2. I tried to disable IPv6 on both nodes by adding a registry entry, to avoid NAT server warnings associated with RDP sessions (the change is spelled out after this list). I think I finally nailed this one following the instructions in: http://blogs.dirteam.com/blogs/paulbergson/archive/2009/03/19/disabling-ipv6-on-windows-2008.aspx

    3. In order to use Windows Backup for the head node's disk, I shrank the compute node's disk and partitioned it into two volumes, so that the head node's disk can be backed up onto one of them.

    4. To avoid password prompts when communicating between the head node and the compute node, I changed group policy to allow saving passwords and credential delegation. Although I am not asked for a password during the MATLAB parallel jobs validation, when I RDP from the head to the compute node I still need to enter a password and get the notice: "your credentials did not work; your system administrator does not allow to save credentials...".
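
    For reference, the registry change from that article (point 2 above) boils down to something like the following on each node, followed by a reboot (0xFF disables all IPv6 components except the loopback; some write-ups use 0xFFFFFFFF instead):

    reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0xFF /f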

    I cannot see how any of the above would create the issues I am seeing (most prominently the NAT routing problems on the head node), but if anything comes to mind, please let me know. Thanks for your help.

    Saturday, May 29, 2010 12:00 AM
  • A quick update: I was able to get NAT routing working, so I went back to checking performance:

    1. Lizard pack: it did pass the consistency tests this time but ended with a best performance of 124.9 GF, which Lizard translated into 17.4% efficiency. I guess the right efficiency number should be 124.9 GF / (32 * 4 * 2.8 GHz) = 34.8% (I originally put 8 flops per clock cycle and do not know how to change it), still a very low number for a two-node cluster (see the physical-core comparison after point 2).

    2. The MATLAB cluster validation test fails on the parallel computing job: MPI start-up time is too long and the job is canceled. MATLAB uses Microsoft MPI, so I am guessing the same problem (perhaps network related) also causes the low efficiency seen by Lizard.
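
    For comparison, and assuming only the 16 physical cores should count towards peak (the HT logical cores add no floating point units), the same measurement works out to 124.9 / (16 * 4 * 2.8 GHz) = 124.9 / 179.2, i.e., roughly 70%.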

    Does this look like a configuration or hardware issue?

    Thanks, Guy

    Monday, May 31, 2010 6:54 PM
  • Hi Guy

    Apologies for the delay in replying, I've been busy doing various exciting things such as sitting by the pool with cold beer and changing jobs.

    One thing that immediately comes to mind is to run the full suite of cluster diagnostics. Do you see any failures, and what are the results from the MPI tests?

    Cheers

    Dan

    Wednesday, June 23, 2010 7:59 AM
  • Hi Dan,

    Definitely more important than answering silly HPC questions.

    Here is a quick update on my status, with a couple of new questions:

    I came to realize that the problems were connected with DHCP. Originally I set up static addresses for the compute nodes and removed the DHCP role from the head node. Bringing DHCP back and configuring it outside HPC was not accepted nicely by the cluster manager. So I did a clean reinstallation of Windows and HPC, this time keeping DHCP, and it seems my problems are gone. I was able to validate MATLAB, and the Lizard test shows 91.5% efficiency. Now for the questions:

    1. I am still using static addresses for the compute nodes, so although the DHCP service is running, I guess it is not doing its thing. Since everything seems to be working, I am hesitant to reconfigure the private network to include DHCP. Are there benefits to using dynamic addressing on the private network, other than easier deployment of future nodes? If so, a quick reference on how to reconfigure the network would be much appreciated.

    2. I was trying to add a maintenance task to the node template, e.g. to install HPC SP2, where I saved the patch in a shared folder on the head node's disk. When trying to maintain the compute nodes, this task fails, saying the directory name is invalid.

    3. Checking the "Node availability report", the head node shows as 0% available. Any explanation of what this means?

    Thanks, Guy

    Friday, June 25, 2010 10:57 PM
  • Good to hear you've resolved your networking issues. As you've probably realised the HPC services keep a fairly tight leash on DHCP configuration.

    1. Now that you've got DHCP installed, it's quite a small step to configure a scope in the HPC management console: just reconfigure the network settings with appropriate values for your scope. DHCP is a prerequisite for automated node deployment, and as it's in place anyway you may as well take advantage of its convenience. You say you have static addresses configured on all compute nodes at the moment; DHCP will not interfere with those, so the migration can be staggered. If you want to switch the nodes over to DHCP you'll need to change the network settings on each node. Not sure how many nodes you have, but this can either be done manually on each node, or you could use clusrun to push something like

    netsh interface ip set address "<Name of Network Connection>" dhcp

    Obviously test this out on one node first to ensure it works well in your environment :) (there's a full clusrun line after point 3 below)

    2. What sort of maintenance task are you adding? Is it just a simple run command? Not wanting to state the obvious, but check the path you're using, try the FQDN of the server if you're using just the NetBIOS name, and make sure that the share / NTFS permissions are correct (there's a quick share check after point 3 below). Another good debugging step is to copy the file locally onto a node, then change the maintenance task to point to the local file.

    3. Is your headnode configured as a compute node, and is it offline or online?
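
    To expand on 1., pushed from the head node via clusrun that would look something like the following (substituting your own node name and network connection name; COMPUTENODE01 is just a placeholder):

    clusrun /nodes:COMPUTENODE01 netsh interface ip set address "<Name of Network Connection>" dhcp

    and similarly for the DNS settings if those are set statically too.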
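
    And on 2., a quick way to check whether the nodes can reach the share at all is something like this (HEADNODE and Patches being placeholders for your head node and share names):

    clusrun /nodes:COMPUTENODE01 dir \\HEADNODE\Patches

    If that fails too, it points at a path or permissions problem rather than anything in the node template itself.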

    Cheers

    Dan 

    Tuesday, June 29, 2010 10:31 AM
  • Thanks Dan,

    Just got back to work after relocating to a new home. I have a new concern regarding DHCP.

    It seems the DHCP server, running on the private network (I'm using topology 1), leased IP addresses to client machines on the public network (outside my cluster). Any clue as to what may be the cause and how to correct it would be greatly appreciated.

    As for head node availability: it is configured as a compute node and is online at all times, but it still shows 0% availability.

    Tuesday, July 13, 2010 9:32 PM