This year I have been one of the advisors of my university's student team at the Student Cluster Competition at ISC-HPC in Frankfurt. In this competition, eleven teams from eight countries, each made up of six undergraduate students, their advisors, and an individually assembled “mini-cluster” (big shout-out to HPE for being our hardware sponsor, again!), fight to get the best performance for a number of applications out of their hardware within an imposed 3 kW power limit. In addition to an overall winner, there is a separate trophy for the LINPACK benchmark, which gets a lot of special attention because LINPACK is used to rank systems in the TOP500 list (released twice a year, this list names the five hundred fastest supercomputers on the planet).
Hoping that some of my previous research on energy-efficient HPC might prove useful, I asked one of the students to measure LINPACK performance and the peak power consumption of the corresponding run for all supported GPU clock speeds of an Nvidia P100 PCIe GPU accelerator. Figure 1 shows the results. Two observations can be made: first, the relationship between core frequency and performance is linear; second, the relationship between frequency and dissipated power is superlinear. The former is easily explained: because the DGEMM kernel at the heart of LINPACK is compute-bound on modern architectures, doubling the frequency doubles the peak Flop/s of the machine and, for all intents and purposes, doubles performance as well (note that the increase in performance stops at 1252 MHz because the chip runs into its TDP limit of 250 W, meaning that for this application the chip does not manage to clock higher than 1252 MHz). The superlinear relationship between frequency and power can be explained by physics (don't worry, I won't go into detail here; for those of you interested, the relationship between power P and frequency f given in the relevant literature is P ∼ f^α with α > 1, i.e., a power law).
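For readers who want to estimate the exponent α from their own frequency sweep, here is a minimal sketch that fits P = c · f^α by linear least squares in log-log space. The function name fit_power_law is my own, and the data points are synthetic placeholders generated from a known exponent, not the measurements behind Figure 1. The model also ignores static power, so treat the result as a rough approximation.

```python
import math


def fit_power_law(freqs_mhz, powers_w):
    """Fit P = c * f**alpha by least squares on (log f, log P); return (c, alpha)."""
    xs = [math.log(f) for f in freqs_mhz]
    ys = [math.log(p) for p in powers_w]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope of the log-log line
    c = math.exp(mean_y - alpha * mean_x)  # intercept, mapped back from log space
    return c, alpha


# Synthetic placeholder data generated from P = 1e-4 * f**2.
# These are NOT the measurements from the post.
freqs = [600.0, 800.0, 1000.0, 1200.0]
powers = [1e-4 * f ** 2 for f in freqs]

c, alpha = fit_power_law(freqs, powers)
print(f"fitted alpha = {alpha:.2f}")
```

With real measurements you would pass the measured clock speeds and peak power readings instead of the synthetic lists; a fitted α noticeably above 1 is what makes lowering the clock pay off in terms of efficiency.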
Relating measured performance to dissipated power gives energy efficiency as a function of GPU core frequency. The result, shown in Figure 2, indicates that running the GPU in default mode (i.e., without frequency adjustments) results in an energy efficiency of around 13.0 GFlop/s/W. In contrast, running the GPU's cores at a frequency between 822 and 898 MHz results in an energy efficiency of around 19.0 GFlop/s/W.
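The calculation behind such a plot is just a division per frequency followed by a search for the maximum. The sketch below shows the idea; the helper names and the sample values are illustrative placeholders of my own, not the data from Figure 2.

```python
def energy_efficiency(gflops, watts):
    """Energy efficiency in GFlop/s/W."""
    return gflops / watts


def sweet_spot(samples):
    """samples: iterable of (freq_mhz, gflops, watts).

    Returns (freq_mhz, efficiency) for the most energy-efficient sample.
    """
    scored = [(freq, energy_efficiency(gflops, watts)) for freq, gflops, watts in samples]
    return max(scored, key=lambda pair: pair[1])


# Illustrative placeholder values (freq in MHz, GFlop/s, W).
# These are NOT the measurements from the post.
samples = [
    (700, 1900.0, 110.0),
    (860, 2350.0, 125.0),
    (1063, 2900.0, 180.0),
    (1252, 3250.0, 250.0),
]

freq, eff = sweet_spot(samples)
print(f"most efficient at {freq} MHz: {eff:.1f} GFlop/s/W")
```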
After locating the energy-efficiency sweet spot, what remained was choosing a setup that minimized the power consumption of the host system(s) while allowing us to use as many GPUs as possible. We decided to go with two HPE ProLiant XL270d nodes in an HPE Apollo 6500 chassis (again, big thanks to Sorin-Cristian, Gallig, and Patrick from HPE!). Each of the nodes was equipped with two 14-core Intel Xeon E5-2690 v4 Broadwell-EP processors, 128 GB of RAM, a Mellanox EDR IB card, and six Nvidia P100 PCIe GPUs.
Although each ProLiant XL270d node is designed to hold eight PCIe accelerators, we found that the limited number of PCIe lanes (forty per Broadwell-EP processor, so eighty per node) caused inter-GPU communication bandwidth to degrade and become the bottleneck when using more than six GPUs per node (node performance peaked at six GPUs and fell when adding more cards!). The final setup was thus two nodes, each containing six Nvidia P100 GPUs.
The final step was finding the GPU core frequency that took total system power consumption as close as possible to 3 kW without going over it (dissipated power is measured during the whole competition and there are penalties for exceeding the imposed budget of 3 kW). At a frequency of 1063 MHz, total system power consumption was 2970 W. Although this frequency is outside the previously determined energy-efficiency sweet spot, overall performance is higher in this setting, and at the end of the day, this is what matters! Using this configuration, the system managed to set a new record of 37.1 TFlop/s for LINPACK, allowing our students to take first place! The achieved performance improved on the previous record of 31.7 TFlop/s (established at ASC16 by a team that also used Nvidia P100 GPUs) by 17%.
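The selection rule used here can be sketched in a few lines: among all frequencies whose measured system power stays within the cap, pick the highest one, which assumes (as Figure 1 suggests) that performance grows with frequency. The function names are my own, and only the 1063 MHz / 2970 W pair comes from our measurements; the two neighboring entries are hypothetical values added for illustration.

```python
POWER_CAP_W = 3000.0  # competition power limit of 3 kW


def best_frequency_under_cap(system_power_by_freq, cap_w=POWER_CAP_W):
    """Return the highest frequency (MHz) whose system power does not exceed cap_w.

    Assumes performance increases with frequency, so the highest
    admissible frequency is also the fastest admissible setting.
    """
    eligible = [freq for freq, watts in system_power_by_freq.items() if watts <= cap_w]
    if not eligible:
        raise ValueError("no frequency stays within the power cap")
    return max(eligible)


def improvement_percent(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


# Frequency (MHz) -> total system power (W).
measurements = {
    1050: 2930.0,  # hypothetical value, for illustration only
    1063: 2970.0,  # from the post
    1075: 3015.0,  # hypothetical value, for illustration only
}

print(best_frequency_under_cap(measurements), "MHz")
print(f"{improvement_percent(37.1, 31.7):.0f}% over the previous record")
```

The second function simply reproduces the comparison between the new 37.1 TFlop/s result and the previous 31.7 TFlop/s record.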