This year, I was again an adviser of my university's team in the student cluster competition taking place at ISC. In this competition, twelve teams (each made up of six bachelor students) competed to get the highest performance out of their self-assembled mini-clusters for a set of applications under a 3000 W power constraint. Once more, our team managed to bring the prestigious LINPACK award home to Erlangen!
Before going into the details, I'd like to extend my gratitude to our sponsors, without whom this whole affair would not have been possible: I'd like to thank HPE in general and Patrick Martin, our contact at HPE, in particular for providing us with a system according to our specifications and loaning it to us for free! I'd also like to thank GCS in general and Regina Weigand, our contact at GCS, in particular, as well as SPPEXA, for providing us with financial support to help with competition-related expenses. Now, with pleasantries out of the way, let's get down to business…
The 3000 W power limit enforced during the competition serves as an equalizing factor and makes it necessary to work smart, not hard. You cannot simply ask your sponsor for a crazy amount of hardware and win the battle for performance by superior numbers. Instead, you have to optimize your system for energy efficiency, which in the case of LINPACK means floating-point operations per second per Watt (Flop/s/W): the overall performance (Flop/s) is the product of the system's energy efficiency (Flop/s/W) and the available power budget of 3000 W.
With energy efficiency as the optimization target, CPUs were off the table: Although we had observed significant energy-efficiency improvements in recent CPUs (e.g., for LINPACK we got up to 8.1 GFlop/s/W on a 28-core Intel Skylake-SP Xeon Platinum 8170 processor compared to 4.6 GFlop/s/W on an 18-core Broadwell-EP Xeon E5-2697 v4 one), they were at least a factor of three behind contemporary GPUs in terms of GFlop/s/W. The obvious strategy thus consisted of packing as many GPUs as made sense into our nodes and using the CPUs merely to drive the host systems.
For the competition, we went with an HPE Apollo 6500 Gen9 system, comprising two HPE XL270d compute nodes connected directly via InfiniBand EDR. Each of the nodes was equipped with two 14-core Intel Xeon E5-2690 v4 processors, 256 GB of RAM, and Nvidia GPUs. In the previous competition, which took place in June 2017, we used (then state-of-the-art) Nvidia Tesla P100 GPUs. For the 2018 competition, we could get our hands on V100 GPUs, which were released in December 2017. Now the crux of the matter was determining how many GPUs to put into each of the host systems. Without GPUs, the base power of each node (i.e., the power used for CPUs, RAM, blowers, and so on) was around 250 W, which left 2500 W of the 3000 W budget for the GPUs. The naive strategy would have been to divide this remaining power budget of 2500 W by the 250-Watt TDP of a V100 GPU. This strategy would have told us to use ten GPUs in total, corresponding to five per node. The LINPACK performance of a single V100 GPU (16 GB, PCIe version) running in Turbo mode, and therefore fully exhausting the 250 W TDP, is around 4.5 TFlop/s. So, for the ten-GPU setup we could have expected a performance of around 45 TFlop/s. Not bad, considering this would have been a 42% improvement over our previous P100-based record of 31.7 TFlop/s.
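The naive estimate above is simple enough to reproduce in a few lines. The following Python sketch uses only the figures quoted in this post; the constant and function names are made up for illustration, and the "around" values are treated as exact.

```python
# Naive sizing: run every GPU in Turbo mode at its full TDP.
# All numbers come from the text; the names are illustrative only.
POWER_LIMIT_W = 3000.0       # competition power cap
NUM_NODES = 2                # two XL270d nodes
NODE_BASE_POWER_W = 250.0    # per-node power without GPUs (approximate)
V100_TDP_W = 250.0           # V100 PCIe TDP
V100_TURBO_TFLOPS = 4.5      # LINPACK per V100 in Turbo mode (approximate)
P100_RECORD_TFLOPS = 31.7    # previous P100-based record


def naive_gpu_estimate():
    """Return (GPU power budget, GPU count, expected TFlop/s, relative gain)."""
    gpu_budget_w = POWER_LIMIT_W - NUM_NODES * NODE_BASE_POWER_W
    num_gpus = int(gpu_budget_w // V100_TDP_W)
    expected_tflops = num_gpus * V100_TURBO_TFLOPS
    gain = expected_tflops / P100_RECORD_TFLOPS - 1.0
    return gpu_budget_w, num_gpus, expected_tflops, gain


budget_w, num_gpus, tflops, gain = naive_gpu_estimate()
print(f"GPU power budget: {budget_w:.0f} W")
print(f"GPUs: {num_gpus} in total, {num_gpus // NUM_NODES} per node")
print(f"Expected LINPACK: {tflops:.1f} TFlop/s ({gain:.0%} over the P100 record)")
```

Note that this estimate assumes perfect scaling across GPUs and nodes, which is an idealization.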
However, there is room for improvement. In the post about the previous competition, the GPU core frequency was identified as a variable with significant impact on energy efficiency. This relationship between frequency and energy efficiency is demonstrated in Figure 1. For one thing, the figure quantifies the energy-efficiency improvement of the V100 compared to its predecessor. For another, the data indicates that Turbo mode, where GPU cores are clocked at around 1380 MHz, is not a good choice when optimizing for energy efficiency. Instead, the GPU should be clocked at 720 MHz, for at this frequency the energy-efficiency optimum of 28.1 GFlop/s/W is attained. In theory, this strategy allows for a performance of 2500 W · 28.1 GFlop/s/W = 70.2 TFlop/s, a value that is 56% higher than the 45 TFlop/s that can be achieved by the naive approach of running the GPUs in Turbo mode.
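To make the comparison concrete, here is a small Python sketch of the performance = efficiency × power relation for both operating points. The Turbo-mode efficiency of 18 GFlop/s/W is not a measured value from this post; it is derived here from the quoted 4.5 TFlop/s at 250 W. The names are illustrative.

```python
# Performance under a fixed power budget: efficiency times power.
# Numbers are taken from the text; names are illustrative only.
GPU_POWER_BUDGET_W = 2500.0            # 3000 W minus 2 x 250 W base power
EFFICIENCY_720_MHZ = 28.1              # GFlop/s/W at the efficiency optimum
V100_TURBO_TFLOPS = 4.5                # LINPACK per V100 in Turbo mode
V100_TDP_W = 250.0                     # power drawn in Turbo mode


def performance_tflops(efficiency_gflops_per_w, power_w):
    """Attainable performance in TFlop/s for a given efficiency and power."""
    return efficiency_gflops_per_w * power_w / 1000.0


# Derived, not measured: Turbo-mode efficiency in GFlop/s/W.
efficiency_turbo = V100_TURBO_TFLOPS * 1000.0 / V100_TDP_W

turbo_tflops = performance_tflops(efficiency_turbo, GPU_POWER_BUDGET_W)
tuned_tflops = performance_tflops(EFFICIENCY_720_MHZ, GPU_POWER_BUDGET_W)
gain = tuned_tflops / turbo_tflops - 1.0

print(f"Turbo mode: {efficiency_turbo:.1f} GFlop/s/W")
print(f"Turbo mode at 2500 W: {turbo_tflops:.2f} TFlop/s")
print(f"720 MHz at 2500 W:    {tuned_tflops:.2f} TFlop/s")
print(f"Relative gain:        {gain:.0%}")
```

The theoretical figure only holds if the whole 2500 W can actually be spent on GPUs running at 720 MHz.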
Note, however, that in practice this value was not attainable using our two-node setup, which provided room for at most sixteen GPUs: A single V100 GPU consumes 118 W when running at the most energy-efficient frequency of 720 MHz, which means that 21 GPUs would be required to exhaust the remaining power budget of 2500 W while running at the optimal energy efficiency of 28.1 GFlop/s/W. Moreover, we could get our hands on only twelve V100 GPUs (with a retail price of around $20,000 apiece back then, these cards were, after all, quite expensive). Nevertheless, using the approach of tuning the GPU core frequency to increase energy efficiency, we were able to get 51.7 TFlop/s out of our two-node twelve-GPU setup, exactly 20.0 TFlop/s more than our previous record of 31.7 TFlop/s set at ISC17. It goes without saying that this was enough to secure the LINPACK award in this year's competition.
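The GPU-count argument can be sketched the same way. The Python snippet below uses the figures from this post; the per-GPU power budget and the whole-system efficiency it prints are derived here as rough bounds (assuming the 250 W base power per node is unchanged and the full 3000 W is drawn), not values measured during the competition. The names are illustrative.

```python
# How many GPUs would the power budget support at 720 MHz, and what does
# the twelve-GPU setup imply? Numbers from the text; names are illustrative.
POWER_LIMIT_W = 3000.0
GPU_POWER_BUDGET_W = 2500.0
V100_POWER_AT_720_MHZ_W = 118.0
MAX_GPU_SLOTS = 16            # two nodes, eight slots each
AVAILABLE_GPUS = 12
ACHIEVED_TFLOPS = 51.7
PREVIOUS_RECORD_TFLOPS = 31.7

# GPUs that fit into the budget when each draws 118 W.
gpus_to_fill_budget = int(GPU_POWER_BUDGET_W // V100_POWER_AT_720_MHZ_W)

# The setup is limited by slots and by the number of cards on hand.
usable_gpus = min(gpus_to_fill_budget, MAX_GPU_SLOTS, AVAILABLE_GPUS)

# Derived bounds, assuming base power stays at 250 W per node.
per_gpu_budget_w = GPU_POWER_BUDGET_W / usable_gpus
per_gpu_tflops = ACHIEVED_TFLOPS / usable_gpus
min_system_efficiency = ACHIEVED_TFLOPS * 1000.0 / POWER_LIMIT_W

print(f"GPUs needed to fill 2500 W at 720 MHz: {gpus_to_fill_budget}")
print(f"GPUs actually usable: {usable_gpus}")
print(f"Power available per GPU: {per_gpu_budget_w:.0f} W")
print(f"Achieved per GPU: {per_gpu_tflops:.2f} TFlop/s")
print(f"System efficiency, at least: {min_system_efficiency:.1f} GFlop/s/W")
print(f"Gain over ISC17: {ACHIEVED_TFLOPS - PREVIOUS_RECORD_TFLOPS:.1f} TFlop/s")
```

With only twelve cards, each GPU has roughly 208 W available, well above the 118 W drawn at 720 MHz, which suggests the best operating point for this setup lies somewhere between the efficiency optimum and Turbo mode.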
Finally, it is worth pointing out that during the investigations several problems with Nvidia's software and hardware became apparent that Nvidia should address as quickly as possible. Some of these problems make it impossible to deliver consistent performance, while others significantly limit the attainable performance in settings with no power constraints (e.g., when determining a system's TOP500 score). But this is something I'll address in a separate post.