Gauss Award 2018 goes to Erlangen!

Gauss Award
© Philip Loeper for ISC Event Photos

The Gauss Award is sponsored by the Gauss Centre for Supercomputing (GCS), which is a collaboration of the German national supercomputing centers at Garching, Jülich, and Stuttgart. The winner receives a cash prize of 3,000€, courtesy of the Gauss Centre, which is traditionally presented during the ISC Conference Opening Session. This year, our paper, which was the result of a joint effort between the Chair for Computer Architecture and the HPC group at the computing center (RRZE) of FAU, was selected as the winner by the Gauss Award committee:

Johannes Hofmann, Georg Hager, and Dietmar Fey: On the Accuracy and Usefulness of Analytic Energy Models for Contemporary Multicore Processors

The contributions we make in the paper are as follows. First, we propose a refinement to the execution-cache-memory (ECM) model, which corrects the model’s multi-core estimates for many-core processors by taking contention for shared resources into account. Second, we present analytic power and energy models for steady-state codes. These models take all relevant chip parameters into account and are capable of delivering high-quality estimates for scalable and saturating codes. Finally, we use the presented models and empirical data to improve the community’s understanding of processors’ power and energy properties by identifying governing principles and universal behavior across different architectures.

The full paper, covering Intel and AMD processors, is available here. A publicly available pre-print, which does not include results for AMD's Epyc CPU, can be found here.




Student cluster competition at ISC18 and energy-efficiency improvements of Nvidia's V100 GPU

This year, I was again an adviser of my university's team in the student cluster competition taking place at ISC. In this competition, twelve teams (each made up of six bachelor students) competed to get the highest performance out of their self-assembled mini-clusters for a set of applications under a 3000 W power constraint. Once more, our team managed to bring the prestigious LINPACK award home to Erlangen!

Student Cluster Competition trophies

Before going into the details, I'd like to extend my gratitude to our sponsors, without whom this whole affair would not have been possible: I'd like to thank HPE in general and Patrick Martin, our contact at HPE, in particular for providing us with a system according to our specifications and loaning it to us for free! I'd also like to thank GCS in general and Regina Weigand, our contact at GCS, in particular, as well as SPPEXA, for providing us with financial support to help with competition-related expenses. Now, with pleasantries out of the way, let's get down to business…

The 3000 W power limit enforced during the competition serves as an equalizing factor and makes it necessary to work smart, not hard. You cannot simply ask your sponsor for a crazy amount of hardware and win the battle for performance by superior numbers. Instead, you have to optimize your system for energy efficiency, which in the case of LINPACK means floating-point operations per second per Watt (Flop/s/W), for the overall performance (Flop/s) is given by multiplying the system's energy efficiency (Flop/s/W) by the available power budget of 3000 W.

With energy efficiency as the optimization target, CPUs were off the table: Although we had observed significant energy-efficiency improvements in recent CPUs (e.g., for LINPACK we got up to 8.1 GFlop/s/W on a 28-core Intel Skylake-SP Xeon Platinum 8170 processor compared to 4.6 GFlop/s/W on an 18-core Broadwell-EP Xeon E5-2697 v4 one), they were at least a factor of three behind contemporary GPUs in terms of GFlop/s/W. The obvious strategy thus consisted of packing as many GPUs as made sense into our nodes and using the CPUs merely to drive the host systems.

For the competition, we went with an HPE Apollo 6500 Gen9 system, comprising two HPE XL270d compute nodes connected directly via InfiniBand EDR. Each of the nodes was equipped with two 14-core Intel Xeon E5-2690 v4 processors, 256 GB of RAM, and Nvidia GPUs. In the previous competition, which took place in June 2017, we used (then state-of-the-art) Nvidia Tesla P100 GPUs. For the 2018 competition, we could get our hands on V100 GPUs, which were released in December 2017. Now the crux of the matter was determining how many GPUs to put into each of the host systems. Without GPUs, the base power of each node (i.e., the power used for CPUs, RAM, blowers, and so on) was around 250 W. The naive strategy would have been to divide the remaining power budget of 2500 W by the 250-Watt TDP of a V100 GPU. This strategy would have told us to use ten GPUs in total, corresponding to five per node. The LINPACK performance of a single V100 GPU (16 GB, PCIe version) running in Turbo mode—and therefore fully exhausting the 250 W TDP—is around 4.5 TFlop/s. So, for the ten-GPU setup we could have expected a performance of around 45 TFlop/s—not bad, considering this would have been a 42% improvement over our previous P100-based record of 31.7 TFlop/s.

However, there is room for improvement. In the post about the previous competition, the GPU core frequency was identified as a variable with significant impact on energy efficiency. This relationship between frequency and energy efficiency is demonstrated in Figure 1. For one thing, the figure quantifies the energy-efficiency improvement of the V100 over its predecessor. For another, the data indicates that Turbo mode, where the GPU cores are clocked at around 1380 MHz, is not a good choice when optimizing for energy efficiency. Instead, the GPU should be clocked at 720 MHz, for at this frequency the energy-efficiency optimum of 28.1 GFlop/s/W is attained. In theory, this strategy allows for a performance of 2500 W · 28.1 GFlop/s/W = 70.2 TFlop/s—a value that is 56% higher than what can be achieved by the naive approach of running the GPUs in Turbo mode.
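For the curious, here is a small back-of-the-envelope sketch that redoes this power budgeting in code. All numbers are the ones quoted in this post (250 W node base power, 250 W and 4.5 TFlop/s per GPU in Turbo mode, 118 W at the 720 MHz sweet spot with 28.1 GFlop/s/W); the helper function and its name are of course purely illustrative.

/* Back-of-the-envelope planner for the numbers quoted above: how many GPUs
 * fit into the 3000 W budget and what LINPACK performance to expect.
 * All figures are taken from the text; the code is purely illustrative. */
#include <stdio.h>

static void plan(const char *label, double gpu_watts, double gpu_tflops)
{
    const double budget    = 3000.0;       /* competition power limit [W]  */
    const double node_base = 2.0 * 250.0;  /* two host nodes at ~250 W [W] */
    int gpus = (int)((budget - node_base) / gpu_watts);
    printf("%-20s %2d GPUs -> %5.1f TFlop/s (%4.0f W for GPUs)\n",
           label, gpus, gpus * gpu_tflops, gpus * gpu_watts);
}

int main(void)
{
    plan("Turbo (1380 MHz):", 250.0, 4.5);            /* naive strategy           */
    plan("720 MHz sweet spot:", 118.0, 0.118 * 28.1); /* 118 W x 28.1 GFlop/s/W   */
    return 0;
}

Running it reproduces the ten-GPU, 45 TFlop/s estimate of the naive strategy; the sweet-spot line already hints at the practical problem discussed below, namely that exploiting the optimum requires far more cards than the budget-per-card arithmetic alone suggests we could house.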

Figure 1: LINPACK energy efficiency as a function of GPU clock speed for Nvidia's Tesla P100 (16 GB, PCIe version) and V100 (also 16 GB, PCIe version) GPUs.

Note, however, that in practice this value was not attainable using our two-node setup, which provided room for at most sixteen GPUs: A single V100 GPU consumes 118 W when running at the most energy-efficient frequency of 720 MHz, which means a total of 21 GPUs would be required to exhaust the remaining power budget of 2500 W to achieve an energy efficiency of 28.1 GFlop/s/W. Moreover, we could get our hands on only twelve V100 GPUs (back then with a retail price tag of around $20,000 apiece, these cards were, after all, quite expensive). Nevertheless, using the approach of tuning the GPU core frequency to increase energy efficiency, we were able to get 51.7 TFlop/s out of our two-node twelve-GPU setup—exactly 20.0 TFlop/s more than our previous record of 31.7 TFlop/s set at ISC17. It goes without saying that this was enough to secure the LINPACK award in this year's competition.

Finally, it is worth pointing out that during the investigations several problems with Nvidia's software and hardware became apparent that Nvidia should address as quickly as possible. Some of these problems make it impossible to deliver consistent performance, while others significantly limit the attainable performance in settings with no power constraints (e.g., when determining a system's TOP500 score). But this is something I'll address in a separate post.




How to win LINPACK at the ISC-HPC '17 Student Cluster Competition

Student Cluster Competition trophies

This year I was one of the advisors of my university's student team at the student cluster competition at ISC-HPC in Frankfurt. In this competition, eleven teams from eight countries—each made up of six undergraduate students, their advisors, and an individually assembled “mini-cluster” (big shout-out to HPE for being our hardware sponsor, again!)—fight to get the best performance for a number of applications out of their hardware within an imposed 3 kW power limit. In addition to an overall winner, there is a separate trophy for the LINPACK benchmark—which gets a lot of special attention because LINPACK is used for ranking systems in the TOP500 list (released twice per year, it lists the five hundred fastest supercomputers on the planet).

Hoping that some of my previous research on energy-efficient HPC might prove useful, I instructed one of the students to measure LINPACK performance and the peak power consumption of the corresponding run for all supported GPU clock speeds of an Nvidia P100 PCIe GPU accelerator. Figure 1 shows the results. Two observations can be made: First, the relationship between core frequency and performance is linear; second, power grows superlinearly with frequency. The former is explained easily: Because the DGEMM kernel at the heart of LINPACK is compute-bound on modern architectures, doubling the frequency doubles the peak Flop/s of the machine; and because DGEMM is compute-bound, this will, for all intents and purposes, double performance as well (note that the increase in performance stops at 1252 MHz because the chip runs into its TDP limit of 250 W—meaning that for this application the chip does not manage to clock higher than 1252 MHz). The superlinear relationship between frequency and power can be explained by physics (don't worry, I won't go into detail here; for those of you interested, the relationship between power P and frequency f given in the relevant literature is P ∝ f^α for α > 1).
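A short aside for the mathematically inclined: these two observations already explain why an energy-efficiency sweet spot below the maximum frequency must exist. Assuming the power law above plus a frequency-independent baseline power P_0 (the constants c and P_0 are model parameters here, not measured values), we get

P(f) \approx P_0 + c\,f^{\alpha}, \qquad \mathrm{Perf}(f) \propto f

E(f) = \frac{\mathrm{Perf}(f)}{P(f)} \propto \frac{f}{P_0 + c\,f^{\alpha}}, \qquad
\frac{\mathrm{d}E}{\mathrm{d}f} = 0 \;\Longrightarrow\; f_{\mathrm{opt}} = \left(\frac{P_0}{c\,(\alpha-1)}\right)^{1/\alpha}

For α > 1 this optimum is finite, i.e., clocking beyond f_opt buys additional performance only at a disproportionate power cost, which is exactly the shape of the energy-efficiency curve shown in Figure 2 below.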

Figure 1: Measured LINPACK performance and dissipated power for different GPU clock speeds on Nvidia's P100 GPU (PCIe version).

Relating measured performance and dissipated power allows determining energy efficiency as a function of GPU core frequency. The result, shown in Figure 2, indicates that running the GPU in its default mode (i.e., without frequency adjustments) results in an energy efficiency of around 13.0 GFlop/s/W. In contrast, running the GPU's cores at a frequency in the range of 822–898 MHz results in an energy efficiency of 19.0 GFlop/s/W.
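The bookkeeping behind Figure 2 is trivial, but for completeness here is a minimal sketch of it in C. The three sample tuples are made-up placeholders, not the actual measurement data.

#include <stdio.h>

struct sample { int mhz; double gflops; double watts; };

int main(void)
{
    /* hypothetical (frequency, LINPACK performance, power) samples */
    struct sample s[] = {
        {  544, 2300.0, 135.0 },
        {  860, 3700.0, 195.0 },
        { 1252, 4600.0, 250.0 },
    };
    int n = sizeof s / sizeof s[0], best = 0;

    for (int i = 0; i < n; ++i) {
        double eff = s[i].gflops / s[i].watts;   /* GFlop/s per Watt */
        printf("%4d MHz: %5.2f GFlop/s/W\n", s[i].mhz, eff);
        if (eff > s[best].gflops / s[best].watts)
            best = i;
    }
    printf("most energy-efficient setting: %d MHz\n", s[best].mhz);
    return 0;
}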

Figure 2: Energy-efficiency metric [GFlop/s per Watt] derived from previous measurement results for LINPACK using different GPU clock speeds on Nvidia's P100 GPU (PCIe version).

All that was left to do after locating the energy-efficiency sweet spot was choosing a setup that minimized the power consumption of the host system(s) while allowing us to use as many GPUs as possible. We decided to go with two HPE ProLiant XL270d nodes in an HPE Apollo 6500 chassis (again, big thanks to Sorin-Cristian, Gallig, and Patrick from HPE!). Each of the nodes was equipped with two 14-core Intel Xeon E5-2690 v4 Broadwell-EP processors, 128 GB of RAM, a Mellanox EDR InfiniBand card, and six Nvidia P100 PCIe GPUs.

Although each ProLiant XL270d node is designed to fit eight PCIe accelerators, we found that the limited number of PCIe lanes (forty per Broadwell-EP processor, so eighty per node) caused inter-GPU communication bandwidth to degrade and become the bottleneck when using more than six GPUs per node (node performance peaked at six GPUs and fell when adding more cards!). The final setup was thus two nodes, each containing six Nvidia P100 GPUs.

At this point, all that was left to do was finding the GPU core frequency that took total system power consumption as close as possible to 3 kW—without going over it (dissipated power is measured during the whole competition and there are penalties for exceeding the imposed budget of 3 kW). At a frequency of 1063 MHz, total system power consumption was 2970 W. Although this frequency is outside the previously determined energy-efficiency sweet spot, overall performance is higher in this setting—and at the end of the day, this is what matters! Using this configuration, the system managed to set a new record of 37.1 TFlop/s for LINPACK—allowing our students to take first place! The achieved performance improved the previous record of 31.7 TFlop/s (established at ASC16 by a team also using Nvidia P100 GPUs) by 17%.




Manually setting the Uncore frequency on Intel CPUs

After receiving several inquiries about how to manually set the Uncore frequency, I wrote this article on the subject so it can be used as a reference by others. In it, I briefly discuss what the Uncore frequency is and why you should care about it, and suggest two ways to set it. A more detailed account of the Uncore and its impact can be found in our latest ISC paper [1] (try the pre-print version of the article if you have trouble accessing the conference proceedings for free from within your network).

Uncore frequency domain

If you are unfamiliar with the nomenclature, let's bring you up to speed: Each component of an Intel processor belongs either to the “core” or the “Uncore” part of the chip. The core part comprises the individual computational cores and their private L1 and L2 caches; the Uncore part comprises everything else, such as the shared last-level cache, the memory controllers, and the PCIe interfaces. Before Intel's Haswell microarchitecture, the core and Uncore parts of the chip shared the same frequency domain. This meant that setting your CPU cores to run at a clock frequency of 3.0 GHz resulted in the Uncore running at the same frequency.

Starting with the Haswell microarchitecture, things changed (see [1] for the rationale). Processors now feature separate frequency domains for the cores (in fact, each core's frequency can now be set individually) and the Uncore. This means you can run your cores and the Uncore at different clock speeds. Together with this change, Intel introduced a feature called Uncore frequency scaling (UFS), which, when enabled, allows the processor to dynamically change the Uncore frequency based on the current workload. When UFS is disabled, the Uncore is clocked at its maximum frequency.

Implications

The Uncore frequency can impact a number of key characteristics of your chip. For example, the Uncore clock has a direct impact on sustained main memory bandwidth. Figure 1 shows full-chip sustained main memory bandwidth for the Schönauer vector triad at different Uncore frequencies, measured on a Haswell-EP (Xeon E5-2695 v3) and a Broadwell-EP (Xeon E5-2697 v4) processor. Increasing the Uncore clock increases the bandwidth between main memory and the cores up to the point where the interconnect is fast enough to deliver the full main memory bandwidth; at this point, main memory becomes the bottleneck and increasing the Uncore clock further no longer results in higher bandwidth.

Figure 1: Measured sustained main memory bandwidth on selected CPUs as function of Uncore frequency.

From a performance perspective, it makes sense to run the Uncore at least at the frequency required to attain the maximum main memory bandwidth (i.e., at least 2.1 GHz for the Haswell-EP processor and at least 1.9 GHz for the Broadwell-EP processor). From an energy perspective, however, it makes sense to set the Uncore frequency to exactly the frequency at which the maximum main memory bandwidth is achieved; increasing the Uncore clock beyond this point no longer increases performance but does increase power consumption. Unfortunately, we found that UFS will run the Uncore at its maximum frequency at the slightest hint of memory starvation. While this is guaranteed to deliver the maximum performance, it does so at the cost of energy efficiency, because the same performance could also have been reached at a much lower Uncore frequency. Disabling UFS will not help either: Without UFS, the Uncore always runs at its maximum frequency. Optimizing energy efficiency for memory-bound codes thus requires manually setting the Uncore frequency.

So far, not setting the Uncore frequency manually only had implications for energy efficiency. We have, however, observed that using the chip in any of the possible default settings (i.e., running the chip with UFS either enabled or disabled) can also be detrimental to performance! Although the LINPACK benchmark is notoriously compute-bound, it has a non-negligible bandwidth requirement; as a result, UFS will clock the Uncore to its maximum frequency (which is of course also the case when UFS is disabled). LINPACK typically pushes the power consumption of a chip to its TDP limit, which means that the cores and the Uncore are competing for a limited power budget. Because the Uncore is clocked higher than necessary, its share of the power budget is higher than necessary as well; this in turn leaves less of the budget for the computational cores, which limits their attainable frequencies. Manually adjusting the Uncore clock allowed us to locate the Uncore frequency sweet spot for LINPACK that leaves the maximum amount of the power budget to the CPU cores while providing sufficient memory bandwidth not to starve the cores of data (again, see [1] for details).

Setting the Uncore frequency

The Uncore frequency is adjusted via the lower 16 bits of MSR 0x620. The register contents represent upper and lower bounds for the Uncore frequency: Bits 15 through 8 encode the minimum and bits 7 through 0 the maximum frequency. To derive a frequency from each of the two eight-bit groups, the integer encoded in the bits has to be multiplied by 100 MHz.

To access the MSR I recommend using Intel's msr-tools (note that the msr kernel module needs to be loaded!). You use rdmsr for reading and wrmsr for writing MSRs. Now that we have all of the theory covered, let's have a look at some examples.

A good way to start is to find out the supported Uncore frequency range of your processor. Enabling UFS in the BIOS will set minimum and maximum allowed frequencies in MSR 0x620 to the minimum and maximum supported Uncore frequencies of your processor. Enabling UFS in the BIOS of a system containing two Xeon E5-2697 v4 processors and reading the MSR's contents for one of the processors in the system (the -p flag is used to specify the processor number in a multi-socket system) yields the following:

broadep2:~ $ rdmsr -p 0 0x620
c1c

Note that rdmsr outputs the register's contents in hex (unfortunately, it does not prefix the value with 0x, as it should to avoid confusion between hex and decimal!) and does not print leading zeroes. Taking the liberty of fixing rdmsr's shortcomings, the MSR's contents in hex are 0x0c1c. The upper eight bits, i.e., 0x0c, correspond to a decimal value of 12; the lower eight bits, i.e., 0x1c, to a decimal value of 28. Multiplying these values by 100 MHz yields minimum and maximum Uncore frequencies of 1.2 GHz and 2.8 GHz, respectively.

Setting minimum and maximum frequencies is done via wrmsr. To set a new minimum frequency of 1.6 GHz and a new maximum frequency of 2.0 GHz, the frequencies are first divided by 100 MHz, yielding decimal values of 16 and 20. These are then converted to hexadecimal (0x10 for 16 and 0x14 for 20) and concatenated: 0x1014. The command line to make the frequency adjustments would therefore be: wrmsr -p 0 0x620 0x1014. If you'd like to fix the Uncore at a specific frequency, make the minimum and maximum frequency identical; e.g., if you want the Uncore fixed at 2.0 GHz you would use wrmsr -p 0 0x620 0x1414.
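If you prefer doing this programmatically, the same encoding can be written through the msr driver's device files. The following is a minimal C sketch, assuming the msr kernel module is loaded and the program has sufficient privileges to open /dev/cpu/0/msr (any logical CPU on the target socket will do); error handling is kept to a bare minimum.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_UNCORE_RATIO_LIMIT 0x620

int main(void)
{
    unsigned min_ratio = 16, max_ratio = 20;   /* in units of 100 MHz: 1.6 GHz and 2.0 GHz */

    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    /* the msr device is addressed by using the register number as the file offset */
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), MSR_UNCORE_RATIO_LIMIT) != sizeof(val)) {
        perror("pread");
        return 1;
    }

    /* bits 15:8 = minimum ratio, bits 7:0 = maximum ratio */
    val = (val & ~0xffffULL) | ((uint64_t)min_ratio << 8) | max_ratio;

    if (pwrite(fd, &val, sizeof(val), MSR_UNCORE_RATIO_LIMIT) != sizeof(val))
        perror("pwrite");

    close(fd);
    return 0;
}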

Note that specifying a frequency outside the supported range will not do what you might expect: specifying a value below or above the supported range simply results in the frequency being capped at the lowest or highest supported frequency, respectively.

If you're looking for an easier way to set the Uncore frequency, the likwid tool suite might be for you. Make sure to use at least version 4.3.0, as support for setting the Uncore frequency is pretty new. Using likwid-setFrequencies in combination with its --umin and --umax parameters provides a more pleasant interface than accessing the MSR directly. You can find more information about setting the Uncore frequency with likwid here.

Measuring the Uncore frequency to check whether your changes had any effect is done most easily via likwid-perfctr (which is also part of the likwid tool suite). Figure 2 shows the Uncore frequency and package power consumption based on the UNCORE_CLOCK and PWR_PKG_ENERGY hardware events for an idle Xeon E5-2697 v4 chip. The minimum and maximum Uncore frequencies were initially set to 1.2 GHz using wrmsr and increased by 100 MHz every five minutes. We find that the changes made via wrmsr to MSR 0x620 were successfully realized, as reflected by the measured Uncore frequency; in addition, the graph shows the Uncore frequency's impact on power consumption: Going from 1.2 GHz to 2.9 GHz doubles the dissipated power—even when the chip is idle!

Figure 2: Measured Uncore frequency and package power consumption on a Xeon E5-2697 v4 chip.



Evolution of Cache Replacement Strategies

To better acquaint my students with the intricacies of caches, I have them write an LRU cache simulator, which they then use to determine hit rates for a streaming access pattern. As a follow-up exercise, they have to compare the results of their simulations to data obtained via performance counters from the L1 cache of a real CPU (Ivy Bridge-EP). While working out the exercise, I took the opportunity to run some more benchmarks to get a better idea of what to expect from the L2 and L3 caches of recent Intel microarchitectures—a quest that included a painful trial-and-error search for the hardware events from which to infer these caches' hit rates!
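To give you an idea of what such a simulator boils down to, here is a stripped-down sketch (not the students' code): a set-associative cache with true LRU replacement, fed with a streaming access pattern of one access per 64-byte cache line. The parameters correspond to the L1 caches discussed below; the data-set size of 36 kB is just an example.

#include <stdio.h>
#include <string.h>

#define LINE_SIZE 64
#define WAYS       8
#define CACHE_KB  32
#define NUM_SETS  (CACHE_KB * 1024 / LINE_SIZE / WAYS)

static long tags[NUM_SETS][WAYS];   /* stored tags; -1 means invalid     */
static long used[NUM_SETS][WAYS];   /* higher value = more recently used */

/* returns 1 on a cache hit, 0 on a miss (evicting the LRU way) */
static int access_line(long line)
{
    static long stamp = 0;
    int set = line % NUM_SETS, victim = 0;
    long tag = line / NUM_SETS;

    for (int w = 0; w < WAYS; ++w) {
        if (tags[set][w] == tag) { used[set][w] = ++stamp; return 1; }
        if (used[set][w] < used[set][victim]) victim = w;
    }
    tags[set][victim] = tag;
    used[set][victim] = ++stamp;
    return 0;
}

int main(void)
{
    long dataset_kb = 36;                       /* 9/8 of the 32 kB capacity */
    long lines = dataset_kb * 1024 / LINE_SIZE;
    long hits = 0, accesses = 0;

    memset(tags, -1, sizeof(tags));

    for (int sweep = 0; sweep < 101; ++sweep)   /* sweep 0 only warms up the cache */
        for (long l = 0; l < lines; ++l) {
            int hit = access_line(l);
            if (sweep > 0) { ++accesses; hits += hit; }
        }

    printf("data set of %ld kB: hit rate %.1f %%\n",
           dataset_kb, 100.0 * hits / accesses);
    return 0;
}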

Methodology

For all of the following benchmarks I used the simple streaming code shown below, which accesses every eighth element of a double-precision floating-point array (a cache line on Intel processors is 64 bytes, so a stride of eight doubles—8 bytes each—results in touching each cache line exactly once).

LIKWID_MARKER_INIT;
LIKWID_MARKER_START("sum");

// choose T so that the runtime is ~1s
for (t=0; t<T; ++t) {
#pragma vector aligned
#pragma nounroll
    for (i=0; i<N; i=i+8)   // one access per 64-byte cache line
        sum += A[i];
}

LIKWID_MARKER_STOP("sum");
LIKWID_MARKER_CLOSE;

I used the likwid marker API in combination with likwid-perfctr to access the hardware performance counters, which measure total cache accesses and cache hits for a particular cache level. To avoid hardware prefetchers biasing the results, all hardware prefetchers (DCU IP prefetcher, DCU prefetcher, L2 hardware prefetcher, L2 adjacent cache line prefetcher) were disabled using likwid-features. Each empirical result corresponds to the median of one hundred samples. The processors used for the evaluation are based on the four most recent Intel Xeon EP microarchitectures: Intel Xeon E5-2680 (Sandy Bridge-EP), Xeon E5-2690 v2 (Ivy Bridge-EP), Xeon E5-2695 v3 (Haswell-EP), and Xeon E5-2697 v4 (Broadwell-EP).

L1 Caches

L1 cache hit rates

To start off, the figure above shows hit rates for the L1 caches on Sandy Bridge-EP (SNB), Ivy Bridge-EP (IVB), Haswell-EP (HSW), and Broadwell-EP (BDW). You can see that the measurements (black circles) almost perfectly correspond to the prediction of the cache simulator (dashed red line) on all four microarchitectures. The L1 cache uses eight ways on all four microarchitectures, so we observe non-zero hit rates as long as the data set we are streaming over is smaller than 9/8ths of the cache's 32 kB capacity: between one and 9/8 times the capacity, some of the eight-way sets still hold only eight of the data set's lines and keep hitting, while sets that have to hold nine lines miss on every access under LRU; once every set holds at least nine lines, the hit rate drops to zero. Cache hit rates are computed using the MEM_LOAD_UOPS_RETIRED_L1_HIT and MEM_UOPS_RETIRED_LOADS_ALL events.

L2 Caches

L2 cache hit rates

Results for the L2 caches—which, according to official documentation, implement an LRU replacement strategy as well—are shown in the figure above. Here, we can observe significant differences between SNB and IVB on one side and HSW and BDW on the other. For SNB and IVB, hit rates do not conform to an LRU replacement strategy; also, hit rates are very erratic—hinting at some sort of randomness involved in the replacement strategy. HSW and BDW hit rates correspond perfectly to the simulator and the analytic prediction (the L2 cache also uses eight ways, so cache hits can be observed as long as the data set is smaller than 9/8ths of the caches' capacity of 256 kB). The total number of L2 accesses was measured indirectly via the L1D_REPLACEMENT event: whenever a cache line residing in the L2 cache is accessed, a cache line in L1 is replaced with that line. L2 cache hits are collected using the MEM_LOAD_UOPS_RETIRED_L2_HIT event.

L3 Caches

L3 cache hit rates

According to documentation, the L3 cache uses a pseudo-LRU replacement strategy. The figure above nevertheless shows a dashed red line for LRU, just to get an idea of what the implemented replacement strategies are capable of. Because I wanted to compare replacement strategies, I normalized the x-axis to the cache size in percent instead of showing absolute capacity in kB, which of course increases with the number of cores and would bias the results. The first thing to notice is a change in the replacement strategy when going from SNB to IVB (note that microarchitectural changes are not exclusive to “tocks”!) and from IVB to HSW. Interestingly, the change from SNB to IVB results in a lower cache hit rate for our streaming access pattern; it might, however, result in better cache hit rates for other access patterns; also, replacement latency might be affected in one way or the other. When going from IVB to HSW, we find that the cache hit rate has improved significantly for streaming access patterns, with hit rates of up to 30% for data sets 1.5× the size of the cache!

Selecting the right events to compute the hit rate for the L3 cache proved a little difficult. In general, there are two ways to go: either using the offcore response counters or using the MEM_LOAD_UOPS_RETIRED.L3_ALL and MEM_LOAD_UOPS_RETIRED.L3_HIT events. The problem with the former is that it misses some occurrences of the events on SNB and IVB and is not reliable at all for measuring the events on HSW and BDW. That's why I went with the MEM_LOAD_UOPS_RETIRED.L3_* counters. But there are some pitfalls to this option as well. First, you have to make sure to use SSE loads on SNB and IVB, because on these microarchitectures the event doesn't count events triggered by AVX loads. Second, you have to work around a bug in the implementation of the counters on SNB: here, you can find a script called latego.py which you can use to make the counters work.



A First Glimpse at Skylake

As a “tock” in Intel's tick-tock model, Skylake-EP introduces major improvements over the previous Haswell-EP and Broadwell-EP microarchitectures. Some of the advertised enhancements include the introduction of 512-bit SIMD (AVX 3.2/AVX-512F), three instead of two memory controllers per chip (this alone should increase memory bandwidth by 50%, not taking into account further bandwidth improvements obtained by raising the DDR memory clock speed), and the somewhat mysterious appearance of an FPGA. Unfortunately, Skylake-EP is still far away—Intel hasn't even released the previous Broadwell-EP chips, which were originally scheduled for Q4/15 but are now expected in Q1/16. However, Skylake mobile and desktop chips have been available for quite some time, and it is about time we gave them a test-drive.

In this post I will examine an Intel Core i5-6500 chip using micro-benchmarks to investigate some of the improvements that already found their way into the desktop chips of the Skylake architecture. According to the Intel 64 and IA-32 Architectures Optimization Reference Manual these enhancements include improved front end throughput, deeper out-of-order execution, higher cache bandwidths, improved divider latency, lower power consumption, improved SMT performance, and balanced floating-point add, multiply, and fused multiply-add (FMA) instruction throughput and latency. From these candidates I picked those that are in my opinion most relevant to my work in high performance computing: micro-op throughput in the front end, new functional units and deeper out-of-order execution in the back end, caches, and instruction latencies.

Front and Back End Improvements

The Skylake front end has seen multiple improvements regarding micro-op throughput. The legacy decode pipeline, which decodes CISC instructions to RISC-like micro-ops, can now deliver up to five (previously: four) micro-ops per cycle to the decoded micro-op queue. The decoded micro-op queue has been renamed to instruction decode queue (IDQ). The decoded instruction cache, which caches up to 1536 decoded micro-ops, was renamed as well and is now known as decode stream buffer (DSB). The DSB bandwidth to the IDQ has been increased from four to six micro-ops per cycle.

In the back end, the out-of-order window was increased by over 16% from 192 (Haswell) to 224 micro-ops. Before Haswell, there only existed one floating-point add and one floating-point multiplication unit. Haswell then introduced two FMA units and doubled the number of multiplication units. Skylake now also doubled the number of add units, resulting in a total of two fused multiply-add, two multiplication, and two addition units. We verified this using a vector reduction micro-benchmark: Adding up the contents of one cache line (64 byte) requires two AVX load instructions (loading 32 byte each) and two AVX add instructions (processing 32 byte of inputs each). With sufficient loop unrolling to work around the instruction retirement limit of the core, these four instructions can be processed in a single clock cycle on Skylake. Before Skylake, processing the two AVX add instructions took two cycles, because there was only one add unit. These results are shown in the figure below.
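For reference, here is a sketch of the kind of reduction kernel used for this experiment, written with AVX intrinsics (this is my illustration, not the exact benchmark code; the array is assumed to be 32-byte aligned and its length a multiple of 32). Eight independent accumulators are used so that throughput, not the four-cycle add latency, limits the loop.

#include <immintrin.h>

/* sum up N doubles from a 32-byte aligned array; N must be a multiple of 32 */
double reduce(const double *A, long N)
{
    __m256d acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = _mm256_setzero_pd();

    for (long i = 0; i < N; i += 32)            /* 8 x 4 doubles = 4 cache lines */
        for (int k = 0; k < 8; ++k)
            acc[k] = _mm256_add_pd(acc[k], _mm256_load_pd(&A[i + 4*k]));

    /* horizontal clean-up */
    for (int k = 1; k < 8; ++k)
        acc[0] = _mm256_add_pd(acc[0], acc[k]);

    double tmp[4], sum = 0.0;
    _mm256_storeu_pd(tmp, acc[0]);
    for (int j = 0; j < 4; ++j)
        sum += tmp[j];
    return sum;
}

Per cache line, the inner loop issues exactly the two 32-byte loads and two 32-byte adds discussed above.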

Skylake add units

Cache Improvements

As advertised, cache performance is also significantly better across all cache levels on Skylake. The L1-L2 cache bandwidth is advertised as 64 byte per cycle; the L2-L3 cache bandwidth as 32 byte per cycle. In the figure above, we can see that the time it takes to process one cache line (64 bytes) of data increases from one to two cycles when dropping out of the L1 cache and into the L2 cache. This exactly corresponds to the delay caused by transferring the 64-byte cache line from L2 to L1 at the advertised speed of 64 byte per cycle. When dropping out of the L2 cache and into the L3 cache, we see performance degrade from two to four cycles per cache line—again, exactly what we expect the delay to be when transferring the cache line at a bandwidth of 32 byte per cycle from the L3 to the L2 cache. The improved in-memory performance of Skylake can be attributed to both a higher memory clock and a lower Uncore latency penalty, because the Skylake chip is a desktop part with only four cores, compared to the fourteen cores of the Haswell-EP chip.

Core cache capacities remain unchanged at 32 kB of L1 and 256 kB of L2 per core. However, the L2 cache is no longer 8-way associative; according to Intel, the decrease to 4-way associativity was motivated mainly by reduced power consumption. The i5-6500 has 1.5 MB of L3 cache per slice, for a total of 6 MB.

Instruction Latencies

As far as instruction latencies are concerned, there are two important things to note: First, the marketing of instruction latencies as “balanced” should be taken with a grain of salt. While most floating-point instructions now have a latency of four clock cycles, most of these latencies have increased compared to previous generations—making the positive connotation of “balanced” perhaps a bit misleading. Take, for example, floating-point multiplication: its latency was decreased from five cycles (Haswell) to three cycles (Broadwell); now (Skylake) it is four cycles. Floating-point addition, which had a latency of three cycles for several generations, has increased to four cycles.

My second remark on the subject of instruction latencies is that most latencies appear to be documented incorrectly in the current Optimization Reference Manual (Order Number 248966-031, September 2015). The table below summarizes the measured instruction latencies; values in red indicate deviations between my measurements and the values taken from the Optimization Reference Manual.

Instruction                               Measured [cycles]   Documented [cycles]   Table in Optimization Manual
vaddps/vaddpd (AVX)                       4                   4                     C-8
addps/addpd (SSE)                         5                   4                     C-15, C-16
addss/addsd (scalar)                      5                   4                     C-15, C-16
vmulps/vmulpd (AVX)                       4                   4                     C-8
mulsd/mulpd (scalar/SSE)                  4                   3                     C-15
mulss/mulps (scalar/SSE)                  4                   4                     C-15
vfmaddxxxps (AVX)                         4                   4                     (balanced latency)
vfmaddxxxpd (AVX)                         5                   4                     (balanced latency)
vfmaddxxxps/vfmaddxxxpd (SSE)             5                   4                     (balanced latency)
vfmaddxxxss/vfmaddxxxsd (scalar)          5                   4                     (balanced latency)
vdivpd, divpd, divsd (AVX, SSE, scalar)   13                  14                    C-8, C-15
vdivps, divps, divss (AVX, SSE, scalar)   11                  11                    C-8, C-16

Most interesting to me is the different latency of AVX and SSE/scalar floating-point add instructions. While my measurements for the AVX instructions fit the documented latency of four cycles, they indicate that the SSE and scalar instructions have a latency of five cycles instead of the documented four. This suggests that there are separate hardware units for scalar/SSE and AVX add instructions. While this may seem unlikely at first, it makes perfect sense when taking the new Turbo mode into account: Starting with Haswell, a core is clocked either at the advertised Turbo frequency or at a slightly lower frequency (called AVX Turbo), depending on whether it executes only scalar and SSE instructions or AVX instructions as well. This means that the scalar/SSE floating-point units have to be clocked higher than the AVX units, which can be achieved by increasing the pipeline depth, e.g., from four to five stages, corresponding to an increase in latency from four to five cycles.

For AVX multiplication (vmulps/vmulpd), the measured latency of four clock cycles fits the documentation. A discrepancy arises for the scalar and SSE double-precision multiply instructions. FMA latencies are not explicitly documented in the Optimization Reference Manual, but the “balanced” instruction latency implies a latency of four cycles. Measurements confirm only the vfmaddxxxps instruction as having a latency of four cycles; all other FMA instructions have a latency of five clock cycles. The measured latencies of the divide instructions are 13 cycles for double precision (vdivpd, divpd, divsd) and 11 cycles for single precision (vdivps, divps, divss); the documented latencies are 14 cycles for double precision and 11 cycles for single precision. The major improvement for division lies in the fact that the latencies of the AVX instructions match those of the SSE instructions—indicating full-width AVX divider units, in contrast to previous architectures.
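In case you want to reproduce such numbers yourself, the basic technique is to time a long chain of dependent instructions and divide by the chain length. Below is a minimal sketch (not my actual measurement harness) that assumes the core clock is pinned and ticks at the same rate as the TSC; in practice you should fix the frequency first and use a proper timing methodology.

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const long N = 100000000;                /* length of the dependency chain */
    __m256 acc = _mm256_set1_ps(1.0f);
    __m256 inc = _mm256_set1_ps(1e-7f);

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < N; ++i)
        acc = _mm256_add_ps(acc, inc);       /* each add depends on the previous one */
    unsigned long long t1 = __rdtsc();

    float out[8];
    _mm256_storeu_ps(out, acc);              /* keep the result alive */
    printf("vaddps latency: ~%.2f cycles (checksum %f)\n",
           (double)(t1 - t0) / N, (double)out[0]);
    return 0;
}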

Conclusion

We verified that the Core i5-6500 Skylake chip already has some of the features expected in Skylake-EP, such as the second add pipeline, better cache bandwidths than Haswell-EP, and “balanced” instruction latencies, but we are eagerly awaiting the release of the full-blown Skylake-EP microarchitecture, which will bring additional changes such as the introduction of AVX-512.




Memory Bandwidth on Haswell-EP

Apart from upgrading from DDR3 to DDR4 and increasing the memory clock, it seems that Intel also worked some magic by decoupling the frequency domains of the cores and the Uncore.

Before Haswell, the sustained memory bandwidth was coupled to the CPU frequency. The reason is that on Sandy Bridge and Ivy Bridge, the Uncore part of the chip, which connects the cores to the memory controllers, runs at the same clock frequency as the fastest clocked core. The figure below shows the sustained bandwidth for a load-only benchmark as a function of the number of CPU cores. Using a Sandy Bridge-EP Xeon E5-2680 chip in Turbo mode (black line), a sustained bandwidth of about 45 GB/s is possible; however, using the same chip with the cores clocked at the lowest available frequency of 1.2 GHz (dashed black line), the sustained memory bandwidth drops below 24 GB/s—a decline of almost 50%! The same behavior can be observed for an Ivy Bridge-EP Xeon E5-2690 v2 chip. Here, the sustained main memory bandwidth in Turbo mode (red line) is about 56 GB/s; with the cores clocked at 1.2 GHz (dashed red line), the sustained memory bandwidth is just below 30 GB/s—again a decline of almost 50%!

Sustained memory bandwidth on Haswell

The situation is different on a Haswell-EP Xeon E5-2695 v3 chip. Here, both in Turbo mode (blue line) and at the lowest possible frequency (dashed blue line), a sustained main memory bandwidth of 62 GB/s can be reached—although saturating the bandwidth takes more cores at the lower frequency. Bearing in mind that CPU frequency is the most important variable influencing package power usage, this invariance of sustained memory bandwidth with respect to core frequency has significant ramifications for the energy usage of bandwidth-limited algorithms: absent the need for a high clock frequency to perform the actual processing, the CPUs' frequencies can be lowered, thereby decreasing power consumption, while the sustained memory bandwidth stays constant. A more detailed examination of the reduced power requirements can be found in our technical report on Haswell [1].




Correctly Measuring Runtimes

While measuring the performance or runtime of a code may seem trivial, there is a lot that can go wrong. In this post I discuss some of the most frequent problems my students encounter and how to work around them.

The issue that probably arises most often is related to the Turbo mode found in modern processors. Typically, CPU cores run at the lowest possible frequency when there is no work to do in order to conserve energy—which is a good thing. As soon as demand for computing power arises, the cores get clocked up. This makes perfect sense for most use cases. If, however, you want to verify the runtime of your code against a performance model, you do not want your clock frequency to change. A varying clock rate during measurements will leave you clueless about what the actual frequency was during your measurement: Your CPU will probably start in the lowest frequency state. After your code has been running for some milliseconds (the timeframe a frequency change takes), it will clock up. However, the core may not adjust to the maximum frequency in just one recalibration; it may go through some or all of the possible CPU frequencies until it reaches the maximum.

Running your code for a really long time to guarantee that the CPU spent most of the time at its maximum Turbo frequency is also not a good idea, because it is not trivial to determine the maximum CPU frequency. For example, the maximum Turbo frequency specified by Intel is the guaranteed frequency when using a single core that does not execute AVX code. The actual frequency may be higher than the Turbo frequency guaranteed by the spec sheet. The reason for this lies in the nature of the fabrication process, which causes the quality of chips to vary; as a consequence, some chips of a given CPU model can achieve higher clock rates than others. Starting with Haswell, the situation gets even more complicated, because different guaranteed Turbo frequencies exist depending on whether your code uses AVX instructions or not. The solution to this problem is to fix the CPU clock frequency. This can comfortably be done with various tools such as likwid-setFrequencies from the likwid suite or cpufreq-set from the cpufrequtils package available for most Linux distributions.

Another problem I've seen students struggle with is the question of how long their code should run in order to obtain significant measurement results. We will examine this question using an example:

runtime = 0.0;   // make the first check of the loop condition well defined

for (T=1; runtime<min_time; T*=2) {
    start = get_time();

    for (t=0; t<T; ++t)
        ddot(A, B, N);

    runtime = get_time() - start;
}

T /= 2;   // undo the final doubling; the last measurement was taken with this T
printf("MUp/s: %f\n", (N*T)/runtime/1e6);   // N*T updates in `runtime` seconds, reported in millions

The ddot(A, B, N) function shown above computes the dot product of the arrays A and B, both of length N. The question is how many repetitions T of the code we should run in order to get a significant result. To answer this question, we set the number of repetitions based on a minimum runtime min_time that we will vary. In the plot below, the x-axis shows different minimum measurement times. For each of them, one hundred samples were taken on a single core of an Ivy Bridge-EP Xeon E5-2660 v2 chip. The y-axis shows the coefficient of variation of the hundred samples taken for a given minimum measurement time. The coefficient of variation is calculated as the standard deviation σ of the distribution divided by its mean μ; it is therefore a measure of how scattered the samples are. A high scattering indicates that a single sample may deviate a lot from the mean. We find that a runtime of at least 10 ms is required to keep the variation of the samples at around 1%. Note that, in accord with the previous paragraph about Turbo mode, the black line, which corresponds to measurements taken without fixing the frequency, exhibits higher variance than the samples that were taken with the frequency fixed at the nominal CPU clock of 2.2 GHz.
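For completeness, this is all the statistics involved: a minimal sketch of computing the coefficient of variation from a set of runtime samples (the sample values below are made up).

#include <math.h>
#include <stdio.h>

// coefficient of variation: standard deviation divided by the mean
double coefficient_of_variation(const double *x, int n)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; ++i) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
    var /= n;
    return sqrt(var) / mean;
}

int main(void)
{
    double samples[] = { 10.2e-3, 9.9e-3, 10.1e-3, 10.4e-3, 9.8e-3 };  // made-up runtimes [s]
    printf("CV = %.2f %%\n", 100.0 * coefficient_of_variation(samples, 5));
    return 0;
}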

Deviation in results

The last point I would like to mention, which was already hinted at in the plot above, is thread pinning. As always, thread pinning is recommended to increase the reproducibility of results. More information about thread pinning can be found in one of my previous posts.




The Benefits of Thread Pinning

By default, your operating system lets software threads float between all available processors. This means that when your system's scheduler decides to run some kernel routine on core x, it might preempt a thread running on that core and move it to core y. While this might seem like a good idea—i.e., the thread can continue running somewhere else while the kernel routine is running—it can actually be the cause of performance issues.

One of the major problems with moving threads between physical cores is related to caches. Assume a thread is running on core x and has filled the core's 32 kB of L1 cache with data. Now the thread is moved to core y. The thread continues to work on its data, which is no longer available in the new core's cache(s). Recent Intel architectures (e.g., Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake) have inclusive last-level caches. This means that in the best-case scenario (where the data in core x's L1 cache was not modified) the data can be brought in from the shared L3 cache, which due to inclusiveness holds a copy of all data residing anywhere in the cache hierarchy. On Ivy Bridge, which provides a bandwidth of 32 byte/cycle between the L3 and L2 cache as well as between the L2 and L1 cache, it takes over two thousand cycles to move the 32 kB from the shared L3 to the L1 cache of core y. If the data was modified, this penalty is doubled, because the data has to be moved from core x's L1 cache to the shared L3 cache before it can be brought into core y's caches, which makes for a worst-case penalty of roughly four thousand cycles.

The problem is easily solved by thread pinning. Both Linux and Windows provide the means to pin threads via the sched_setaffinity and SetProcessAffinityMask calls, respectively. These functions enable programs to specify a pinning mask—a list of processors on which the calling thread may be scheduled—for each thread. By assigning each thread its dedicated processor, the issue of “cold caches” no longer arises: whenever a thread is preempted on its core, thread pinning guarantees that the thread will resume execution on exactly that core.
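On Linux, the whole affair boils down to a few lines. Here is a minimal sketch using the sched_setaffinity call mentioned above (for pthreads code, pthread_setaffinity_np works analogously); the choice of core 2 is arbitrary.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// pin the calling thread to a single core; returns 0 on success
int pin_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);   // pid 0 = the calling thread
}

int main(void)
{
    if (pin_to_core(2) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on core %d\n", sched_getcpu());
    return 0;
}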

Implications of thread pinning

To give an example of the real-world impact of thread pinning, consider the figure above. It depicts the performance of a 2D Jacobi stencil on a dual-socket Ivy Bridge-EP Xeon E5-2650 v2 machine with DDR3-1600 memory for varying data-set sizes. Not only do we get much better performance when employing thread pinning, the performance is also more consistent. The unpinned version behaves erratically, which can be attributed to the fact that thread preemption may happen more often in some runs than in others.