Apart from upgrading from DDR3 to DDR4 and increasing the memory clock, it seems that Intel also worked some magic by decoupling the frequency domains of the cores and the Uncore.
Before Haswell, the sustained memory bandwidth was coupled to the CPU frequency. The reason is that on Sandy Bridge and Ivy Bridge, the Uncore part of the chip, which connects the cores to the memory controllers, runs at the same clock frequency as the fastest-clocked core. The figure below shows the sustained bandwidth for a load-only benchmark as a function of the number of CPU cores. Using a Sandy Bridge EP Xeon E5-2680 chip in turbo mode (black line), a sustained bandwidth of about 45 GB/s is possible; however, with the cores of the same chip clocked at the lowest available frequency of 1.2 GHz (dashed black line), the sustained memory bandwidth drops below 24 GB/s, a decline of almost 50%! The same behavior can be observed for an Ivy Bridge EP Xeon E5-2690 v2 chip. Here the sustained main memory bandwidth in turbo mode (red line) is about 56 GB/s; with cores clocked at 1.2 GHz (dashed red line), the sustained memory bandwidth is just below 30 GB/s, again a decline of almost 50%!
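The relative drops quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses the approximate bandwidth values from the text ("about 45", "below 24", "about 56", "just below 30" GB/s) as inputs; the function name `relative_drop` and the dictionary layout are purely illustrative.

```python
# Approximate sustained load-only bandwidth in GB/s, taken from the text.
# The low-clock values are upper bounds ("below 24", "just below 30"),
# so the real drops are slightly larger than what is computed here.
measurements = {
    "Sandy Bridge EP (E5-2680)": {"turbo": 45.0, "low": 24.0},
    "Ivy Bridge EP (E5-2690 v2)": {"turbo": 56.0, "low": 30.0},
}


def relative_drop(turbo_bw, low_bw):
    """Fraction of bandwidth lost when going from turbo to 1.2 GHz."""
    return (turbo_bw - low_bw) / turbo_bw


for chip, bw in measurements.items():
    drop = relative_drop(bw["turbo"], bw["low"])
    print(f"{chip}: at least {drop:.0%} lower at 1.2 GHz")
```

Both chips lose a little under half of their sustained bandwidth, which is what "almost 50%" refers to.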
The situation is different on a Haswell EP Xeon E5-2695 v3 chip. Here a sustained main memory bandwidth of 62 GB/s can be reached both in turbo mode (blue line) and at the lowest possible frequency (dashed blue line), although saturating the bandwidth takes more cores at the lower frequency. Bearing in mind that CPU frequency is the most important variable influencing package power usage, this invariance of sustained memory bandwidth with respect to frequency has significant ramifications for the energy usage of bandwidth-limited algorithms: absent the need for a high clock frequency to perform actual processing, the CPU frequency can be lowered, thereby decreasing power consumption, while sustained memory bandwidth stays constant. A more detailed examination of the reduced power requirements can be found in our technical report on Haswell.
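To make the energy argument concrete, the sketch below models a purely bandwidth-bound kernel: runtime is the data volume divided by the sustained bandwidth, and energy is package power times runtime. Only the 62 GB/s figure comes from the measurements above. The data volume and both package power values are hypothetical placeholders chosen for illustration, not measured numbers, and the model assumes enough cores are used to saturate the bandwidth at either frequency.

```python
def energy_to_solution(data_volume_gb, bandwidth_gbs, package_power_w):
    """Energy in joules for a purely bandwidth-bound kernel.

    runtime [s] = data volume [GB] / sustained bandwidth [GB/s]
    energy  [J] = package power [W] * runtime [s]
    """
    runtime_s = data_volume_gb / bandwidth_gbs
    return package_power_w * runtime_s


HASWELL_BW_GBS = 62.0      # sustained bandwidth from the text, same at both clocks
DATA_VOLUME_GB = 620.0     # hypothetical amount of data streamed from memory
POWER_TURBO_W = 120.0      # hypothetical package power in turbo mode
POWER_LOW_CLOCK_W = 80.0   # hypothetical package power at the lowest clock

e_turbo = energy_to_solution(DATA_VOLUME_GB, HASWELL_BW_GBS, POWER_TURBO_W)
e_low = energy_to_solution(DATA_VOLUME_GB, HASWELL_BW_GBS, POWER_LOW_CLOCK_W)

print(f"turbo: {e_turbo:.0f} J, lowest clock: {e_low:.0f} J")
print(f"energy saved at lowest clock: {1 - e_low / e_turbo:.0%}")
```

Because the bandwidth, and therefore the runtime, is the same at both frequencies, any reduction in package power carries over directly into energy to solution. On a chip where the bandwidth falls with the clock, the longer runtime would eat into that saving.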