Just a quick summary of a great presentation from Build 2014 called Native Code Performance on Modern CPUs: A Changing Landscape.

The presenter, Eric Brumer (from the Visual C++ compiler team), talked in quite a unique way about the deep-down details of code optimization: why it is better to let the compiler do the hard work, why the new and powerful FMA (fused multiply-add) instructions can sometimes slow down your code, and how to think about code performance in general.

Summary  

Visual Studio supports code generation with SIMD instructions: /arch:SSE and /arch:SSE2, and then /arch:AVX and /arch:AVX2. The last one will be available in VS 2013 Update 2, and the code it generates runs on Intel Haswell chips only.
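If you want to check which of these options actually reached the compiler, the predefined macros will tell you. Here is a small sketch. The function name `targeted_simd_level` is my own invention. It relies on MSVC's `__AVX__`, `__AVX2__`, `_M_IX86_FP` and `_M_X64` macros; GCC and Clang define `__AVX__`, `__AVX2__` and `__SSE2__` for the corresponding `-m` flags.

```cpp
// Hypothetical helper: reports the highest SIMD level the compiler was told
// to target, based only on predefined macros (nothing is checked at runtime).
//  - MSVC defines __AVX2__ for /arch:AVX2 and __AVX__ for /arch:AVX.
//  - _M_IX86_FP (MSVC, 32-bit x86 only) is 1 for /arch:SSE, 2 for SSE2 and up.
//  - On x64 (_M_X64) SSE2 is always available, so no flag is needed for it.
//  - GCC/Clang define __AVX2__, __AVX__ and __SSE2__ for the matching -m flags.
const char* targeted_simd_level() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(_M_X64) || defined(__SSE2__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
    return "SSE";
#else
    return "scalar (no x86 SIMD target detected)";
#endif
}
```

Compiling the same file with /arch:AVX and then with /arch:AVX2 should change the returned string. That is a quick sanity check before you start comparing timings.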

Profile, profile, profile! I hear this all the time when watching or reading any presentation about performance. Maybe they are all right! :)
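Before you reach for a real profiler, even a crude stopwatch can confirm whether a change matters at all. Here is a minimal sketch, assuming a C++11 compiler. `best_time_ns` and `sum` are hypothetical helpers and were not shown in the talk.

```cpp
#include <chrono>
#include <vector>

// Hypothetical helper: runs `fn` `repetitions` times and returns the best
// (minimum) wall-clock time in nanoseconds, or -1 if repetitions <= 0.
// Taking the minimum filters out some noise (interrupts, frequency ramp-up),
// but this is only a stopwatch: it tells you how long, not why.
template <typename Fn>
long long best_time_ns(Fn&& fn, int repetitions = 5) {
    long long best = -1;
    for (int i = 0; i < repetitions; ++i) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto stop = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (best < 0 || ns < best) best = ns;
    }
    return best;
}

// Example kernel to measure: a plain sum over a vector.
double sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}
```

Store the result somewhere the optimizer cannot discard, for example `volatile double sink; best_time_ns([&] { sink = sum(data); });`. Otherwise the loop being timed may be removed entirely.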

FMA can slow down the code!

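  • It will be faster for a = y*x + z, but not for a = y*x + z*w
  • On Intel (Haswell), mul latency is 5 cycles, add is 3 cycles, and FMA is 5.
  • So for the latter equation, the two muls execute in parallel (5 cycles) and are then added (3 cycles): 8 cycles in total.
  • The FMA version must first compute z*w with a mul (5 cycles) and then feed the result into an FMA (5 cycles): 10 cycles in total.
  • Conclusion: be careful. Fewer instructions do not always mean a shorter dependency chain.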
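The cycle counting above can be written down as a tiny critical-path model. This is a simplified sketch. It assumes the latencies quoted in the talk (mul 5, add 3, FMA 5) and unlimited execution ports, so independent operations overlap fully. The function names are hypothetical. The two `dot2_*` functions show the source-level shapes being compared, with `std::fma` from `<cmath>` forcing a fused operation.

```cpp
#include <algorithm>
#include <cmath>

// Latencies in cycles as quoted in the talk (Intel Haswell).
constexpr int kMul = 5;
constexpr int kAdd = 3;
constexpr int kFma = 5;

// a = y*x + z*w written two ways.
// Plain: two independent multiplies, then one add.
double dot2_plain(double y, double x, double z, double w) {
    return y * x + z * w;
}
// Fused: z*w must finish before the FMA can start.
double dot2_fma(double y, double x, double z, double w) {
    return std::fma(y, x, z * w);
}

// Critical-path latency (cycles) under the simplified model above.
int latency_plain() { return std::max(kMul, kMul) + kAdd; }  // muls overlap: 5 + 3
int latency_fma() { return kMul + kFma; }                    // serial: 5 + 5
// For a = y*x + z there is only one product, so FMA wins.
int latency_single_plain() { return kMul + kAdd; }  // 5 + 3
int latency_single_fma() { return kFma; }           // 5
```

Real code also has throughput limits and surrounding instructions. Whether the compiler contracts the plain `y * x + z * w` into an FMA on its own depends on its floating-point settings, so the generated assembly is worth a look.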

256-bit code does not run 2x faster than 128-bit code!

  • Computation and instruction execution are 2x faster, but we still have to wait for memory.
  • Highly efficient code is actually memory efficient code.
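One standard way to reason about this is the roofline model. I am bringing it in here myself; it was not named in the talk. Attainable throughput is the smaller of two numbers: peak compute, and memory bandwidth times arithmetic intensity (flops per byte moved). Here is a sketch with a hypothetical SAXPY-style loop.

```cpp
#include <algorithm>

// Roofline estimate: attainable GFLOP/s is capped either by peak compute
// or by how fast memory can feed the loop (bandwidth * flops-per-byte).
double attainable_gflops(double peak_gflops, double bandwidth_gb_s,
                         double flops_per_byte) {
    return std::min(peak_gflops, bandwidth_gb_s * flops_per_byte);
}

// y[i] = a * x[i] + y[i] on floats: 2 flops per element and
// 12 bytes moved (load x, load y, store y; 4 bytes each).
constexpr double kSaxpyIntensity = 2.0 / 12.0;
```

Take a hypothetical 20 GB/s of bandwidth. This loop tops out around 3.3 GFLOP/s whether the vector units peak at 100 or 200 GFLOP/s. Wider vectors only help once the kernel does enough work per byte to be compute-bound.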


Source: Native Code Performance on Modern CPUs: A Changing Landscape

In the last part of the presentation there was an analysis of a performance bug in the Eigen3 math library:

  • Compiling with /arch:AVX2 (and /arch:AVX) caused a 60% slowdown on Haswell chips!
  • BTW: there was no difference between /arch:SSE2 and /arch:AVX on Sandy Bridge.
  • The problem was caused by a bottleneck in the CPU store buffer. I hadn't heard about it before, but using it carefully can give you a huge boost (or problems :))
  • Here is a nice link with more info about store buffers on Sandy Bridge and Haswell.
  • CPUs are so sophisticated that they 'analyze' the code as it runs, and sometimes this introduces subtle secondary bugs like this one. You need to know profiler tools to analyze such situations properly.

Wrap up:
Highly efficient code is actually memory efficient code.

Overall, the presentation was great!

The pace of the presentation seemed quite slow, but that is actually good: you retain more of the information. I definitely need to look for more presentations from Eric. You can find them, for instance, here on Channel 9.