Last Update:
The Amazing Performance of C++17 Parallel Algorithms, is it Possible?
Table of Contents
With the addition of Parallel Algorithms in C++17, you can now easily update your “computing” code to benefit from parallel execution. In the article, I’d like to examine one STL algorithm which naturally exposes the idea of independent computing. If your machine has 10-core CPU, can you always expect to get 10x speed up? Maybe more? Maybe less? Let’s play with this topic.
Update 13th Nov: I’ve applied the comments from r/cpp discussions, used proper ranges for trigonometry/sqrt computations, and some minor changes. The benchmarks were executed another time.
Intro to Parallel Algorithms
C++17 offers the execution policy parameter that is available for most of the algorithms:
sequenced_policy
- is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm’s execution may not be parallelized.- the corresponding global object is
std::execution::seq
- the corresponding global object is
parallel_policy
- is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized.- the corresponding global object is
std::execution::par
- the corresponding global object is
parallel_unsequenced_policy
- is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized and vectorized.- the corresponding global object is
std::execution::par_unseq
- the corresponding global object is
In short:
- use
std::execution::seq
to execute your algorithm sequential - use
std::execution::par
to execute your algorithm in parallel (usually using some Thread Pool implementation) - use
std::execution::par_unseq
to execute your algorithm in parallel with also ability to use vector instructions (like SSE, AVX)
As a quick example you can invoke std::sort
in a parallel way:
std::sort(std::execution::par, myVec.begin(), myVec.end());
// ^^^^^^^^^^^^^^^^^^^
// execution policy
Please notice that it’s so easy just to add parallel execution parameter to an algorithm! But can you always experience a huge performance boost? Is it always faster? Or maybe there are cases where it might slow things down?
Parallel std::transform
In this post I’d like to have a look at std::transform
algorithm that
potentially might be one of the building blocks of other parallel
techniques (along with std::transform_reduce
, for_each
, scan
,
sort
…).
Our test code will revolve around the following pattern.
std::transform(execution_policy, // par, seq, par_unseq
inVec.begin(), inVec.end(),
outVec.begin(),
ElementOperation);
Assuming the ElementOperation
function doesn’t use any method of
synchronisation, then the code might have a good potential to be
executed in parallel or even vectorised. Each computation for an element
is independent, the order is not important, so the implementation might
spawn multiple threads (possibly on a thread pool) to process elements
independently.
I’d like to experiment with the following cases.
- size of the vector - big or small
- simple transformations that spend time mostly on memory access
- more arithmetic (ALU) operations
- ALU in a more realistic scenario
As you can see, I’d like not only to test the number of elements that is “good” to use a parallel algorithm, but also ALU operations that keep CPU busy.
Other algorithms like sorting, accumulate (in the form of std::reduce
)
also offers parallel execution, but they require more work (and usually
merging steps) to compute the results. So they might be candidates for
another article.
Note on Benchmarks
I’m using Visual Studio 2017, 15.8 for my tests - as it’s the only
implementation in a popular compiler/STL implementation at the moment
(Nov 2018) (GCC on the way!). What’s more, I focused only on
execution::par
as execution::par_unseq
is not available in MSVC
(works the same way as execution::par
).
I have two machines:
- i7 8700 - PC, Windows 10, i7 8700 - clocked at 3.2 GHz, 6 cores/12 threads (Hyperthreading)
- i7 4720 - Notebook, Windows 10, i7 4720, clocked at 2.6GHz, 4 cores/8 threads (Hyperthreading)
the code is compiled in x64, Release more, auto vectorisation is enabled by default, and I’ve enabled enhanced instruction set (SSE2), as well as OpenMP (2.0)
The code is located on my github:
github/fenbf/ParSTLTests/TransformTests/TransformTests.cpp
For OpenMP (2.0) I’m only using parallel for loops:
#pragma omp parallel for
for (int i = 0; ...)
I’m running the code section 5 times, and I look at the min numbers.
Warning: The results are shown only to present some rough observations, and please run it on your system/configuration before using it in production. Your requirements and environment might be different than mine.
You can read more about MSVC implementation in this post:
Using C++17 Parallel Algorithms for Better Performance | Visual C++
Team
Blog
And here’s a recent Billy O’Neil’s talk at CppCon 2018 (Billy
implemented Parallel STL in MSVC):
https://www.youtube.com/watch?v=nOpwhTbulmk
OK, let’s start with some basic examples!
Simple Transformation
Consider a case where you apply a really simple operation on the input vector. It might be a copy or a multiplication of elements.
For example:
std::transform(std::execution::par,
vec.begin(), vec.end(), out.begin(),
[](double v) { return v * 2.0; }
);
My machine has 6 or 4 cores… can I expect to get 4…6x perf of a sequential execution?
Here are the results (time in milliseconds):
As you see on the faster machine, you need like 1 million elements to start seeing some performance gains. On the other hand on my notebook, all parallel implementations were slower.
All in all, as might guess there’s a weak chance we’ll any considerable speed-up using such transformations, even when we increase the number of elements.
Why is that?
Since the operations are elementary, CPU cores can invoke it almost immediately, using only a few cycles. However, CPU cores spend more time waiting for the main memory. So, in that case, they all be mostly waiting, not computing.
Reading or writing to a variable in memory takes only 2-3 clock cycles if it is cached, but several hundred clock cycles if it is not cached
https://www.agner.org/optimize/optimizing_cpp.pdf
We can give a rough observation that if your algorithm is memory bound, then you cannot expect to have a better performance with the parallel execution.
More Computations
Since memory throughput is essential and might slow things down… let’s increase the number of computations that affect each element.
The idea is that it’s better to use CPU cycles rather than spend time waiting for memory.
For a start, I’ll use trigonometry functions, for example,
sqrt(sin*cos)
(those are arbitrary computations, not optimal form,
just to keep CPU busy).
We’re using sqrt
, sin
and cos
which might take up ~20 per sqrt,
~100 per a trigonometry function. That amount of computation might cover
the latency on the memory access.
More about instruction latencies in this great Perf Guide from Agner Fog
Here’s the benchmark code:
std::transform(std::execution::par, vec.begin(), vec.end(), out.begin(),
[](double v) {
return std::sqrt(std::sin(v)*std::cos(v));
}
);
How about now? Can we get some better perf than our previous attempt?
Here are the results (time in milliseconds):
Now, we’re finally seeing some nice numbers :)
For 1000 elements (not shown here), the timings for parallel and sequential were similar, so above 1000 elements, we can see some improvements for the parallel version.
For 100k elements, the faster machine performs almost 9x faster than the sequential version (similarly for OpenMP version).
For the largest set of a million elements - it’s 5x or 8x faster.
For such computations, I could achieve the speed-up that is “linear” to my CPU core count. Which is probably what we should expect.
Fresnel and 3D Vectors
In the section above I’ve used some “imaginary” computations, but how about some real code?
Let’s compute Fresnel equations that describe reflection and refraction of light at uniform planar interfaces. It’s a popular technique for generating realistic lightning in 3D games.
Photo from
Wikimedia
As a good reference I’ve found this great description and the
implementation:
Introduction to Shading (Reflection, Refraction and
Fresnel)
@scratchapixel.com
About Using GLM library
Rather than creating my own implementation, I’ve used the glm
library. I’ve used it a lot in
my OpenGL projects.
The library is available easily through Conan Package Manager, so I’ll be using that as well:
The link to the package: https://bintray.com/bincrafters/public-conan/glm%3Ag-truc
Conan file:
[requires]
glm/0.9.9.1@g-truc/stable
[generators]
visual_studio
and the command line to install the library (it will generate props file that I can use with my Visual Studio project)
conan install . -s build_type=Release -if build_release_x64 -s arch=x86_64
The library is header only, so it’s also easy to download it manually if you prefer.
The actual code & benchmark
I’ve adapted the code for glm
from
scratchapixel.com:
// implementation adapted from https://www.scratchapixel.com
float fresnel(const glm::vec4 &I, const glm::vec4 &N, const float ior)
{
float cosi = std::clamp(glm::dot(I, N), -1.0f, 1.0f);
float etai = 1, etat = ior;
if (cosi > 0) { std::swap(etai, etat); }
// Compute sini using Snell's law
float sint = etai / etat * sqrtf(std::max(0.f, 1 - cosi * cosi));
// Total internal reflection
if (sint >= 1)
return 1.0f;
float cost = sqrtf(std::max(0.f, 1 - sint * sint));
cosi = fabsf(cosi);
float Rs = ((etat * cosi) - (etai * cost)) /
((etat * cosi) + (etai * cost));
float Rp = ((etai * cosi) - (etat * cost)) /
((etai * cosi) + (etat * cost));
return (Rs * Rs + Rp * Rp) / 2.0f;
}
The code uses a few maths instructions, dot product, multiplications, divisions, so that should keep CPU busy as well. Rather than a vector of doubles we’re also using 4-element vectors, so the memory used has also increased.
The benchmark:
std::transform(std::execution::par,
vec.begin(), vec.end(), vecNormals.begin(), // input vectors
vecFresnelTerms.begin(), // output term
[](const glm::vec4& v, const glm::vec4& n) {
return fresnel(v, n, 1.0f);
}
);
Here are the results (time in milliseconds):
With the “real” computations we can see that parallel algorithms offer good performance. On my two Windows machines, for such operations, I could get speed-up that is almost linear to the number of cores.
For all tests I also showed you result from OpenMP and both implementations: MSVC and OpenMP seem to perform similarly.
Summary
In the article, I’ve shown three cases where you can start using
parallel execution and parallel algorithms. While replacing all standard
algorithms with just their std::execution::par
version might be
tempting, it’s not always a good way to do that! Each operation that you
use inside an algorithm might perform differently and be more CPU or
Memory bound, and that’s why you have to consider each change
separately.
Things to remember
- parallel execution will, in general, do more work than the sequential version, it’s because the library has to prepare the parallel execution
- it’s not only the count of elements that is important but also the number of instructions that keeps CPU busy
- it’s best to have tasks that don’t depend on each other nor other shared resources
- parallel algorithms offers a straightforward way to spawn work into separate threads
- if your operations are memory bound that you cannot expect much performance increase, or in some cases, the algorithm might be slower
- to get decent performance increase always measure the timings for each problem, as in some cases the results might be completely different
Special Thanks to JFT for help with the article!
For more references you can also have a look at my other resources about parallel algorithms:
- Fresh chapter in my C++17 In Detail Book about Parallel Algorithms.
- Parallel STL And Filesystem: Files Word Count Example
- Examples of Parallel Algorithms From C++17
Have a look at another article related to Parallel Algorithms: How to Boost Performance with Intel Parallel STL and C++17 Parallel Algorithms
Your Turn
What’s the answer to my question from the title? Can we get the amazing performance from parallel algorithms?
Have you played with the parallel execution? Did it bring the expected speed up?
In the article I’ve only touched “simple” parallel algorithms -
std::transform
. Things get even more complicated when we talk about
std::reduce
.
I've prepared a valuable bonus if you're interested in Modern C++!
Learn all major features of recent C++ Standards!
Check it out here: