C++17 in details: Parallel Algorithms

Table of Contents

Writing multithreaded code is hard. You want to utilize all of the machine’s processing power, keep code simple and avoid data races at the same time.

Let’s see how C++17 can make writing parallel code a bit easier.

Intro

With C++11/14 we’ve finally got threading into the standard library. You can now create std::thread and not just depend on third party libraries or a system API. What’s more, there’s also async processing with futures.

For example, in 2014 I wrote about using async tasks in this article: Tasks with std::future and std::async.

Multithreading is a significant aspect of modern C++. In the committee, there’s a separate “SG1, Concurrency” group that works on bringing more features to the standard.

What’s on the way?

Coroutines,
Atomic Smart pointers,
Transactional Memory,
Barriers,
Tasks blocks.
Parallelism
Compute
Executors
Heterogeneous programming models support
maybe something more?

And why do we want to bring all of those features?

There’s a famous talk from Sean Parent about better concurrency. It was a keynote at CppNow 2012, here’s a recent version from 2016 from code::dive 2016.

Do you know how much of the processing power of a typical desktop machine we can utilize using only the core version of C++/Standard Library?

50%,
100%?
10%?

Sean in his talk explained that we can usually access only around 0,25% with single-threaded C++ code and maybe a few percent when you add threading from C++11/14.

So where’s the rest of the power?

GPU and Vectorization (SIMD) from CPU.

Of course, some third party APIs allow you to access GPU/vectorization: for example, we have CUDA, OpenCL, OpenGL, vectorized libraries, etc. There’s even a chance that your compiler will try to auto-vectorize some of the code. Still, we’d like to have that kind of support directly from the Standard Library. That way common code can be used on many platforms.

With C++11/14 we got a lot of low-level features. But it’s still tough to use them effectively. What we need is an abstraction. Ideally, code should be auto-threaded/parallelized, of course with some guidance from a programmer.

C++17 moves us a bit into that direction and allows us to use more computing power: it unlocks the auto vectorization/auto parallelization feature for algorithms in the Standard Library.

Plus of course, not everything can be made parallel/multi threaded as there’s Amdahl’s law to contend with. So always using 100% (110% with CPU boost :)) of the machine power is only a theoretical case. Still, it’s better to strive for it rather than write everything single-threaded.

The Series

This post is the seventh in the series about C++17 features.

The plan for the series

Fixes and deprecation
Language clarification
Templates
Attributes
Simplification
Library changes - Filesystem
Library changes - Parallel STL (today)
Library changes - Utils
Wrap up, Bonus - with a free ebook! :)

Documents & Links

Just to recall:

First of all, if you want to dig into the standard on your own, you can read the latest draft here:

N4659, 2017-03-21, Draft, Standard for Programming Language C++

from isocpp.org.

Also, you can grab my list of concise descriptions of all of the C++17 - It’s a one-page reference card:

Download a free copy of my C++17 Cheat Sheet!

Links:

Compiler support: C++ compiler support
The official paper with changes: P0636r0: Changes between C++14 and C++17 DIS
There’s also a talk from Bryce Lelbach: C++Now 2017: C++17 Features
My master C++17 features post: C++17 Features
Jason Turner: C++ Weekly channel, where he covered most (or even all!) of C++17 features.

And the books:

OK, let’s discuss the parallel algorithms!

Overview

I’ve already mentioned the reasoning why we want to have so many ‘tools’ for multithreading/computing in the Standard.

The TS paper describing what was merged into the Standard: P0024R2

The new feature looks surprisingly simple from a user point of view. You just have a new parameter that can be passed to most of the std algorithms: this new parameter is the execution policy.

std::algorithm_name(policy, /* normal args... */);

I’ll go into the detail later, but the general idea is that you call an algorithm and then you specify how it can be executed. Can it be parallel, maybe vectorized, or just serial.

That hint is necessary because the compiler cannot deduce everything from the code (at least not yet :)). We, as authors of the code, only know if there are any side effects, possible race conditions, dead locks, or if there’s no sense in running it parallel (like if you have a small collection of items).

Current implementation

I hope this article will be soon updated, but for now, I have bad news.

Unfortunately, as of today, none of the major compilers support the feature.

Update: 20th Dec 2017: MSVC in the version 15.5.2 can support: all_of, any_of, for_each, for_each_n, none_of, reduce, replace, replace_if, sort.
See this post from VC blog

However you can play with the following implementations/API’s:

Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
- You can have a look at Rainer’s article: C++17: New Parallel Algorithms where he used HPX for code samples.
Parallel STL - https://parallelstl.codeplex.com/
Intel - https://software.intel.com/en-us/get-started-with-pstl
n3554 - proposal implementation (inited by Nvidia) https://github.com/n3554/n3554
Thibaut Lutz: http://github.com/t-lutz/ParallelSTL

Execution policies

The execution policy parameter will tell the algorithm how it should be executed. We have the following options:

sequenced_policy - is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm’s execution may not be parallelized.
- the corresponding global object is std::execution::seq
parallel_policy - is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized.
- the corresponding global object is std::execution::par
parallel_unsequenced_policy - is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized and vectorized.
- the corresponding global object is std::execution::par_unseq

Note that those are unique types, with their corresponding global objects. It’s not just an enum.

Sequential execution seems obvious, but what’s the difference between par and par_unseq?

I like the example from Bryce Adelstein’s talk:

If we have a code like

double mul(double x,double y) {
    return x * y;
}

std::transform(
    // "Left" input sequence.
    x.begin(), x.end(),
    y.begin(), // "Right" input sequence.
    z.begin(),// Output sequence.
    mul);

The sequential operations that will be executed with the following instructions:

load x[i]
load y[i]
mul
store into z[i]

With the par policy the whole mul() for the i-th element will be executed on one thread, the operations won’t be interleaved. But different i can be on a different thread.

With par_unseq mul() each operation can be on a different thread, interleaved. In practice it can be vectorized like:

load x[i... i+3]
load y[i...i+3]
mul // four elements at once
store into z[i...i+3]

Plus, each of such vectorized invocation might happen on a different thread.

With par_unseq function invocations might be interleaved, so using vectorized unsafe code is not allowed: no mutexes or memory allocation… More on that here: @cppreference.

Also, the current approach allows you to provide nonstandard policies, so compiler/library vendors might be able to provide their extensions.

Let’s now see what algorithms were updated to handle the new policy parameter.

Algorithm update

Most of the algorithms (that operates on containers/ranges) from the Standard Library can handle execution policy.

What have we here?

adjacent difference, adjacent find.
all_of, any_of, none_of
copy
count
equal
fill
find
generate
includes
inner product
in place merge, merge
is heap, is partitioned, is sorted
lexicographical_compare
min element, minmax element
mismatch
move
n-th element
partial sort, sort copy
partition
remove + variations
replace + variations
reverse / rotate
search
set difference / intersection / union /symmetric difference
sort
stable partition
swap ranges
transform
unique

The full list can be found here: @cppreference.

A simple example:

std::vector<int> v = genLargeVector();

// standard sequential sort
std::sort(v.begin(), v.end());

// explicitly sequential sort
std::sort(std::seq, v.begin(), v.end());

// permitting parallel execution
std::sort(std::par, v.begin(), v.end());

// permitting vectorization as well
std::sort(std::par_unseq, v.begin(), v.end());

New algorithms

A few existing algorithms weren’t ‘prepared’ for parallelism, but instead we have new, similar versions:

for_each- similar to std::for_each except returns void.
for_each_n - applies a function object to the first n elements of a sequence.
reduce - similar to std::accumulate, except out of order execution.
exclusive_scan - similar to std::partial_sum, excludes the i-th input element from the i-th sum.
inclusive_scan - similar to std::partial_sum, includes the i-th input element in the i-th sum
transform_reduce - applies a functor, then reduces out of order
transform_exclusive_scan - applies a functor, then calculates exclusive scan
transform_inclusive_scan - applies a functor, then calculates inclusive scan

For example, we can use for_each (or new for_each_n) with an execution policy, but assuming we don’t want to use the return type of the original for_each.

Also, there’s an interesting case with reduce. This new algorithm provides a parallel version of accumulate. But it’s important to know the difference.

Accumulate returns the sum of all the elements in a range (or a result of a binary operation that can be different than just a sum).

std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

int sum = std::accumulate(v.begin(), v.end(), /*init*/0);

The algorithm is sequential only; a parallel version will try to compute the final sum using a tree approach (sum sub-ranges, then merge the results, divide and conquer). Such method can invoke the binary operation/sum in a nondeterministic order. Thus if binary_op is not associative or not commutative, the behavior is also non-deterministic.

For example, we’ll get the same results for accumulate and reduce for a vector of integers (when doing a sum), but we might get a slight difference for a vector of floats or doubles. That’s because floating point operations are not associative.

Summary

Is that the end for today?

Multithreading/Concurrency/Parallelism are huge topics to discover and understand. I’ll hope to return with some more examples (possibly with some working implementation in common compilers!). So for now, I’ve described only the tip of an iceberg :)

From this post, I’d like you to remember that concurrency/parallelism is one of the key areas in the C++ standard and a lot of work is being done to bring more features.

With C++17 we get a lot of algorithms that can be executed in a parallel/vectorized way. That’s amazing, as it’s a solid abstraction layer. With this making, apps is much easier. A similar thing could possibly be achieved with C++11/14 or third-party APIs, but now it’s all in the standard.

Do you use any other parallel libraries? CUDA? SYCL? Intel TBB? Something else?
Do you try to make you code multi threading or write most of the code single threaded?

Below I’ve also gatherd a few valuable resources/articles/talks so that you can learn more.

Resources

The original paper for the spec: P0024R2

The initial TS paper: PDF: A Parallel Algorithms Library | N3554

ModernesCpp articles about parallel STL:

Bryce Adelstein’s talk about parallel algorithms. Contains a lot of examples for map reduce
(transform reduce) algorithm:

And the Sean Parent talk about better concurrency in C++