Last Update:
How to Filter Elements - the Talk and Optimizations
Table of Contents
Two weeks ago, 20th May, I had a pleasure to talk about filtering elements on our Cracow C++ User Group online meeting.
Here are the slides and additional comments from the presentation.
Finally Restarted
After a few months of break, we finally restarted our Cracow’s C++ group!
Thus far we had two presentations in 2021:
- April: Preconditions, Postconditions, Invariants: How They Help Write Robust Programs
- May: N Different Ways to Filter Containers in Modern C++
And the plan is to have new meetings in June, July and probably in August… and then we’ll see :)
The Talk
I based my talk on two articles that I published this year:
- 12 Different Ways to Filter Containers in Modern C++ - C++ Stories
- Implementing Parallel copy_if in C++ - C++ Stories
While the topic might sound relatively simple, I found it a good way to cover various elements of Modern C++.
During the presentation I went from:
- raw loops
- basic algorithms
- remove erase idiom
to latest C++ features:
- consistent container erasure (
std::erase
,std::erase_if
) - concepts
- ranges
- making the function more generic with
if constexpr
And finally it’s also a good candidate to discuss parallel algorithms.
Filter Optimization - reserve()
Thanks to discussions during the presentation we spotted another optimization that I could add to the filtering code.
Is short: try to use vector.reserve()
as much as possible :)
For example:
std::vector<std::vector<T>> copiedChunks(chunks);
for_each(execution::par, begin(indexes), end(indexes), [&](size_t i) {
auto startIt = std::next(std::begin(vec), i * chunkLen);
auto endIt = std::next(startIt, chunkLen);
std::copy_if(startIt, endIt, std::back_inserter(copiedChunks[i]), p);
});
The above code processes in parallel chunks of data and puts the result in the temporary buffer: copiedChunks[i]
. Then the buffers are merged into the final output container.
If we add just one line - reserve
:
for_each(execution::par, begin(indexes), end(indexes), [&](size_t i) {
copiedChunks[i].reserve(chunkLen); // <<
auto startIt = std::next(std::begin(vec), i * chunkLen);
auto endIt = std::next(startIt, chunkLen);
std::copy_if(startIt, endIt, std::back_inserter(copiedChunks[i]), p);
});
The results?
// 4-core notebook, processing 100000 double elements
FilterCopyIfParChunksReserve 4.0733
FilterCopyIfParChunks 4.7641
// 6-core system
FilterCopyIfParChunksReserve 1.7926
FilterCopyIfParChunks 2.4202
And similarly the version with std::future
got just one line of reserve to the code that processes chunks:
// 4-core notebook, processing 100000 double elements
CopyIfParChunksFutureReserve 3.729
FilterCopyIfParChunksFuture 5.0829
// 6-core
CopyIfParChunksFutureReserve 1.5663
FilterCopyIfParChunksFuture 2.1709
As you can see limiting the number of memory allocations have a dramatic effect on the final performance.
Compare all versions
Thanks to a simple refactoring (thanks JFT!) I could grab all the timings and then present them in easier to read form:
struct Timing {
std::string name;
double time{};
size_t ret{};
};
template <typename TFunc>
void RunAndMeasure(const char* title, TFunc func, std::vector<Timing>& timings) {
const auto start = std::chrono::steady_clock::now();
auto ret = func();
const auto end = std::chrono::steady_clock::now();
DoNotOptimizeAway(ret);
const auto t = std::chrono::duration <double, std::milli>(end - start).count();
timings.emplace_back(title, t, ret);
}
And then:
RunAndMeasure("FilterCopyIfParComposeSeq ", [&testVec, &test]() {
auto filtered = FilterCopyIfParComposeSeq(testVec, test);
return filtered.size();
}, timings);
RunAndMeasure("FilterCopyIfParTransformPush", [&testVec, &test]() {
auto filtered = FilterCopyIfParTransformPush(testVec, test);
return filtered.size();
}, timings);
And then sort them:
std::ranges::sort(timings, {}, &Timing::time);
for (const auto& t : timings)
std::cout << t.name << ' ' << t.time << '\n';
Example results:
// 4 cores
benchmark vec size: 100000
transform only par 0.9143
FilterCopyIfParCompose 1.6469
FilterCopyIfParComposeSeq 1.7508
CopyIfParChunksFutureReserve 2.116
FilterCopyIfParTransformPush 2.2456
FilterCopyIfParChunksFuture 2.3864
FilterCopyIfParChunksReserve 2.5725
FilterCopyIfParChunks 3.267
transform only seq 3.6129
FilterEraseIf 4.9439
FilterCopyIf 5.4613
FilterCopyIfParNaive 8.2299
And here’s another run on my 6-core machine:
// 6 cores
benchmark vec size: 100000
transform only par 0.5735
FilterCopyIfParComposeSeq 1.3249
FilterCopyIfParCompose 1.4094
CopyIfParChunksFutureReserve 1.5663
FilterCopyIfParChunksReserve 1.7926
FilterCopyIfParTransformPush 1.8641
transform only seq 2.1457
FilterCopyIfParChunksFuture 2.1709
FilterCopyIfParChunks 2.4202
FilterEraseIf 3.3632
FilterCopyIf 3.6737
FilterCopyIfParNaive 9.6767
Here’s an interesting result:
FilterEraseIf 3.9558
FilterCopyIf 4.8056
It looks like it’s faster to copy the whole container and then erase elements, than add matching elements with push_back()
. I guess this is because of many memory allocations that happens with push_back()
and copu_if
. On the other hand when we create a copy, we have only a single memory allocation.
Code
You can find all of the code here @Github
https://github.com/fenbf/articles/tree/master/filterElements
The Slides
Here are the slides @Xara:
https://shared.xara.com/s71rTu9LcG
The Video
See below:
Invitation
If you like to hear more from the Cracow User Group, please join our meetup page. Thanks to the online presence we hope to be more “open” and allow to join the meeting even if you’re not in Cracow :)
I've prepared a valuable bonus if you're interested in Modern C++!
Learn all major features of recent C++ Standards!
Check it out here: