Two weeks ago, on 20th May, I had the pleasure of talking about filtering elements at our Cracow C++ User Group online meeting.

Here are the slides and additional comments from the presentation.

Finally Restarted

After a break of a few months, we finally restarted our Cracow C++ group!

So far, we've had two presentations in 2021:

And the plan is to have new meetings in June, July and probably in August… and then we’ll see :)

The Talk

I based my talk on two articles that I published this year:

While the topic might sound relatively simple, I found it a good way to cover various elements of Modern C++.

During the presentation I went from:

  • raw loops
  • basic algorithms
  • the remove-erase idiom

to the latest C++ features:

  • consistent container erasure (std::erase, std::erase_if)
  • concepts
  • ranges
  • making the function more generic with if constexpr

And finally it’s also a good candidate to discuss parallel algorithms.

Filter Optimization - reserve()

Thanks to discussions during the presentation we spotted another optimization that I could add to the filtering code.

In short: try to use vector.reserve() as much as possible :)

For example:

std::vector<std::vector<T>> copiedChunks(chunks);

for_each(execution::par, begin(indexes), end(indexes), [&](size_t i) {
    auto startIt = std::next(std::begin(vec), i * chunkLen);
    auto endIt = std::next(startIt, chunkLen);
    std::copy_if(startIt, endIt, std::back_inserter(copiedChunks[i]), p);
});

The above code processes chunks of data in parallel and puts the results in the temporary buffer copiedChunks[i]. Then the buffers are merged into the final output container.

If we add just one line, reserve():

for_each(execution::par, begin(indexes), end(indexes), [&](size_t i) {
    copiedChunks[i].reserve(chunkLen); // <<
    auto startIt = std::next(std::begin(vec), i * chunkLen);
    auto endIt = std::next(startIt, chunkLen);
    std::copy_if(startIt, endIt, std::back_inserter(copiedChunks[i]), p);
});

The results?

// 4-core notebook, processing 100000 double elements
FilterCopyIfParChunksReserve 4.0733
FilterCopyIfParChunks        4.7641

// 6-core system
FilterCopyIfParChunksReserve 1.7926
FilterCopyIfParChunks        2.4202

And similarly, the version with std::future got just one extra line of reserve() in the code that processes chunks:

// 4-core notebook, processing 100000 double elements
CopyIfParChunksFutureReserve 3.729
FilterCopyIfParChunksFuture  5.0829

// 6-core
CopyIfParChunksFutureReserve 1.5663
FilterCopyIfParChunksFuture  2.1709

As you can see, limiting the number of memory allocations has a dramatic effect on the final performance.

Compare all versions

Thanks to a simple refactoring (thanks JFT!), I could grab all the timings and then present them in an easier-to-read form:

struct Timing {
    std::string name;
    double time{};
    size_t ret{};
};

template <typename TFunc> 
void RunAndMeasure(const char* title, TFunc func, std::vector<Timing>& timings) {
    const auto start = std::chrono::steady_clock::now();
    auto ret = func();
    const auto end = std::chrono::steady_clock::now();
    DoNotOptimizeAway(ret);

    const auto t = std::chrono::duration<double, std::milli>(end - start).count();

    timings.emplace_back(title, t, ret); // emplace of an aggregate: needs C++20
}

And then:

RunAndMeasure("FilterCopyIfParComposeSeq   ", [&testVec, &test]() {
    auto filtered = FilterCopyIfParComposeSeq(testVec, test);
    return filtered.size();
}, timings);

RunAndMeasure("FilterCopyIfParTransformPush", [&testVec, &test]() {
    auto filtered = FilterCopyIfParTransformPush(testVec, test);
    return filtered.size();
}, timings);

And then sort them:

std::ranges::sort(timings, {}, &Timing::time);

for (const auto& t : timings)
    std::cout << t.name << ' ' << t.time << '\n';

Example results:

// 4 cores
benchmark vec size: 100000
transform only par           0.9143
FilterCopyIfParCompose       1.6469
FilterCopyIfParComposeSeq    1.7508
CopyIfParChunksFutureReserve 2.116
FilterCopyIfParTransformPush 2.2456
FilterCopyIfParChunksFuture  2.3864
FilterCopyIfParChunksReserve 2.5725
FilterCopyIfParChunks        3.267
transform only seq           3.6129
FilterEraseIf                4.9439
FilterCopyIf                 5.4613
FilterCopyIfParNaive         8.2299

And here’s another run on my 6-core machine:

// 6 cores
benchmark vec size: 100000
transform only par           0.5735
FilterCopyIfParComposeSeq    1.3249
FilterCopyIfParCompose       1.4094
CopyIfParChunksFutureReserve 1.5663
FilterCopyIfParChunksReserve 1.7926
FilterCopyIfParTransformPush 1.8641
transform only seq           2.1457
FilterCopyIfParChunksFuture  2.1709
FilterCopyIfParChunks        2.4202
FilterEraseIf                3.3632
FilterCopyIf                 3.6737
FilterCopyIfParNaive         9.6767

Here’s an interesting result:

FilterEraseIf                3.9558
FilterCopyIf                 4.8056

It looks like it's faster to copy the whole container and then erase elements than to add matching elements with push_back(). I guess this is because of the many memory allocations that happen with push_back() and copy_if. On the other hand, when we create a copy, we have only a single memory allocation.

Code

You can find all of the code here @GitHub:

https://github.com/fenbf/articles/tree/master/filterElements

The Slides

Here are the slides @Xara:

https://shared.xara.com/s71rTu9LcG

The Video

See below:

Invitation

If you'd like to hear more from the Cracow User Group, please join our meetup page. Thanks to the online presence, we hope to be more "open" and allow people to join the meeting even if they're not in Cracow :)

https://www.meetup.com/C-User-Group-Cracow/