Packing Bools, Performance tests

Table of Contents

Imagine you have an array of booleans (or an array of ‘conditions’), and you want to pack it - so you use only one bit per boolean. How to do it? Let’s do some experiments!

Updated: 8th May 2017

Read the second part here and also one update.

Motivation

I started writing this post because I came across a similar problem during my work some time ago. The code in one area of our system packed boolean results of a condition into bits. I wondered if I could optimize that process. This ‘algorithm’ is not a rocket science, but as usually, it opened a whole box of details and interesting solutions. So I decided to share it with my readers.

To illustrate the problem, we might think about an image in greyscale. We want to generate another image that has only two colors: white or black; we use a threshold value to distinguish between white and black color from the input image.

outputColor[x][y] = inputColor[x][y] > Threshold;

The input has some integer range (like 0…255), but the output is boolean: true/false.

Like here, image thresholding:

Then we want to pack those boolean values into bits so that we save a lot of memory. If bool is implemented as 8bit unsigned char, then we can save 7/8 of memory!

For example, instead of using 128kb for 256x512 greyscale image, we can now use 16kb only.

256 X 512 = 131072 (bytes) = 128kb
131072/8 = 16384 (bytes) = 16kb

Should be simple to code… right?

The algorithm

To make things clear let’s make some initial assumptions:

input:
- array of integer values
- length of the array: N
- threshold value
output:
- array of BYTES of the length M
- M - number of bytes needed to write N bits
- i-th bit of the array is set when inputArray[i] > threshold.

Brief pseudo code

for i = 0...N-1
    byte = pack (input[i] > threshold, 
                 input[i+1] > threshold, 
                 ..., 
                 input[i+7] > threshold)
    output[i/8] = byte
 i+=8

// handle case where N not divisible by 8

Alternatively, we might remove the threshold value and just take input array of booleans (so there won’t be any need to make comparisons).

Drawbacks of packing

Please note that I only focused on the ‘packing’ part. With the packed format you save memory, but there are more instructions to unpack a value. Sometimes this additional processing might cause the slow-down of the whole process! Always measure measure measure because each case might be different!

This problem is similar to compression algorithms, although packing is usually much faster process. As always, there’s a conflict between the storage and the computation power (Space–time tradeoff).

The benchmark

I want to compare several implementations:

the baseline: no packing: just store boolean values
std::bitset
std::vector of bools
one ‘manual’ version
second ‘manual’ version
threshold value: 127 so we’ll get 50% of chances to get true and true.

Plus, the next time we’ll also add parallel options…

For the benchmarking library, I decided to use Celero. You can find more details about using it in my post about Benchmarking Libs for C++.

With Celero there’s an easy way to express different options for the benchmark. So for example, I’d like to run my code against different sizes of the input array: like 100k, 200k, … Also, there’s a clean way to provide setUp/tearDown methods that will be invoked before each run.

The base fixture provides input array:

inputValues.reset(new int[N]);
referenceValues.reset(new bool[N]);
arrayLength = N;

//Standard mersenne_twister_engine seeded with 0, constant
std::mt19937 gen(0);
std::uniform_int_distribution<> dist(0, 255);

// set every byte
for (int64_t i = 0; i < experimentValue; ++i)
{
    inputValues[i] = dist(gen);
    referenceValues[i] = inputValues[i] > ThresholdValue;
}

The baseline

Originally I used the bitset version as a baseline, but that could be misleading. Thanks to the comments I’ve updated the benchmarks. It’s much better to see the ‘no packing’ version as the baseline, so that we can see if we gain something or not.

It might happen than versions with packing will perform slower than the simple approach.

The code is as follows:

for (size_t i = 0; i < arrayLength; ++i)
    outputValues[i] = inputValues[i] > ThresholdValue;

outputValues is an array of bool.

Do you like perf optimization topics? Signup to my newsletter for more.

std::bitset<N>

OK, this version will be really simple, take a look:

for (int64_t i = 0; i < arrayLength; ++i)
    outputBitset.set(i, inputValues[i] > ThresholdValue);

The only drawback of using bitset is that it requires compile time N constant. Also, bitset is implementation specific, so we’re not sure how the memory is laid out internally. I would reject this version from the final production code, but might be good for comparisons.

For example, here’s the fixture for this baseline benchmark:

class StdBitsetFixture : public CompressBoolsFixture
{
public:
    virtual void tearDown()
    {
        for (int64_t i = 0; i < arrayLength; ++i)
            Checker(outputBitset[i], referenceValues[i], i);
    }

    std::bitset<MAX_ARRAY_LEN> outputBitset;
};

In tearDown we check our generated values with the reference - Checker just checks the values and prints if something is not equal.

std::vector<bool>

Another simple code. But this time vector is more useful, as it’s dynamic and the code is still super simple.

for (int64_t i = 0; i < arrayLength; ++i)
    outputVector[i] = inputValues[i] > ThresholdValue;

And the fixture:

class StdVectorFixture : public CompressBoolsFixture
{
public:
    virtual void setUp(int64_t experimentValue) override
    {
        CompressBoolsFixture::setUp(experimentValue);

        outputVector.resize(experimentValue);
    }

    virtual void tearDown()
    {
        for (int64_t i = 0; i < arrayLength; ++i)
            Checker(outputVector[i], referenceValues[i], i);
    }

    std::vector<bool> outputVector;
};

This time, we generate the vector dynamically using experimentValue (N

the size of the array).

Remember that vector<bool> is a special implementation of the vector. It doesn’t contain array of bools, but it holds just bits (in an unspecified way). In terms of memory it should use much less space than unpacked version.

Still, vector<bool> might not be a good choice for the production code; see 17.1.1 Do not use std::vector | High Integrity C++ Coding Standard.

Manual version

The first two versions (and the baseline) were just to start with something, let’s now create some ‘real’ manual code :)

I mean ‘manual’ since all the memory management will be done but that code. Also, there won’t be any abstraction layer to set/get bits.

The setup looks like this:

virtual void setUp(int64_t experimentValue) override
{
    CompressBoolsFixture::setUp(experimentValue);
    numBytes = (experimentValue + 7) / 8;
    numFullBytes = (experimentValue) / 8;
    outputValues.reset(new uint8_t[numBytes]);
}

outputValue is just a unique_ptr to array of uint8_t. We have N/8 full bytes and also there’s one at the end that might be partially filled.

The first case will use just one variable to build the byte. When this byte is complete (8 bits are stored), we can save it in the output array:

uint8_t OutByte = 0;
int shiftCounter = 0;

auto pInputData = inputValues.get();
auto pOutputByte = outputValues.get();

for (int64_t i = 0; i < arrayLength; ++i)
{
    if (*pInputData > ThresholdValue)
        OutByte |= (1 << shiftCounter);

    pInputData++;
    shiftCounter++;

    if (shiftCounter > 7)
    {
        *pOutputByte++ = OutByte;
        OutByte = 0;
        shiftCounter = 0;
    }
}

// our byte might be incomplete, so we need to handle this:
if (arrayLength & 7)
    *pOutputByte++ = OutByte;

Improvement

The first manual version has a little drawback. As you see, there’s only one value used when doing all the computation. This is quite inefficient as there’s little use of instruction pipelining.

So I came up with the following idea:

uint8_t Bits[8] = { 0 };
const int64_t lenDivBy8 = (arrayLength / 8) * 8;

auto pInputData = inputValues.get();
auto pOutputByte = outputValues.get();

for (int64_t i = 0; i < lenDivBy8; i += 8)
{
    Bits[0] = pInputData[0] > ThresholdValue ? 0x01 : 0;
    Bits[1] = pInputData[1] > ThresholdValue ? 0x02 : 0;
    Bits[2] = pInputData[2] > ThresholdValue ? 0x04 : 0;
    Bits[3] = pInputData[3] > ThresholdValue ? 0x08 : 0;
    Bits[4] = pInputData[4] > ThresholdValue ? 0x10 : 0;
    Bits[5] = pInputData[5] > ThresholdValue ? 0x20 : 0;
    Bits[6] = pInputData[6] > ThresholdValue ? 0x40 : 0;
    Bits[7] = pInputData[7] > ThresholdValue ? 0x80 : 0;

    *pOutputByte++ = Bits[0] | Bits[1] | Bits[2] | Bits[3] | 
                     Bits[4] | Bits[5] | Bits[6] | Bits[7];
    pInputData += 8;
}
if (arrayLength & 7)
{
    auto RestW = arrayLength & 7;
    memset(Bits, 0, 8);
    for (long long i = 0; i < RestW; ++i)
    {
        Bits[i] = *pInputData == ThresholdValue ? 1 << i : 0;
        pInputData++;
    }
    *pOutputByte++ = Bits[0] | Bits[1] | Bits[2] | Bits[3] | Bits[4] | Bits[5] | Bits[6] | Bits[7];
}

What happened here?

Instead of working on one variable I used eight different variables where we store the result of the condition. However, there’s still a problem when doing that large OR. For now, I don’t know how to improve it. Maybe you know some tricks? (without using SIMD instructions…)

Results

Was I right with this approach of using more variables? Let’s see some evidence!

Intel i7 4720HQ, 12GB Ram, 512 SSD, Windows 10. Visual Studio 2017, 32bit

The optimized version (using separate variables) is roughly 5x faster that bitset and almost 3.5x faster than the first manual version!

The chart:

As it appeared, there’s also at least one more reason why the optimized version is faster. You can read more in another post: Curious case of branch performance. Basically the first version has branches while the optimized can use conditional move instructions - and in this case that improves perf.

Summary

Even such simple sounding problem caused me some problems when implementing (hopefully) correct benchmark! Initially I’ve chosen bitset as the baseline, but it’s much better to see no packing version. Now you can see that packing actually can slow things down (when using wrong data structures). My manual version seems to be a bit better - you can potentially save 7/8 of the required memory space, at pack data almost 20…30% faster than no packing version.

Without looking at the traces, profiles I optimized my first version by using more variables to compute the conditions. That way there was less data dependency and CPU could perform better.

Next time I’ll try to parallelize the code. How about using more threads or vector instructions? For example, I’ve found a really interesting instruction called: _mm_movemask_epi8… See you next week.

Code on github: fenbf/celeroTest/celeroCompressBools.cpp

I would be grateful if you could run the samples and provide me with your results! Let me know, so I can even provide you with the binaries for Windows.