Last Update:
Randomizing lines from a file, a tool in VC++ 2017
Table of Contents
Is C++ well suited for writing fast small utilities/tools?
Let’s see:
For my recent giveaway I needed a tool that would take an input file - CSV with data and then draw a few winners from all of the entries. To make things more complicated each entry might have a different weight.
Read more for the full description, my solution and experiments.
The Problem
In short:
I have all of the entries to the giveaway (in a file), I need to pick (randomly) 3 winners.
More details:
All of the entries can be exported into CSV. The file has the following structure:
The field Entries
is actually a weight of a user. 1 is the default. If
you see weight like 10 it means that the user is 10x likely to win than
a user with entries = 1
.
How to randomize such structure and pick the winners?
For a start we can load all of the lines (skip the header), then randomize/shuffle the entries and just look at first N (in our case 3) top entries.
The problem is the weight parameter.
In our case, It might be relatively easy as the weight is not a double
or float… it’s just an integer value. What we can do is to duplicate the
entries as many times as the eight
param suggests.
For example:
If there’s a weight of 2 we need to have 2 copies of a given entry.
Then we can shuffle such structure and then users with weights > 1
should have bigger chance to win.
The Solution
I don’t know about you, but I usually didn’t memorize code to load/process a file… but I remembered where could I get some basic parts for my project.
Some time ago there was a post from Meeting C++: Randomizing a CSV File with Standard C++.
Sounds like a good start for me… right?
I didn’t hesitate to take some parts of it and I started my project.
The repo: fenbf/RandomMachine.
As you might expect, it’s not a huge project! I’m not using any super advanced data structures, class hierarchies or complex systems. The tool should work and should be relatively fast to write.
What’s the overall structure?
Let’s have a look at main
:
int main(int argc, const char *argv[])
{
try
{
const auto inputParams = ReadInputParams(argc, argv);
const auto vecLineEntries = ReadAllLines(inputParams);
auto vecIdWithWeights = BuildFromEntries(vecLineEntries);
ShuffleWithMT19937(begin(vecIdWithWeights), end(vecIdWithWeights));
const auto vecWinners = DrawWinners(vecLineEntries,
vecIdWithWeights,
inputParams.numElementsToPick);
ShowWinners(vecWinners, vecLineEntries, inputParams);
}
catch (...)
{
}
return 0;
}
Core parts:
- It’s a command line tool, there are several parameters:
- inputFileName
- UsersToPick
- LinesToSkip
- Column ID of the weight param
- Column separator
- I know that I should be using Boost.ProgramOptions, but this time I wrote my own simple parsing.
ReadAllLines
will open a file and parse it. It produces a vector of all entries. Each entry has a string - with the line text and also a weight param (by default it’s one).- Based on entries we build an additional index vector. Here each line
entry is expanded/duplicated based on the weight param. But it’s
only id, not the full copy.
- For example if there are two entries, with weights of 2 and 3,
we’ll get something like
{0, 0, 1, 1, 1 }
.
- For example if there are two entries, with weights of 2 and 3,
we’ll get something like
- The index vector is then shuffled
- After shuffling we can just look at the top N entries. But we need
to take care of drawing only unique winners. Due to the weight it
might happen that we pick the same user twice… or more. So if that
happens we just look at another entry.
- As noted in one of the comments: shuffling is probably not optimal solution. A better way would be to just pick one element at random, then mark it and then draw another one (until I reach number of winners). It doesn’t matter when number of lines/winners are relatively small (like up to 100), but when number of winners (elements to pick) is small and number of lines is larger then that’s much better choice.
- After we draw from the collection, we just have to present it.
- Simplified error handling - using exceptions - is added.
- I tired to be
const
correct and usedconst
whenever possible: Please declare your variables as const.
Interesting Parts
In terms of details let’s see how the final function of drawing is built:
vector<LineEntry>
DrawWinners(const vector<LineEntry>& vecInputLines,
const vector<int>& vecWeightedIndices,
unsigned int numElementsToPick)
{
unsigned int winnersFound = 0;
std::set<int> setWinners;
std::vector<LineEntry> outVec;
for (unsigned int i = 0;
winnersFound < numElementsToPick && i < vecWeightedIndices.size();
++i)
{
const auto index = vecWeightedIndices[i];
const auto &entry = vecInputLines[index];
if (setWinners.find(index) == setWinners.end())
{
setWinners.insert(index);
outVec.push_back(entry);
winnersFound++;
}
}
return outVec;
}
So the above code is responsible for drawing top N entries using
randomized index vector. The shuffling part is done before the call to
the function. The only little complication is to avoid duplicates of
winners. I am using a separate set
to mark if a entry is already a
winner or not.
Then we just need to output the selected lines.
What are other interesting parts in terms of C++ and Visual Studio?
Modern C++
What’s used from modern C++?
- auto wherever possible
- non static data member initialization
- uniform initialization
- random: std::random_shuffle is deprecated in C++14 - Meeting
C++,
and since I got that randomizing code from Meeting C++ it already
used
mt19937
. The only thing I did was to wrap shuffling into a simple template function:
template <typename ItRandom>
void ShuffleWithMT19937(ItRandom itFirst, ItRandom itLast)
{
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(itFirst, itLast, g);
}
string_view
- have a look at this great post: string_view | Growing up. I was able to use a fewstring_views
across the whole code. But I need to get used to it.- So far I’ve noticed that there’s a problem with numeric
conversions directly from a
string_view
. So I needed to make a copy of the string first and then do the conversion (like callingstd::atoi
).
- So far I’ve noticed that there’s a problem with numeric
conversions directly from a
- For each loops.
- move semantics, returning by value, not by output ref/ptr parameter (with also a chance of using Copy Elision).
Visual Studio 2017
With Visual Studio 2017 it’s really easy to write such code. The whole IDE works just better, faster. A lot of tool - even basic refactoring - are there.
For more about VS you can read:
- What’s New for Visual C++ in Visual Studio | Microsoft Docs
- What’s new in Visual Studio 2017 for C++ Development - Marius Bancila
I was happy to see that writing unit tests for native code is as simple as for managed languages. The native unit testing framework makes life so much easier. It just works!
Todo / Experiments
Want to know the best about such pet projects?
You can experiment with it!
How about adding Modules?
In Visual Studio 2017 there’s already early module support. See here
Using C++ Modules in Visual Studio 2017 | Visual C++ Team
Blog.
There’s std.core
that brings the Standard Library, so that should work
with my sample.
What’s more to add?
I definitely need to add more unit tests, as currently maybe like 50% of code is covered. And not all edge cases are included. Native unit testing framework is really super simple to use.
Soon VC++ 2017.3 will be released (there already preview), and we should get the following big features from C++17:
- structured bindings
- guaranteed copy elision
- constexpr if-statements
- Selection statements with initializers
For more look here: C++17 Features In Visual Studio 2017 Version 15.3 Preview
It would be great to use structured bindings and selection statements with initializer, just to see how they work in such simple code.
Possibly, if I try hard, I could even came up with an example for
constexpr-if
.
Any suggestions how to improve my amazing project? :)
Summary
I hope you already know that C++ is also good for writing small utilities.
Maybe such project would be simpler or smaller in Python or C#? I don’t know… but I don’t expect to see a huge difference. I didn’t use explicit memory management, only standard containers, basic exception handling. So the whole app should be quite safe and won’t leak.
Do you write simple tools in C++ or use some different language?
Any suggestions how I could refactor improve the code?
Maybe you have a better solution?
I've prepared a valuable bonus if you're interested in Modern C++!
Learn all major features of recent C++ Standards!
Check it out here: