Scott Wolchok
The C++ standard library has many classes and functions that are easy to use and relatively safe. However, in situations where performance and efficiency really matter, straightforward use of the standard library may not always be the best choice. This post is the first in a series that will catalogue some of those opportunities for improvement. It is meant to be accessible to people who are not C++ gurus or language lawyers.
In each post, we will look at a piece of the C++ standard library that can cause performance problems and how we might be able to do better. Although the ability to read x86 assembly language is not required, I will provide links to code samples and their corresponding generated assembly on Compiler Explorer to explain why I give the advice I give. I will assume a 64-bit system and focus on x86_64, but most advice should apply to ARM64 as well.
std::vector

Common advice is to prefer std::vector to C arrays. This is all well and good when we want to grow the vector later. If we don't need to do that and performance is critical, we can often do better for two reasons.
std::vector Representation

First of all, std::vector is represented with three pointers: one to the first element, one just past the last element, and one to the end of the allocated space, like this:
template <typename T>
struct VectorRepr {
  T *begin;
  T *end;
  T *capacityEnd;
};
When the size of the vector won't change, we will always have end == capacityEnd (or worse, wasted space), so we can eliminate the third pointer and save space by using a C array with a length (or end pointer) instead, like this:
template <typename T>
struct MyArray {
  std::unique_ptr<T[]> arr;
  int length;
};
This can make a big difference in overall memory usage (and thus performance) if such an object is a member of a core data structure.
std::vector Initialization Overhead

Second, the way we initialize our vector matters. We'll go through two iterations of initializing a vector and then see what we can gain by using an array instead.
To give ourselves a concrete example to work with, let's say that we'll take a length N and return a vector containing the first N squares:
std::vector<int> makeSquares(int howMany);
A straightforward way to write this function is with reserve and push_back:
std::vector<int> makeSquaresPushBack(int howMany) {
  std::vector<int> result;
  result.reserve(howMany);
  for (int ii = 0; ii < howMany; ++ii) {
    result.push_back(ii * ii);
  }
  return result;
}
(This way would be especially compelling if we wanted to use emplace_back to construct new objects in place.)

Let's use Compiler Explorer to see the assembly that clang 11 generates for this function. We can see that our inner loop (labelled .LBB0_6) is checking to ensure that it hasn't overrun the capacity of the vector, and the compiler has emitted code (specifically, the operator new, memmove, and operator delete calls) to reallocate if it has. Since we've used reserve, we'll never actually exceed our capacity. So, we've bloated our generated code for no good reason.
What if we try to avoid this by using the vector constructor (item (4) in the linked reference) that creates it with the right size in the first place?
std::vector<int> makeSquaresPrealloc(int howMany) {
  std::vector<int> result(howMany);
  for (int ii = 0; ii < howMany; ++ii) {
    result[ii] = ii * ii;
  }
  return result;
}
Now the generated assembly looks different: we no longer have the capacity check, and our inner loop gets unrolled, which might be nice if we're making big vectors. (It would also get vectorized if I hadn't passed the -fno-vectorize option to make the output easier to read.) However, we still have an unnecessary call to memset near the start of the function. This happens because the vector constructor we used guarantees that elements are value-initialized (which, for int, means initializing to zero): we can't create a vector of uninitialized space, even though we know that we are about to initialize the whole vector.
Here is how I might write this to create an array instead:
struct IntArray {
  std::unique_ptr<int[]> arr;
  int length;
};

IntArray makeSquaresArray(int howMany) {
  IntArray result = {std::unique_ptr<int[]>{new int[howMany]}, howMany};
  for (int ii = 0; ii < howMany; ++ii) {
    result.arr[ii] = ii * ii;
  }
  return result;
}
If we skim the generated assembly, we can see that it is shorter overall, it does not contain an initial call to memset, and the inner loop is very similar.
We do have one other trick available to us: if we know that the arrays are never going to exceed some set size (let's say we know it's 7), we can put them right inside our IntArray object instead of on the heap:
struct InlineIntArray {
  std::array<int, 7> arr;
  int size;
};
Now we save on both the heap allocation to create the array and the cache miss to access it. However, we have to be right about the upper bound on the size, and we really want to end up using most of the space most of the time. Whether this last step is worthwhile in practice depends on the specific application, so you’ll want to measure carefully before doing it.
By far the top request on the r/cpp discussion of this article was benchmarks. u/IHateUsernames111 kindly copied the examples into Quick Bench (which I learned about today!); results are here. Quick Bench reports that the prealloc version is 1.8x faster than push_back, and the array version is 1.2x faster than prealloc.
I do want to mention that I didn't include benchmarks originally because it is very easy for them to be misleading. For example, I haven't verified whether the cost of destroying the vector or array is included in the benchmark measurement. Whether it should be included depends on the particular application.
I am a big fan of the Rust programming language, so I went ahead and translated the examples to Rust:
use std::convert::TryInto;
pub fn make_squares_push_back(how_many: i32) -> Vec<i32> {
    let mut result = Vec::with_capacity(how_many as usize);
    for idx in 0..how_many {
        result.push(idx * idx);
    }
    result
}

pub fn make_squares_prealloc(how_many: i32) -> Vec<i32> {
    let how_many: usize = how_many.try_into().unwrap();
    let mut result = Vec::with_capacity(how_many);
    result.resize(how_many, 0);
    for idx in 0..how_many {
        result[idx] = (idx * idx) as i32;
    }
    result
}

pub fn make_squares_functional(how_many: i32) -> Vec<i32> {
    (0..how_many).map(|x| x * x).collect()
}

pub fn make_squares_functional_array(how_many: i32) -> Box<[i32]> {
    (0..how_many).map(|x| x * x).collect()
}
The contrast was quite interesting. First, two notes on the translation:
- Rust provides Vec::with_capacity, which is more convenient than calling Vec::new followed by reserve.
- Directly allocating a Box<[i32]> (i.e., a heap-allocated array) requires both nightly Rust and unsafe code, so I didn't do it.

Now, looking at the assembly:
- make_squares_push_back is similar to the C++ version: no unrolling and checks for exceeding capacity.
- make_squares_prealloc is a bit of a disappointment: we don't get loop unrolling, presumably because the compiler was unable to remove the bounds checks on our indexing into the Vec, even though we just called resize. We could remove the bounds checks ourselves using an unsafe block and get_unchecked_mut, but that seemed like going a bit too far.
- make_squares_functional is the clear winner for this use case. The code is short and clear, and the compiler also does the best job with it. The loop gets unrolled and we don't have capacity checks.
- make_squares_functional_array seems good enough, although it works by creating a Vec and then calling into_boxed_slice on it, so we don't get the absolute fastest possible code. Perhaps the FromIterator<I> impl for Box<[I]> will get improved once creating an uninitialized Box<[T]> doesn't require nightly Rust?