Scott Wolchok

C++ Performance Trap #1: Constant-size std::vector

Posted Jan 14, 2021

C++ Performance Traps Series Introduction

The C++ standard library has many classes and functions that are easy to use and relatively safe. However, in situations where performance and efficiency really matter, straightforward use of the standard library may not always be the best choice. This post is the first in a series that will catalogue some of those opportunities for improvement. It is meant to be accessible to people who are not C++ gurus or language lawyers.

In each post, we will look at a piece of the C++ standard library that can cause performance problems and how we might be able to do better. Although the ability to read x86 assembly language is not required, I will provide links to code samples and their corresponding generated assembly on Compiler Explorer to explain why I give the advice I give. I will assume a 64-bit system and focus on x86_64, but most advice should apply to ARM64 as well.

C++ Performance Trap #1: Constant-size std::vector

Common advice is to prefer std::vector to C arrays. This is all well and good when we want to grow the vector later. If we don’t need to do that and performance is critical, we can often do better for two reasons.

std::vector Representation

First of all, std::vector is represented with three pointers: one to the first element, one just past the last element, and one to the end of the allocated space, like this:

template <typename T>
struct VectorRepr {
    T *begin;
    T *end;
    T *capacityEnd;
};

When the size of the vector won’t change, we will always have end == capacityEnd (or worse, wasted space), so we can eliminate the third pointer and save space by using a C array with a length (or end pointer) instead, like this:

template <typename T>
struct MyArray {
    std::unique_ptr<T[]> arr;
    int length;
};

This can make a big difference in overall memory usage (and thus performance) if this thing is a member of a core data structure.
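
To see the size difference concretely, here is a small sketch. The exact numbers are implementation-specific, but on mainstream 64-bit implementations (libstdc++, libc++) a std::vector really is three pointers, and the pointer-plus-int layout pads to two pointers' worth of space:

```cpp
#include <memory>
#include <vector>

template <typename T>
struct MyArray {
    std::unique_ptr<T[]> arr;  // pointer to first element
    int length;                // element count
};

// Three pointers: 24 bytes on a typical 64-bit platform.
static_assert(sizeof(std::vector<int>) == 3 * sizeof(void*), "");
// One pointer plus an int, padded: 16 bytes -- one third smaller.
static_assert(sizeof(MyArray<int>) == 2 * sizeof(void*), "");
```

If these static_asserts fire on your platform, the general point still holds: dropping the capacity pointer saves a pointer's worth of space per instance.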

std::vector Initialization Overhead

Second, the way we initialize our vector matters. We’ll go through two iterations of initializing a vector and then see what we can gain by using an array instead.

To give ourselves a concrete example to work with, let’s say that we’ll take a length N and return a vector containing the first N squares:

std::vector<int> makeSquares(int howMany);

A straightforward way to write this function is with reserve and push_back:

std::vector<int> makeSquaresPushBack(int howMany) {
    std::vector<int> result;
    result.reserve(howMany);
    for (int ii = 0; ii < howMany; ++ii) {
        result.push_back(ii * ii);
    }
    return result;
}

(This way would be especially compelling if we wanted to use emplace_back to construct new objects in place.)
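
For instance, with an element type that is constructed from several arguments, emplace_back builds the object directly in the vector's storage. (Point here is a made-up example type, not part of the running example.)

```cpp
#include <vector>

struct Point {
    int x;
    int y;
    Point(int x_, int y_) : x(x_), y(y_) {}
};

std::vector<Point> makePoints(int howMany) {
    std::vector<Point> result;
    result.reserve(howMany);
    for (int ii = 0; ii < howMany; ++ii) {
        // Constructs the Point in place from its arguments;
        // no temporary Point is created and then moved.
        result.emplace_back(ii, ii * ii);
    }
    return result;
}
```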

Let’s use Compiler Explorer to see the assembly that clang 11 generates for makeSquaresPushBack. We can see that our inner loop (labelled .LBB0_6) is checking to ensure that it hasn’t overrun the capacity of the vector, and the compiler has emitted code (specifically, the operator new, memmove, and operator delete calls) to reallocate if it has. Since we’ve used reserve, we’ll never actually exceed our capacity. So, we’ve bloated our generated code for no good reason.
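
We can verify that claim at runtime. The standard guarantees that after reserve(n), no reallocation happens until the size exceeds n, so the capacity never changes in this sketch; the compiler just can't prove that statically:

```cpp
#include <vector>

// Fills a reserved vector and reports whether any reallocation
// occurred along the way. Per the standard's guarantee for
// reserve, it never should.
bool pushBackReallocates(int howMany) {
    std::vector<int> v;
    v.reserve(howMany);
    std::size_t capacityAfterReserve = v.capacity();
    for (int ii = 0; ii < howMany; ++ii) {
        v.push_back(ii * ii);
    }
    // Capacity unchanged means no reallocation happened -- but the
    // reallocation path is still present in the generated code.
    return v.capacity() != capacityAfterReserve;
}
```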

What if we try to avoid this by using the vector constructor (item (4) in the linked reference) that creates it with the right size in the first place?

std::vector<int> makeSquaresPrealloc(int howMany) {
    std::vector<int> result(howMany);
    for (int ii = 0; ii < howMany; ++ii) {
        result[ii] = ii * ii;
    }
    return result;
}

Now the generated assembly looks different: we no longer have the capacity check, and our inner loop gets unrolled, which might be nice if we’re making big vectors. (It would also get vectorized if I hadn’t passed the -fno-vectorize option to make the output easier to read.) However, we still have an unnecessary call to memset near the start of the function. This happens because the vector constructor we used guarantees that elements are value-initialized (which, for int, means zeroed): we can’t create a vector of uninitialized space, even though we know that we are about to initialize the whole vector.

Here is how I might write this to create an array instead:

struct IntArray {
    std::unique_ptr<int[]> arr;
    int length;
};

IntArray makeSquaresArray(int howMany) {
    IntArray result = {std::unique_ptr<int[]>{new int[howMany]}, howMany};
    for (int ii = 0; ii < howMany; ++ii) {
        result.arr[ii] = ii * ii;
    }
    return result;
}

If we skim the generated assembly, we can see that it is shorter overall, it does not contain an initial call to memset, and the inner loop is very similar.

We do have one other trick available to us: if we know that the arrays are never going to exceed some set size (let’s say we know it’s 7), we can put them right inside our IntArray object instead of on the heap:

struct InlineIntArray {
    std::array<int, 7> arr;
    int size;
};

Now we save on both the heap allocation to create the array and the cache miss to access it. However, we have to be right about the upper bound on the size, and we really want to end up using most of the space most of the time. Whether this last step is worthwhile in practice depends on the specific application, so you’ll want to measure carefully before doing it.
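
A sketch of the squares function with the inline representation might look like this (makeSquaresInline is my name for it; the assert documents the precondition that howMany never exceeds our chosen bound of 7):

```cpp
#include <array>
#include <cassert>

struct InlineIntArray {
    std::array<int, 7> arr;
    int size;
};

InlineIntArray makeSquaresInline(int howMany) {
    assert(howMany <= 7);  // caller must respect the fixed upper bound
    InlineIntArray result;
    result.size = howMany;
    for (int ii = 0; ii < howMany; ++ii) {
        result.arr[ii] = ii * ii;  // writes straight into the object; no heap
    }
    return result;
}
```

Note that if howMany is usually much smaller than 7, most of the inline storage is wasted on every instance, which is exactly the trade-off the paragraph above warns about.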

Benchmarks

By far the top request on the r/cpp discussion of this article was benchmarks. u/IHateUsernames111 kindly copied the examples into Quick Bench (which I learned about today!); results are here. Quick Bench reports that the prealloc version is 1.8x faster than push_back, and the array version is 1.2x faster than prealloc.

I do want to mention that I didn’t include benchmarks originally because it is very easy for them to be misleading. For example, I haven’t verified whether the cost of destroying the vector or array is included in the benchmark measurement. Whether it should be included depends on the particular application.

Postscript: What if we used Rust?

I am a big fan of the Rust programming language, so I went ahead and translated the examples to Rust:

use std::convert::TryInto;

pub fn make_squares_push_back(how_many: i32) -> Vec<i32> {
    let mut result = Vec::with_capacity(how_many as usize);
    for idx in 0..how_many {
        result.push(idx * idx);
    }
    result
}

pub fn make_squares_prealloc(how_many: i32) -> Vec<i32> {
    let how_many: usize = how_many.try_into().unwrap();
    let mut result = Vec::with_capacity(how_many);
    result.resize(how_many, 0);
    for idx in 0..how_many {
        result[idx] = (idx * idx) as i32;
    }
    result
}

pub fn make_squares_functional(how_many: i32) -> Vec<i32> {
    (0..how_many).map(|x| x * x).collect()
}

pub fn make_squares_functional_array(how_many: i32) -> Box<[i32]> {
    (0..how_many).map(|x| x * x).collect()
}

The contrast was quite interesting. First, two notes on the translation:

Now, looking at the assembly: