Inlining and Compiler Optimizations

Posted at — Jan 23, 2021

Why is inlining so important in C++? Clearly, it reduces function call overhead: if a function is inlined, there is no need to spend time setting up its arguments, jumping to it, creating a stack frame, and then undoing all that upon returning. More interestingly, though, inlining enables other compiler optimizations. In this article, I will show examples of constant propagation and loop-invariant code motion (LICM). Then, I will explain how inlining enables these optimizations to apply more widely and show an example of the consequences when that doesn’t happen.

Constant propagation

Constant propagation turns references to variables whose value is known into uses of those constant values instead. For example, instead of a load from a memory location, the compiler may generate a use of an immediate value. Consider the following example:

#include <cstdio>

int nonConstNum = 23;

// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}

void printNonConst() {
    printNum(nonConstNum);
}

void printConst() {
    const int constNum = 42;
    printNum(constNum);
}

void printEffectivelyConst() {
    int effectivelyConstNum = 123;
    printNum(effectivelyConstNum);
}

void printInlineConstant() {
    printNum(123);
}

If we look at the generated assembly with -O3, we can see that printNonConst is loading from memory, because nonConstNum really isn’t constant: some other file in the program may declare extern int nonConstNum; and mutate it. Every other example is using a mov with an immediate to set the argument to printNum, because the compiler can see that their values are constants.

Loop-invariant code motion

Loop-invariant code motion, as its name suggests, moves expressions that are constant across iterations of a loop (hence “loop-invariant”) outside the loop. This is sometimes referred to as “hoisting” code, because the code is moved “above” the loop body. For example:

#include <span>

void multiplyArrayByTwoConstants(std::span<int> arr, int x, int y) {
    for (int& element: arr) {
        element *= x * y;
    }
}

The expression x * y does not change throughout the loop, so the compiler moves that multiplication to the start of the generated assembly (see line 4) to avoid calculating it repeatedly.

Inlining and constant propagation

Inlining enables compiler optimizations like constant propagation and LICM to be performed even in the presence of function or method calls. Let’s add a toy abstraction to our constant propagation example:

#include <cstdio>

class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept {
    return val;
  };
 private:
  int val;
};

MyInt nonConstNum(23);

// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}

void printNonConst() {
    printNum(nonConstNum.get());
}

void printConst() {
    const MyInt constNum(42);
    printNum(constNum.get());
}

void printEffectivelyConst() {
    MyInt effectivelyConstNum(123);
    printNum(effectivelyConstNum.get());
}

void printInlineConstant() {
    printNum(MyInt(123).get());
}

If we look at it side-by-side with the original code, we can see that we get the same assembly output despite adding the MyInt::get() function.

But what if we aren’t able to inline MyInt::get()? For example, it might be implemented in a different .cpp file:¹

#include <cstdio>

class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept;
 private:
  int val;
};

MyInt nonConstNum(23);

// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}

void printNonConst() {
    printNum(nonConstNum.get());
}

void printConst() {
    const MyInt constNum(42);
    printNum(constNum.get());
}

void printEffectivelyConst() {
    MyInt effectivelyConstNum(123);
    printNum(effectivelyConstNum.get());
}

void printInlineConstant() {
    printNum(MyInt(123).get());
}

Now, the generated assembly is much worse: in all cases, we have to construct a MyInt on the stack and make an actual function call to MyInt::get().

What about if MyInt::get() is virtual but inline? With our example as written, that is fine and the compiler can still inline it because it knows that the actual type of our object is MyInt.

If, however, we happen to have a pointer or reference to our MyInt, like this:

#include <cstdio>

class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  virtual int get() const noexcept {
    return val;
  };
 private:
  int val;
};

MyInt nonConstNum(23);

// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}

void printArgByValue(MyInt num) {
    printNum(num.get());
}

void printArgByConstReference(const MyInt& num) {
    printNum(num.get());
}

then we have to make the actual virtual call because somebody may have passed us an instance of a subclass of MyInt instead.

Inlining and LICM

Similar considerations hold for LICM. Here is our previous LICM example with MyInt:

class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept {
    return val;
  };
 private:
  int val;
};

void multiplyArrayByTwoConstants(std::span<int> arr, MyInt x, MyInt y) {
    for (int& element: arr) {
        element *= x.get() * y.get();
    }
}

and again we get the same assembly thanks to inlining.

Supposing MyInt::get() is in another .cpp file:²

#include <span>

class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept;
 private:
  int val;
};

void multiplyArrayByTwoConstants(std::span<int> arr, MyInt x, MyInt y) {
    for (int& element: arr) {
        element *= x.get() * y.get();
    }
}

Now the generated assembly has calls to MyInt::get() in the loop body. Note that it does not matter that MyInt::get() is const and thus “shouldn’t” change across loop iterations! For example, it is legal to use const_cast in the body of MyInt::get() to cast away const and mutate val so long as it never gets called on a truly-const MyInt instance (which would trigger undefined behavior), so the compiler cannot assume that that doesn’t happen.³

An classic LICM failure in practice: strlen

One classic example where this can come up is when using strlen() and array notation to iterate over C strings.⁴ If the compiler doesn’t apply LICM, functions that appear to take linear time can actually take quadratic time.

Let’s check if all characters in a non-null C-string are uppercase:

#include <cctype>
#include <cstring>

bool isUpperCase(const char *s) {
  for (int ii = 0; ii < strlen(s); ++ii) {
    if (!isupper(s[ii])) {
      return false;
    }
  }
  return true;
}

This code calls strlen on every iteration to determine the length of the string! However, we can see that the compiler knows to treat strlen specially and applies LICM: strlen is called only once, at the start of the loop. If we use a custom myStringLength function instead, we have no such luck: myStringLength is called on each iteration.

We can even run into this problem with strlen itself. Consider this code that converts a non-null C string to uppercase:

#include <cctype>
#include <cstring>

void toUpperCase(char *s) {
  for (int ii = 0; ii < std::strlen(s); ++ii) {
    s[ii] = std::toupper(s[ii]);
  }
}

We just learned about the special LICM case for strlen, but strlen is still being called on every loop iteration! The problem is that the compiler doesn’t figure out that toupper will never return the NUL character and thus change the string length, so it must repeatedly call strlen to preserve correctness in case the length has changed.⁵

We can fix the issue ourselves by performing LICM manually and calling strlen once outside the loop:

#include <cctype>
#include <cstring>

void toUpperCase(char *s) {
  const auto len = std::strlen(s);
  for (int ii = 0; ii < len; ++ii) {
    s[ii] = std::toupper(s[ii]);
  }
}

We can also just use pointer notation:

#include <cctype>
#include <cstring>

void toUpperCase(char *s) {
  while (*s) {
    *s = std::toupper(*s);
    s++;
  }
}

(If you’d like to see a demonstration, I put benchmarks up on Quick Bench.)

Conclusion: Should everything be in a header to enable inlining?

It is starting to seem like the moral of our story is “make everything inline in a header”.⁶ While this might help with performance on balance, it would likely also come at a heavy cost to build times, and might tickle pathological compiler behaviors because huge files aren’t their primary use case. For example, an engineer on Chromium found a super-linear relationship between token count and compile time. Instead, the fact that inlining can help enable other classical compiler optimizations is just one more thing to be aware of as a software developer.

I am ignoring the possibility of LTO or thinLTO inlining MyInt::get anyway. Perhaps LTO is off because it makes compile times too long, and/or maybe thinLTO happens not to inline this particular function. ↩︎
Again, ignoring the possibility of it being inlined by LTO or thinLTO. ↩︎
Herb Sutter’s GotW #81 has more on this. ↩︎
This is a reason that the pointer notation probably ought to be preferred if you have to work with C strings, but I’m ignoring that for the sake of the example. ↩︎
If you are thinking that the compiler should just know a lot more about toupper, imagine a custom character-modifying function that isn’t available for inlining. ↩︎
Note that this is a separate question from “should the compiler inline all functions whose bodies it can see?”, because inlining decisions are up to the compiler, even for functions declared with the inline keyword! See the explanation on the cppreference page about inline for more details. In short, inline means “don’t give me errors about violating ODR for defining this more than once”, not “you must inline this”.
(For what it’s worth, the compiler clearly should not inline all functions. This would cause massive code bloat, resulting in huge executables and much more pressure on the CPU’s limited instruction cache space.) ↩︎

Scott Wolchok