Why is inlining so important in C++? Clearly, it reduces function call overhead: if a function is inlined, there is no need to spend time setting up its arguments, jumping to it, creating a stack frame, and then undoing all that upon returning. More interestingly, though, inlining enables other compiler optimizations. In this article, I will show examples of constant propagation and loop-invariant code motion (LICM). Then, I will explain how inlining enables these optimizations to apply more widely and show an example of the consequences when that doesn’t happen.
Constant propagation turns references to variables whose value is known into uses of those constant values instead. For example, instead of a load from a memory location, the compiler may generate a use of an immediate value. Consider the following example:
#include <cstdio>
int nonConstNum = 23;
// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}
void printNonConst() {
    printNum(nonConstNum);
}
void printConst() {
    const int constNum = 42;
    printNum(constNum);
}
void printEffectivelyConst() {
    int effectivelyConstNum = 123;
    printNum(effectivelyConstNum);
}
void printInlineConstant() {
    printNum(123);
}
If we look at the generated assembly with -O3, we can see that printNonConst is loading from memory, because nonConstNum really isn’t constant: some other file in the program may declare extern int nonConstNum; and mutate it. Every other example is using a mov with an immediate to set the argument to printNum, because the compiler can see that their values are constants.
Loop-invariant code motion, as its name suggests, moves expressions that are constant across iterations of a loop (hence “loop-invariant”) outside the loop. This is sometimes referred to as “hoisting” code, because the code is moved “above” the loop body. For example:
#include <span>
void multiplyArrayByTwoConstants(std::span<int> arr, int x, int y) {
    for (int& element: arr) {
        element *= x * y;
    }
}
The expression x * y does not change throughout the loop, so the
compiler moves that multiplication to the start of the generated
assembly
(see line 4) to avoid calculating it repeatedly.
Inlining enables compiler optimizations like constant propagation and LICM to be performed even in the presence of function or method calls. Let’s add a toy abstraction to our constant propagation example:
#include <cstdio>
class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept {
    return val;
  };
 private:
  int val;
};
MyInt nonConstNum(23);
// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}
void printNonConst() {
    printNum(nonConstNum.get());
}
void printConst() {
    const MyInt constNum(42);
    printNum(constNum.get());
}
void printEffectivelyConst() {
    MyInt effectivelyConstNum(123);
    printNum(effectivelyConstNum.get());
}
void printInlineConstant() {
    printNum(MyInt(123).get());
}
If we look at it side-by-side with the original
code,
we can see that we get the same assembly output despite adding the
MyInt::get() function.
But what if we aren’t able to inline MyInt::get()? For example, it
might be implemented in a different .cpp file:1
#include <cstdio>
class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept;
 private:
  int val;
};
MyInt nonConstNum(23);
// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}
void printNonConst() {
    printNum(nonConstNum.get());
}
void printConst() {
    const MyInt constNum(42);
    printNum(constNum.get());
}
void printEffectivelyConst() {
    MyInt effectivelyConstNum(123);
    printNum(effectivelyConstNum.get());
}
void printInlineConstant() {
    printNum(MyInt(123).get());
}
Now, the generated assembly is much
worse:
in all cases, we have to construct a MyInt on the stack and make an
actual function call to MyInt::get().
What about if MyInt::get() is virtual but inline? With our example as written, that is fine and the compiler can still inline it because it knows that the actual type of our object is MyInt.
If, however, we happen to have a pointer or reference to our MyInt, like this:
#include <cstdio>
class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  virtual int get() const noexcept {
    return val;
  };
 private:
  int val;
};
MyInt nonConstNum(23);
// noinline to make the other print functions'
// assembly easy to read.
__attribute__((noinline))
void printNum(int num) {
    std::printf("%d\n", num);
}
void printArgByValue(MyInt num) {
    printNum(num.get());
}
void printArgByConstReference(const MyInt& num) {
    printNum(num.get());
}
then we have to make the actual virtual call because somebody may have passed us an instance of a subclass of MyInt instead.
Similar considerations hold for LICM. Here is our previous LICM
example with MyInt:
class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept {
    return val;
  };
 private:
  int val;
};
void multiplyArrayByTwoConstants(std::span<int> arr, MyInt x, MyInt y) {
    for (int& element: arr) {
        element *= x.get() * y.get();
    }
}
and again we get the same assembly thanks to inlining.
Supposing MyInt::get() is in another .cpp file:2
#include <span>
class MyInt {
 public:
  explicit MyInt(int x) : val(x) {}
  int get() const noexcept;
 private:
  int val;
};
void multiplyArrayByTwoConstants(std::span<int> arr, MyInt x, MyInt y) {
    for (int& element: arr) {
        element *= x.get() * y.get();
    }
}
Now the generated
assembly
has calls to MyInt::get() in the loop body. Note that it does not
matter that MyInt::get() is const and thus “shouldn’t” change
across loop iterations! For example, it is legal to use
const_cast
in the body of MyInt::get() to cast away const and mutate val so
long as it never gets called on a truly-const MyInt instance
(which would trigger undefined behavior), so the compiler cannot
assume that that doesn’t happen.3
One classic example where this can come up is when using strlen()
and array notation to iterate over C
strings.4 If the compiler doesn’t apply LICM,
functions that appear to take linear time can actually take quadratic
time.
Let’s check if all characters in a non-null C-string are uppercase:
#include <cctype>
#include <cstring>
bool isUpperCase(const char *s) {
  for (int ii = 0; ii < strlen(s); ++ii) {
    if (!isupper(s[ii])) {
      return false;
    }
  }
  return true;
}
This code calls strlen on every iteration to determine the length of
the string! However, we can
see
that the compiler knows to treat strlen specially and applies LICM:
strlen is called only once, at the start of the loop. If we use a
custom myStringLength function
instead,
we have no such luck: myStringLength is called on each iteration.
We can even run into this problem with strlen itself. Consider this
code that converts a non-null C string to uppercase:
#include <cctype>
#include <cstring>
void toUpperCase(char *s) {
  for (int ii = 0; ii < std::strlen(s); ++ii) {
    s[ii] = std::toupper(s[ii]);
  }
}
We just learned about the special LICM case for strlen, but
strlen is still being called on every loop
iteration!
The problem is that the compiler doesn’t figure out that toupper
will never return the NUL character and thus change the string length,
so it must repeatedly call strlen to preserve correctness in case
the length has changed.5
We can fix the issue ourselves by performing LICM manually and calling
strlen once outside the loop:
#include <cctype>
#include <cstring>
void toUpperCase(char *s) {
  const auto len = std::strlen(s);
  for (int ii = 0; ii < len; ++ii) {
    s[ii] = std::toupper(s[ii]);
  }
}
We can also just use pointer notation:
#include <cctype>
#include <cstring>
void toUpperCase(char *s) {
  while (*s) {
    *s = std::toupper(*s);
    s++;
  }
}
(If you’d like to see a demonstration, I put benchmarks up on Quick Bench.)
It is starting to seem like the moral of our story is “make everything inline in a header”.6 While this might help with performance on balance, it would likely also come at a heavy cost to build times, and might tickle pathological compiler behaviors because huge files aren’t their primary use case. For example, an engineer on Chromium found a super-linear relationship between token count and compile time. Instead, the fact that inlining can help enable other classical compiler optimizations is just one more thing to be aware of as a software developer.
I am ignoring the possibility of
LTO or
thinLTO inlining
MyInt::get anyway. Perhaps LTO is off because it makes compile
times too long, and/or maybe thinLTO happens not to inline this
particular function. ↩︎
Again, ignoring the possibility of it being inlined by LTO or thinLTO. ↩︎
This is a reason that the pointer notation probably ought to be preferred if you have to work with C strings, but I’m ignoring that for the sake of the example. ↩︎
If you are thinking that the compiler should
just know a lot more about toupper, imagine a custom
character-modifying function that isn’t available for inlining. ↩︎
Note that this is a separate question from
“should the compiler inline all functions whose bodies it can
see?”, because inlining decisions are up to the compiler, even
for functions declared with the inline keyword! See the
explanation on the cppreference page about
inline for
more details. In short, inline means “don’t give me errors about
violating ODR for defining this more than once”, not “you must
inline this”.
(For what it’s worth, the compiler clearly should not inline all
functions. This would cause massive code bloat, resulting in huge
executables and much more pressure on the CPU’s limited
instruction cache
space.) ↩︎