Scott Wolchok

The Sad Truth About C++ Copy Elision

Posted at — Apr 3, 2021

Copy elision is a C++ compiler optimization that, as its name suggests, eliminates extra copy and move operations. It is similar to the classical copy propagation optimization, but specifically performed on C++ objects that may have non-trivial copy and move constructors. In this post, I’ll walk through an example where an obvious optimization you might expect from your compiler doesn’t actually happen in practice.

Introducing an extra variable to break up a long line

Let’s say that you have a long function call that returns an object, and you want to immediately pass that object to another function, like this:

#include <string>
#include <string_view>
#include <utility>  // std::move, used in a later version of someFunction

// Some type that is expensive to copy, non-trivial to destroy, and cheap but
// not free to move.
struct Widget {
  std::string s;
};

void consume(Widget w);

Widget doSomeVeryComplicatedThingWithSeveralArguments(
  int arg1, std::string_view arg2);

void someFunction() {
    consume(doSomeVeryComplicatedThingWithSeveralArguments(123, "hello"));
}

As we can see from the generated assembly, all is well:

someFunction():                      # @someFunction()
        pushq   %rbx
        subq    $32, %rsp
        movq    %rsp, %rbx
        movl    $5, %edx
        movl    $.L.str, %ecx
        movq    %rbx, %rdi
        movl    $123, %esi
        callq   doSomeVeryComplicatedThingWithSeveralArguments(int, std::basic_string_view<char, std::char_traits<char> >)
        movq    %rbx, %rdi
        callq   consume(Widget)
        movq    (%rsp), %rdi
        leaq    16(%rsp), %rax
        cmpq    %rax, %rdi
        je      .LBB0_2
        callq   operator delete(void*)
.LBB0_2:
        addq    $32, %rsp
        popq    %rbx
        retq
.L.str:
        .asciz  "hello"

Our temporary Widget returned from doSomeVeryComplicatedThingWithSeveralArguments is constructed in the stack space that someFunction allocated for it, and then a pointer to that stack space is passed straight to consume, just as we should expect from what we learned previously about parameter passing.
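One way to observe the same thing without reading assembly is to have an object remember the address at which it was constructed. The sketch below is only an illustration: Probe, makeProbe, arrivedInPlace, and passDirectly are hypothetical names that do not appear in the original example, and it assumes C++17 or later, where initializing a by-value parameter from a function's return value has to happen in place.

```cpp
#include <string>
#include <utility>

// Hypothetical probe type (not from the post): remembers where it was first
// constructed. The std::string member keeps it non-trivially copyable, like
// Widget, so it cannot be passed in registers.
struct Probe {
    const void* bornAt;
    std::string payload;

    Probe() : bornAt(this), payload("hello") {}
    // Copies and moves carry the original address along, so a mismatch
    // between bornAt and the current address reveals that one happened.
    Probe(const Probe& other) : bornAt(other.bornAt), payload(other.payload) {}
    Probe(Probe&& other) noexcept
        : bornAt(other.bornAt), payload(std::move(other.payload)) {}
};

// Stand-in for doSomeVeryComplicatedThingWithSeveralArguments.
Probe makeProbe() { return Probe(); }

// Stand-in for consume: reports whether its by-value parameter is the very
// object that was originally constructed.
bool arrivedInPlace(Probe p) { return p.bornAt == &p; }

// Mirrors someFunction: the call result is passed straight through.
bool passDirectly() { return arrivedInPlace(makeProbe()); }
```

Under those assumptions, passDirectly() should return true: the object built inside makeProbe is the same object that arrivedInPlace receives as its parameter.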

Now, imagine that you decide that your single line in someFunction is too long, or that you want to give a meaningful name to the result of doSomeVeryComplicatedThingWithSeveralArguments, so you change the code:

void someFunctionV2() {
    auto complicatedThingResult =
        doSomeVeryComplicatedThingWithSeveralArguments(123, "hello");
    consume(complicatedThingResult);
}

Naturally, things go straight off the rails:

someFunctionV2():                    # @someFunctionV2()
        pushq   %r15
        pushq   %r14
        pushq   %r12
        pushq   %rbx
        subq    $72, %rsp
        leaq    40(%rsp), %rdi
        movl    $5, %edx
        movl    $.L.str, %ecx
        movl    $123, %esi
        callq   doSomeVeryComplicatedThingWithSeveralArguments(int, std::basic_string_view<char, std::char_traits<char> >)
        leaq    24(%rsp), %r12
        movq    %r12, 8(%rsp)
        movq    40(%rsp), %r14
        movq    48(%rsp), %rbx
        movq    %r12, %r15
        cmpq    $16, %rbx
        jb      .LBB1_4
        testq   %rbx, %rbx
        js      .LBB1_13
        movq    %rbx, %rdi
        incq    %rdi
        js      .LBB1_14
        callq   operator new(unsigned long)
        movq    %rax, %r15
        movq    %rax, 8(%rsp)
        movq    %rbx, 24(%rsp)
.LBB1_4:
        testq   %rbx, %rbx
        je      .LBB1_8
        cmpq    $1, %rbx
        jne     .LBB1_7
        movb    (%r14), %al
        movb    %al, (%r15)
        jmp     .LBB1_8
.LBB1_7:
        movq    %r15, %rdi
        movq    %r14, %rsi
        movq    %rbx, %rdx
        callq   memcpy
.LBB1_8:
        movq    %rbx, 16(%rsp)
        movb    $0, (%r15,%rbx)
        leaq    8(%rsp), %rdi
        callq   consume(Widget)
        movq    8(%rsp), %rdi
        cmpq    %r12, %rdi
        je      .LBB1_10
        callq   operator delete(void*)
.LBB1_10:
        movq    40(%rsp), %rdi
        leaq    56(%rsp), %rax
        cmpq    %rax, %rdi
        je      .LBB1_12
        callq   operator delete(void*)
.LBB1_12:
        addq    $72, %rsp
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        retq
.LBB1_13:
        movl    $.L.str.2, %edi
        callq   std::__throw_length_error(char const*)
.LBB1_14:
        callq   std::__throw_bad_alloc()
.L.str:
        .asciz  "hello"

.L.str.2:
        .asciz  "basic_string::_M_create"

Now we take our perfectly good Widget, complicatedThingResult, and copy it into a new temporary Widget to serve as the first argument to consume; that copy is the operator new and memcpy sequence in the listing above. When we’re done, we have to destroy two Widgets: both complicatedThingResult and the unnamed temporary Widget we passed to consume. You might expect that the compiler would optimize someFunctionV2 to be just like someFunction, but it won’t.

The problem, of course, is that we forgot to std::move complicatedThingResult:

void someFunctionV3() {
    auto complicatedThingResult =
        doSomeVeryComplicatedThingWithSeveralArguments(123, "hello");
    consume(std::move(complicatedThingResult));
}

and now, the generated assembly should look just like our original example… wait, what?

someFunctionV3():                    # @someFunctionV3()
        pushq   %r14
        pushq   %rbx
        subq    $72, %rsp
        leaq    8(%rsp), %rdi
        movl    $5, %edx
        movl    $.L.str, %ecx
        movl    $123, %esi
        callq   doSomeVeryComplicatedThingWithSeveralArguments(int, std::basic_string_view<char, std::char_traits<char> >)
        leaq    56(%rsp), %r14
        movq    %r14, 40(%rsp)
        movq    8(%rsp), %rax
        leaq    24(%rsp), %rbx
        cmpq    %rbx, %rax
        je      .LBB1_1
        movq    %rax, 40(%rsp)
        movq    24(%rsp), %rax
        movq    %rax, 56(%rsp)
        jmp     .LBB1_3
.LBB1_1:
        movups  (%rax), %xmm0
        movups  %xmm0, (%r14)
.LBB1_3:
        movq    16(%rsp), %rax
        movq    %rax, 48(%rsp)
        movq    %rbx, 8(%rsp)
        movq    $0, 16(%rsp)
        movb    $0, 24(%rsp)
        leaq    40(%rsp), %rdi
        callq   consume(Widget)
        movq    40(%rsp), %rdi
        cmpq    %r14, %rdi
        je      .LBB1_5
        callq   operator delete(void*)
.LBB1_5:
        movq    8(%rsp), %rdi
        cmpq    %rbx, %rdi
        je      .LBB1_7
        callq   operator delete(void*)
.LBB1_7:
        addq    $72, %rsp
        popq    %rbx
        popq    %r14
        retq
.L.str:
        .asciz  "hello"

We still have two Widgets; the only difference is that the temporary argument to consume is now move-constructed instead of copy-constructed. Our first version of someFunction is still smaller and faster!
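The difference between the three versions can also be counted directly. The following is a minimal sketch rather than the code that produced the listings above: CountingWidget, makeWidget, consumeWidget, Tally, measure, and the g_ counters are all hypothetical names introduced for illustration, and the counts in the comments assume C++17 or later.

```cpp
#include <string>
#include <utility>

// Global tallies of special member function calls (illustration only).
int g_copies = 0;
int g_moves = 0;
int g_destructions = 0;

// Hypothetical instrumented stand-in for Widget.
struct CountingWidget {
    std::string s;

    explicit CountingWidget(std::string value) : s(std::move(value)) {}
    CountingWidget(const CountingWidget& other) : s(other.s) { ++g_copies; }
    CountingWidget(CountingWidget&& other) noexcept : s(std::move(other.s)) {
        ++g_moves;
    }
    ~CountingWidget() { ++g_destructions; }
};

// Stand-ins for doSomeVeryComplicatedThingWithSeveralArguments and consume.
CountingWidget makeWidget() { return CountingWidget("hello"); }
void consumeWidget(CountingWidget) {}

struct Tally {
    int copies;
    int moves;
    int destructions;
};

// Resets the counters, runs body, and reports what happened during it.
template <typename Body>
Tally measure(Body body) {
    g_copies = g_moves = g_destructions = 0;
    body();
    return Tally{g_copies, g_moves, g_destructions};
}

// Like someFunction: 0 copies, 0 moves, 1 destruction.
Tally v1() {
    return measure([] { consumeWidget(makeWidget()); });
}

// Like someFunctionV2: 1 copy, 0 moves, 2 destructions.
Tally v2() {
    return measure([] {
        CountingWidget result = makeWidget();
        consumeWidget(result);
    });
}

// Like someFunctionV3: 0 copies, 1 move, 2 destructions.
Tally v3() {
    return measure([] {
        CountingWidget result = makeWidget();
        consumeWidget(std::move(result));
    });
}
```

The second destruction in v2 and v3 is the named local itself, which has to stay alive until the end of its scope even after its contents have been copied or moved away.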

So what’s going on here?

The fundamental problem with copy elision is that it is only allowed in a specific list of circumstances. (Briefly: since C++17, RVO (return value optimization) and initializing an object from a prvalue are required; NRVO (named return value optimization) is allowed but not required; and a few other cases involving exceptions and coroutines are also allowed. Nothing else.) There is a philosophical reason for this: you wrote a copy constructor for your class that could do anything, and you expect it to run whenever objects of your class are copied according to the rules of C++. If compilers were to unpredictably remove copies, and thus remove pairs of copy/move constructor & destructor calls, they might break your code.
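To make the return-statement entries on that list concrete, here is a small sketch with a hypothetical Tracked type and helper functions, none of which come from the original example. It assumes C++17 or later. Returning a function parameter is included as a case that is not on the list: the compiler may not elide it, although it does treat the return as a move.

```cpp
#include <string>
#include <utility>

// Running totals of copy and move constructor calls (illustration only).
int g_copyCount = 0;
int g_moveCount = 0;

// Hypothetical instrumented type.
struct Tracked {
    std::string s;

    Tracked() : s("hello") {}
    Tracked(const Tracked& other) : s(other.s) { ++g_copyCount; }
    Tracked(Tracked&& other) noexcept : s(std::move(other.s)) { ++g_moveCount; }
};

// Returning a prvalue: no copy or move is performed (required since C++17).
Tracked returnPrvalue() { return Tracked(); }

// Returning a named local: NRVO is allowed here but not required, so this
// costs either nothing or one move, depending on the compiler and flags.
Tracked returnNamedLocal() {
    Tracked t;
    t.s += "!";
    return t;
}

// Returning a by-value parameter: elision is not permitted for this case,
// so the return value is move-constructed from t.
Tracked returnParameter(Tracked t) { return t; }

void resetCounts() { g_copyCount = g_moveCount = 0; }

// Each helper returns the number of copies plus moves for one scenario.
int transfersForPrvalue() {
    resetCounts();
    Tracked r = returnPrvalue();
    (void)r;
    return g_copyCount + g_moveCount;  // 0
}

int transfersForNamedLocal() {
    resetCounts();
    Tracked r = returnNamedLocal();
    (void)r;
    return g_copyCount + g_moveCount;  // 0 with NRVO, otherwise 1
}

int transfersForParameter() {
    resetCounts();
    Tracked r = returnParameter(Tracked());
    (void)r;
    return g_copyCount + g_moveCount;  // 1 (a move)
}
```

The named-local case is the only one whose result can vary between compilers and optimization settings, which is why its comment gives a range rather than a single number.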

Specifically, there is simply nothing on the list of allowed circumstances for copy elision that applies to the examples we saw here. That list doesn’t include things like “the last time I use a variable before it goes out of scope” or “passing a variable to a function by value when I haven’t done anything else with it and it looks obviously safe”. Maybe it will in the future, but not in C++20 or before!