Now that we know how to read assembly language, we can talk about how parameter passing works at the machine level and its consequences for writing faster code. We will focus on x86_64, but ARM64 works in a roughly similar way.
Let’s say you’ve written a large program, and in it you have two files:
// Something.h
int doSomething(int x, int y);
// OtherStuff.cpp
void doOtherStuff() {
// ...
doSomething(123, 456);
// ...
}
The fundamental question of parameter passing is this: at the assembly language level, what does doOtherStuff need to do to pass 123 and 456 to doSomething as x and y? There is no single machine instruction for “call this function with these arguments”; somebody had to decide how that would work and document the calling convention for x86_64!
On x86_64 Linux and macOS (among other OSes), these decisions follow the System V AMD64 ABI. If you really want to, you can stop reading now and go read that document instead. However, it is a long, technical specification and I’ve personally never read it straight through. In the rest of this article, I’ll walk through some example code, summarize how parameter passing works in each case, and talk through the implications for how you structure your code.
Consider this function that takes far too many arguments:
#include <cstdint>
int takePrimitives(
int8_t intArg1,
int16_t intArg2,
int32_t intArg3,
int64_t intArg4,
const char *intArg5,
int &intArg6,
float floatArg1,
double floatArg2,
int64_t intArg7,
int64_t intArg8,
int64_t intArg9,
int64_t intArg10);
static int x = 123456;
void callWithPrimitives() {
takePrimitives(1, 2, 3, 4, "hello", x, 0.5, 0.25, 6, 7, 8, 9);
}
and the corresponding generated assembly:
.LCPI0_0:
.long 0x3f000000 # float 0.5
.LCPI0_1:
.quad 0x3fd0000000000000 # double 0.25
callWithPrimitives(): # @callWithPrimitives()
pushq %rax
movss .LCPI0_0(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero
movsd .LCPI0_1(%rip), %xmm1 # xmm1 = mem[0],zero
movl $4, %ecx
movl $.L.str, %r8d
movl $x, %r9d
movl $1, %edi
movl $2, %esi
movl $3, %edx
pushq $9
pushq $8
pushq $7
pushq $6
callq takePrimitives(signed char, short, int, long, char const*, int&, float, double, long, long, long, long)
addq $32, %rsp
popq %rax
retq
.L.str:
.asciz "hello"
x:
.long 123456 # 0x1e240
This lays out parameter passing for primitives nicely. The first 6 integer or pointer arguments (intArg1 through intArg6 in the example) go into %rdi, %rsi, %rdx, %rcx, %r8, and %r9, in that order, regardless of whether they are 1, 2, 4, or 8 bytes in size. C++ reference parameters are represented as pointers. The first 8 floating point arguments go in registers %xmm0 through %xmm7 (here, we use only %xmm0 and %xmm1). Remaining arguments are pushed onto the stack from right to left.
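The register assignment rule above can be sketched as a small model. This is only an illustration under simplifying assumptions: it handles plain integer/pointer and floating-point arguments, ignores structs, long double, and variadic functions, and the names ArgClass, assignArgumentLocations, and exampleTakePrimitives are hypothetical, not part of any ABI or library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified argument classes: enough for the primitives discussed above.
enum class ArgClass { Integer, Float };

// Returns where each argument goes, in declaration order: a register name,
// or "stack" once the registers for that class are used up.
std::vector<std::string> assignArgumentLocations(const std::vector<ArgClass>& args) {
    static const char* const kIntRegs[] = {"%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"};
    static const char* const kFloatRegs[] = {"%xmm0", "%xmm1", "%xmm2", "%xmm3",
                                             "%xmm4", "%xmm5", "%xmm6", "%xmm7"};
    std::size_t nextInt = 0;
    std::size_t nextFloat = 0;
    std::vector<std::string> locations;
    for (ArgClass c : args) {
        if (c == ArgClass::Integer && nextInt < 6) {
            locations.push_back(kIntRegs[nextInt++]);
        } else if (c == ArgClass::Float && nextFloat < 8) {
            locations.push_back(kFloatRegs[nextFloat++]);
        } else {
            locations.push_back("stack");
        }
    }
    return locations;
}

// Usage: the takePrimitives signature from the example above
// (six integer/pointer arguments, two floating-point, four more integers).
void exampleTakePrimitives() {
    using A = ArgClass;
    const std::vector<ArgClass> signature = {
        A::Integer, A::Integer, A::Integer, A::Integer, A::Integer, A::Integer,
        A::Float,   A::Float,   A::Integer, A::Integer, A::Integer, A::Integer};
    const std::vector<std::string> where = assignArgumentLocations(signature);
    assert(where[5] == "%r9");    // intArg6, the reference parameter
    assert(where[6] == "%xmm0");  // floatArg1
    assert(where[8] == "stack");  // intArg7, the seventh integer argument
}
```

Note that the integer and floating-point counters are independent, which is why the two floating-point arguments do not use up any of the six integer registers.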
Suppose that you have a common series of arguments that you need to pass to several functions in a row. It would be ideal if those arguments stayed in the same order and positions:
#include <cstdint>
struct Color {
int r, g, b, a;
};
Color makeColor(int red, int green, int blue, int alpha);
Color makeColor(int red, int green, int blue) {
return makeColor(red, green, blue, 255);
}
Color makeColorBad(int alpha, int red, int green, int blue);
Color makeColorBad(int red, int green, int blue) {
return makeColorBad(255, red, green, blue);
}
and the generated assembly:
makeColor(int, int, int): # @makeColor(int, int, int)
movl $255, %ecx
jmp makeColor(int, int, int, int) # TAILCALL
makeColorBad(int, int, int): # @makeColorBad(int, int, int)
movl %edx, %ecx
movl %esi, %edx
movl %edi, %esi
movl $255, %edi
jmp makeColorBad(int, int, int, int) # TAILCALL
makeColorBad has to shift red, green, and blue to make everything end up in the correct registers, whereas makeColor just puts 255 into the appropriate register and continues on. [1]
Returning primitives is even simpler than passing them. Here’s a quick example:
#include <cstdint>
int64_t getInt();
double getDouble();
void sinkInt(int64_t);
// Primitives should normally be passed by value;
// using reference here because the return register
// for double is %xmm0 and so is the first argument
// register, so the assembly would not show %xmm0 at
// all if passed by value.
void sinkDouble(const double&);
void demonstrateResultReturn() {
sinkInt(getInt());
sinkDouble(getDouble());
}
and the generated assembly:
demonstrateResultReturn(): # @demonstrateResultReturn()
pushq %rax
callq getInt()
movq %rax, %rdi
callq sinkInt(long)
callq getDouble()
movsd %xmm0, (%rsp)
movq %rsp, %rdi
callq sinkDouble(double const&)
popq %rax
retq
We can see that integer/pointer return values go in %rax and floating-point return values go in %xmm0.
Now let’s see how structs work. Here are the different ways we could pass a 2-dimensional and 3-dimensional integer vector to a function:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
struct Vec3 {
int64_t x;
int64_t y;
int64_t z;
};
void takeVec2ByValue(Vec2);
void takeVec2ByPointer(Vec2*);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByPointer(Vec3*);
void takeVec3ByConstRef(const Vec3&);
void callVec2ByValue() {
Vec2 v{1, 2};
takeVec2ByValue(v);
}
void callVec2ByPointer() {
Vec2 v{1, 2};
takeVec2ByPointer(&v);
}
void callVec2ByConstRef() {
Vec2 v{1, 2};
takeVec2ByConstRef(v);
}
void callVec3ByValue() {
Vec3 v{1, 2, 3};
takeVec3ByValue(v);
}
void callVec3ByPointer() {
Vec3 v{1, 2, 3};
takeVec3ByPointer(&v);
}
void callVec3ByConstRef() {
Vec3 v{1, 2, 3};
takeVec3ByConstRef(v);
}
and here is the generated assembly:
callVec2ByValue(): # @callVec2ByValue()
movl $1, %edi
movl $2, %esi
jmp takeVec2ByValue(Vec2) # TAILCALL
callVec2ByConstRef(): # @callVec2ByConstRef()
subq $24, %rsp
movups .L__const.callVec2ByPointer().v(%rip), %xmm0
movaps %xmm0, (%rsp)
movq %rsp, %rdi
callq takeVec2ByConstRef(Vec2 const&)
addq $24, %rsp
retq
callVec2ByPointer(): # @callVec2ByPointer()
subq $24, %rsp
movups .L__const.callVec2ByPointer().v(%rip), %xmm0
movaps %xmm0, (%rsp)
movq %rsp, %rdi
callq takeVec2ByPointer(Vec2*)
addq $24, %rsp
retq
callVec3ByValue(): # @callVec3ByValue()
subq $24, %rsp
movq .L__const.callVec3ByPointer().v+16(%rip), %rax
movq %rax, 16(%rsp)
movups .L__const.callVec3ByPointer().v(%rip), %xmm0
movups %xmm0, (%rsp)
callq takeVec3ByValue(Vec3)
addq $24, %rsp
retq
callVec3ByConstRef(): # @callVec3ByConstRef()
subq $24, %rsp
movq .L__const.callVec3ByPointer().v+16(%rip), %rax
movq %rax, 16(%rsp)
movups .L__const.callVec3ByPointer().v(%rip), %xmm0
movaps %xmm0, (%rsp)
movq %rsp, %rdi
callq takeVec3ByConstRef(Vec3 const&)
addq $24, %rsp
retq
callVec3ByPointer(): # @callVec3ByPointer()
subq $24, %rsp
movq .L__const.callVec3ByPointer().v+16(%rip), %rax
movq %rax, 16(%rsp)
movups .L__const.callVec3ByPointer().v(%rip), %xmm0
movaps %xmm0, (%rsp)
movq %rsp, %rdi
callq takeVec3ByPointer(Vec3*)
addq $24, %rsp
retq
.L__const.callVec2ByPointer().v:
.quad 1 # 0x1
.quad 2 # 0x2
.L__const.callVec3ByPointer().v:
.quad 1 # 0x1
.quad 2 # 0x2
.quad 3 # 0x3
For our Vec2, passing by value is just like passing the two elements separately: they go in registers, assuming there aren’t too many other arguments. Passing by const reference is different: we create the Vec2 on the stack (in this case, by copying its data from a constant) and then pass a pointer to it in the first integer argument register, %rdi. This is identical to explicitly passing a pointer to a Vec2 on the stack.
Vec3 is different, because structs larger than two 8-byte words cannot go in registers. To pass a Vec3 by value, we push it onto the stack, and the called function knows that it will find its argument there. In contrast, when passing by const reference or pointer, we still create our Vec3 on the stack, but we must also explicitly pass a pointer to it in %rdi. This is because the referred-to Vec3 could, of course, be anywhere in memory; it doesn’t have to be on the caller’s stack.
Note that structs are laid out by packing fields together as close as possible, subject to alignment requirements (see “Aggregates and Unions” in Section 3.1.2 of the System V AMD64 ABI document). Let’s take a quick look at what would happen if we used int32_t instead of int64_t in our previous example:
#include <cstdint>
struct Vec2 {
int32_t x;
int32_t y;
};
struct Vec3 {
int32_t x;
int32_t y;
int32_t z;
};
void takeVec2ByValue(Vec2);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByConstRef(const Vec3&);
void callVec2ByValue() {
Vec2 v{1, 2};
takeVec2ByValue(v);
}
void callVec2ByConstRef() {
Vec2 v{1, 2};
takeVec2ByConstRef(v);
}
void callVec3ByValue() {
Vec3 v{1, 2, 3};
takeVec3ByValue(v);
}
void callVec3ByConstRef() {
Vec3 v{1, 2, 3};
takeVec3ByConstRef(v);
}
We can see from the generated assembly that Vec2 now fits in one register and Vec3 fits in two registers; we don’t waste space by using separate registers for each field.
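The size arithmetic behind this can be checked at compile time. The sketch below is a simplification: it counts 8-byte words only and applies the "at most two words" rule from the text, ignoring the ABI's finer classification details. The struct names and the helpers eightbytesFor and fitsInRegisters are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct Vec2i32 { int32_t x, y; };
struct Vec3i32 { int32_t x, y, z; };
struct Vec2i64 { int64_t x, y; };
struct Vec3i64 { int64_t x, y, z; };

// Number of 8-byte words needed to hold a value of the given size.
constexpr std::size_t eightbytesFor(std::size_t sizeInBytes) {
    return (sizeInBytes + 7) / 8;
}

// Simplified rule from the text: a plain struct of at most two 8-byte
// words can travel in registers; anything larger goes through memory.
constexpr bool fitsInRegisters(std::size_t sizeInBytes) {
    return eightbytesFor(sizeInBytes) <= 2;
}

// Fields are packed: two int32_t share one 8-byte word.
static_assert(sizeof(Vec2i32) == 8, "two packed int32_t fields");
static_assert(sizeof(Vec3i32) == 12, "three packed int32_t fields");
static_assert(eightbytesFor(sizeof(Vec2i32)) == 1, "Vec2i32 needs one register");
static_assert(eightbytesFor(sizeof(Vec3i32)) == 2, "Vec3i32 needs two registers");

// The int64_t versions from the earlier example.
static_assert(fitsInRegisters(sizeof(Vec2i64)), "16 bytes: two registers");
static_assert(!fitsInRegisters(sizeof(Vec3i64)), "24 bytes: passed in memory");
```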
Returning structs has a similar discontinuity going from Vec2 to Vec3:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
struct Vec3 {
int64_t x;
int64_t y;
int64_t z;
};
Vec2 globalVec2;
Vec3 globalVec3;
Vec2 getVec2();
Vec3 getVec3();
void getVec3ByOutParameter(Vec3 *out);
void testReturningVec2() {
globalVec2 = getVec2();
}
void testReturningVec3() {
globalVec3 = getVec3();
}
void testVec3OutParameter() {
Vec3 out;
getVec3ByOutParameter(&out);
globalVec3 = out;
}
and the generated assembly:
testReturningVec2(): # @testReturningVec2()
pushq %rax
callq getVec2()
movq %rax, globalVec2(%rip)
movq %rdx, globalVec2+8(%rip)
popq %rax
retq
testReturningVec3(): # @testReturningVec3()
subq $24, %rsp
movq %rsp, %rdi
callq getVec3()
movq 16(%rsp), %rax
movq %rax, globalVec3+16(%rip)
movups (%rsp), %xmm0
movups %xmm0, globalVec3(%rip)
addq $24, %rsp
retq
testVec3OutParameter(): # @testVec3OutParameter()
subq $24, %rsp
movq %rsp, %rdi
callq getVec3ByOutParameter(Vec3*)
movq 16(%rsp), %rax
movq %rax, globalVec3+16(%rip)
movups (%rsp), %xmm0
movups %xmm0, globalVec3(%rip)
addq $24, %rsp
retq
globalVec2:
.zero 16
globalVec3:
.zero 24
Just like for parameter passing, small structs like Vec2 get returned in registers %rax and %rdx, in that order. [2] Larger structs like Vec3 get returned on the stack: the calling function makes a temporary Vec3 and passes a pointer to it in %rdi, which the called function fills in. Notice that the assembly for testReturningVec3 is identical to the assembly for testVec3OutParameter!
If our Vec structs contained double instead of int64_t, things would work out similarly:
#include <cstdint>
struct Vec2 {
double x;
double y;
};
struct Vec3 {
double x;
double y;
double z;
};
void takeVec2ByValue(Vec2);
void takeVec2ByConstRef(const Vec2&);
void takeVec3ByValue(Vec3);
void takeVec3ByConstRef(const Vec3&);
void callVec2ByValue() {
Vec2 v{1, 2};
takeVec2ByValue(v);
}
void callVec2ByConstRef() {
Vec2 v{1, 2};
takeVec2ByConstRef(v);
}
void callVec3ByValue() {
Vec3 v{1, 2, 3};
takeVec3ByValue(v);
}
void callVec3ByConstRef() {
Vec3 v{1, 2, 3};
takeVec3ByConstRef(v);
}
The generated assembly (not included below for brevity) is nearly identical to the assembly for the int64_t case, except that callVec2ByValue puts x and y in %xmm0 and %xmm1 instead of %rdi and %rsi, just as if they were floating-point primitives not contained in a struct.
As you might expect, returning our new Vec2 is very similar to the previous case as well, using %xmm0 and %xmm1 instead of the integer return registers %rax and %rdx. Furthermore, if we have one double and one int64_t, they get returned in %xmm0 and %rax, but it is still the case that a 24-byte struct gets returned in memory, not registers, even if it is a mix of double and integers.
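Here is a minimal sketch of the mixed case, in case you want to look at the generated assembly yourself. The names Mixed, Mixed3, makeMixed, and makeMixed3 are hypothetical, and the register comments describe what the System V AMD64 rules lead us to expect rather than something the code itself can verify.

```cpp
#include <cassert>
#include <cstdint>

// 16 bytes: one floating-point 8-byte word followed by one integer word.
struct Mixed {
    double d;
    int64_t i;
};

// 24 bytes: too large for registers regardless of the field types.
struct Mixed3 {
    double d;
    int64_t i;
    int64_t j;
};

// noinline (a GCC/Clang attribute) keeps the call visible in the assembly.
__attribute__((noinline)) Mixed makeMixed(double d, int64_t i) {
    // Expected on x86_64 System V: d comes back in %xmm0 and i in %rax.
    return Mixed{d, i};
}

__attribute__((noinline)) Mixed3 makeMixed3(double d, int64_t i, int64_t j) {
    // Expected: the caller passes a hidden pointer to the result in %rdi.
    return Mixed3{d, i, j};
}

void exampleMixedReturn() {
    const Mixed m = makeMixed(0.5, 7);
    assert(m.d == 0.5 && m.i == 7);
    const Mixed3 m3 = makeMixed3(0.25, 1, 2);
    assert(m3.d == 0.25 && m3.i == 1 && m3.j == 2);
}
```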
One way that structure return can make a difference is when considering how to return multiple values from a function. You might do it by returning a struct, or you might return one value normally and one via an out parameter:
#include <utility>
// Let's not worry about the best way to read things from files.
struct FileHandle;
// Returns whether read succeeded without error; read integer goes
// into *out.
bool readIntFromFile(FileHandle& f, int* out);
std::pair<int, bool> readIntFromFile(FileHandle& f);
int lastIntRead;
void readIntOutParameter(FileHandle& f) {
int myInt;
if (readIntFromFile(f, &myInt)) {
lastIntRead = myInt;
}
}
void readIntStruct(FileHandle& f) {
auto p = readIntFromFile(f);
if (p.second) {
lastIntRead = p.first;
}
}
Here’s the generated assembly:
readIntOutParameter(FileHandle&): # @readIntOutParameter(FileHandle&)
pushq %rax
leaq 4(%rsp), %rsi
callq readIntFromFile(FileHandle&, int*)
testb %al, %al
je .LBB0_2
movl 4(%rsp), %eax
movl %eax, lastIntRead(%rip)
.LBB0_2:
popq %rax
retq
readIntStruct(FileHandle&): # @readIntStruct(FileHandle&)
pushq %rax
callq readIntFromFile(FileHandle&)
btq $32, %rax
jae .LBB1_2
movl %eax, lastIntRead(%rip)
.LBB1_2:
popq %rax
retq
lastIntRead:
.long 0 # 0x0
If we use an out parameter, we have to reserve stack space for it, pass a pointer to that stack space as the out parameter, and then load the value from memory in order to use it. If we return multiple values in a struct that fits in registers, we can access them without touching memory again. On modern processors, arithmetic operations on the CPU tend to be significantly faster than memory access, so I would guess that the struct return option would be slightly better. The important point to take away, though, is that it’s not worse, so if you think it’s more readable, go for it!
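If std::pair's first and second hurt readability, a small named struct is one alternative. The sketch below assumes a made-up in-memory FileHandle (the article leaves FileHandle opaque) and a hypothetical ReadResult type; at 8 bytes, ReadResult should still come back in a register like the pair did.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the opaque FileHandle: reads from a list in memory.
struct FileHandle {
    std::vector<int> values;
    std::size_t next = 0;
};

// Named alternative to std::pair<int, bool>.
struct ReadResult {
    int value;
    bool ok;
};

ReadResult readIntFromFile(FileHandle& f) {
    if (f.next >= f.values.size()) {
        return ReadResult{0, false};  // nothing left to read
    }
    return ReadResult{f.values[f.next++], true};
}

int lastIntRead = 0;

void readIntStruct(FileHandle& f) {
    const ReadResult r = readIntFromFile(f);
    if (r.ok) {
        lastIntRead = r.value;
    }
}

void exampleReadTwoInts() {
    FileHandle f;
    f.values = {10, 20};
    readIntStruct(f);
    assert(lastIntRead == 10);
    readIntStruct(f);
    assert(lastIntRead == 20);
    readIntStruct(f);  // read fails, so lastIntRead is left unchanged
    assert(lastIntRead == 20);
}
```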
The System V AMD64 ABI document focuses on the C programming language. How to represent C++ types on x86_64 Linux and macOS is governed instead by the Itanium C++ ABI. In particular, Chapter 3 of that document “describes how to define and call functions.”
this pointer
First, let’s talk about C++ member functions. Where does the this pointer come from? It turns out that it is a secret extra first argument to all C++ member functions:
struct GetThis {
const void *getThis() const;
};
const void *GetThis::getThis() const {
return this;
}
const void *equivalentFunction(const void *p) {
return p;
}
We can see from the generated assembly that GetThis::getThis() const and equivalentFunction work exactly the same way.
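The same idea can be observed from within C++, without reading assembly. This sketch uses std::mem_fn from the standard library to make the object argument explicit; exampleThisIsObjectAddress is a hypothetical name, and getThis is defined inline here only to keep the example self-contained.

```cpp
#include <cassert>
#include <functional>

struct GetThis {
    const void* getThis() const { return this; }
};

void exampleThisIsObjectAddress() {
    GetThis g;
    // The member function sees the address of the object it was called on.
    assert(g.getThis() == &g);

    // std::mem_fn wraps the member function so the object is passed
    // as an ordinary first argument, mirroring what the ABI does.
    auto explicitCall = std::mem_fn(&GetThis::getThis);
    assert(explicitCall(&g) == &g);
}
```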
*this has to go in memory
There is an immediate, arguably disappointing consequence of this ABI for member functions. We saw previously that small structs can (copy constructors permitting, as we’ll see later) be passed in registers. However, if you call a non-inline member function on a small C++ object, this must be a pointer, so the object must go in memory, even if it started life in registers!
Let’s revisit the debugPrint example from How to Read Assembly Language:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
void debugPrint() const;
};
int64_t normSquared(Vec2 v) {
v.debugPrint();
return v.x * v.x + v.y * v.y;
}
and its generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
We are forced to copy %rdi and %rsi (which hold our Vec2) to the stack so that we can supply the this pointer to Vec2::debugPrint() const! If we had instead written void Vec2DebugPrint(Vec2 v), we could just call it directly:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
void Vec2DebugPrint(Vec2 v);
int64_t normSquared(Vec2 v) {
Vec2DebugPrint(v);
return v.x * v.x + v.y * v.y;
}
Looking at the new generated assembly:
pushq %r14
pushq %rbx
pushq %rax
movq %rsi, %r14
movq %rdi, %rbx
callq Vec2DebugPrint(Vec2)
imulq %rbx, %rbx
imulq %r14, %r14
leaq (%r14,%rbx), %rax
addq $8, %rsp
popq %rbx
popq %r14
retq
Now, we keep %rsi and %rdi in registers and don’t have to spill them to the stack. However, this doesn’t come for free: we still have to push and pop %r14 and %rbx, which we use to stash our Vec2 so we can continue using it after Vec2DebugPrint returns. We could come out ahead if our function were longer and had to push and pop those registers anyway.
Would I recommend avoiding member functions just to pass objects in registers? Definitely not, but it is something to keep in the back of your mind in case you ever need to squeeze every last bit of performance out of some code. It would be even better to just ensure that simple member functions can be inlined and avoid parameter passing concerns altogether.
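As a sketch of that last suggestion: defining a simple member function inside the class body makes it implicitly inline and visible to every caller, which gives the compiler the chance (not a guarantee) to inline it and keep the object in registers. The functions normSquared and distanceSquared here are hypothetical examples, not code from earlier in the article.

```cpp
#include <cassert>
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    // Defined in the class body, so implicitly inline.
    int64_t normSquared() const { return x * x + y * y; }
};

int64_t distanceSquared(Vec2 a, Vec2 b) {
    const Vec2 delta{a.x - b.x, a.y - b.y};
    // If this call is inlined, no this pointer needs to be materialized.
    return delta.normSquared();
}

void exampleDistanceSquared() {
    // (4, 6) and (1, 2) differ by (3, 4), so the squared distance is 25.
    assert(distanceSquared(Vec2{4, 6}, Vec2{1, 2}) == 25);
}
```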
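The Itanium ABI defines a type to be “non-trivial for the purposes of calls” if (quoting directly from the ABI):
- it has a non-trivial copy constructor, move constructor, or destructor, or
- all of its copy and move constructors are deleted.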
A copy constructor, move constructor, or destructor is “trivial” if, in short, it is implicitly declared or defaulted on its first declaration (that is, not user-provided) and the class has no virtual member functions. [3] In particular, ~MyClass() = default is trivial, but ~MyClass() {} is not.
The consequences of a type being non-trivial for the purposes of calls are not good: in short, the object must be copied or moved to a temporary on the stack as appropriate, passed by reference, and then have the temporary’s destructor called.
Let’s see an example:
struct TrivialForPurposesOfCalls {
int x;
// Doesn't matter if we uncomment this or not.
// ~TrivialForPurposesOfCalls() = default;
};
struct NontrivialForPurposesOfCalls {
int x;
~NontrivialForPurposesOfCalls() {}
};
void sink(TrivialForPurposesOfCalls);
void sink(NontrivialForPurposesOfCalls);
void sinkConstRef(const NontrivialForPurposesOfCalls&);
void passTrivial() {
sink(TrivialForPurposesOfCalls{1});
}
void passNontrivial() {
sink(NontrivialForPurposesOfCalls{1});
}
void passNontrivialConstRef() {
sinkConstRef(NontrivialForPurposesOfCalls{1});
}
and the generated assembly:
passTrivial(): # @passTrivial()
movl $1, %edi
jmp sink(TrivialForPurposesOfCalls) # TAILCALL
passNontrivial(): # @passNontrivial()
pushq %rax
movl $1, (%rsp)
movq %rsp, %rdi
callq sink(NontrivialForPurposesOfCalls)
popq %rax
retq
passNontrivialConstRef(): # @passNontrivialConstRef()
pushq %rax
movl $1, (%rsp)
movq %rsp, %rdi
callq sinkConstRef(NontrivialForPurposesOfCalls const&)
popq %rax
retq
Our empty, user-provided destructor for NontrivialForPurposesOfCalls was a mistake! Even though we wrote pass-by-value, the object cannot really be passed in a register; we must pass it as though we were passing by const reference.
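The destructor difference can be checked at compile time with the standard type traits. Note the hedge: std::is_trivially_destructible and std::is_trivially_copyable are related to, but not the same as, the ABI's “non-trivial for the purposes of calls” rule, so treat this as a rough proxy. DefaultedDestructor is a hypothetical extra type standing in for the commented-out line in the example above.

```cpp
#include <type_traits>

struct TrivialForPurposesOfCalls {
    int x;
};

struct DefaultedDestructor {
    int x;
    ~DefaultedDestructor() = default;  // defaulted on first declaration: trivial
};

struct NontrivialForPurposesOfCalls {
    int x;
    ~NontrivialForPurposesOfCalls() {}  // user-provided, even though it is empty
};

static_assert(std::is_trivially_destructible<TrivialForPurposesOfCalls>::value,
              "implicit destructor is trivial");
static_assert(std::is_trivially_destructible<DefaultedDestructor>::value,
              "= default keeps the destructor trivial");
static_assert(!std::is_trivially_destructible<NontrivialForPurposesOfCalls>::value,
              "an empty user-provided body makes the destructor non-trivial");

// Trivially copyable types also require a trivial destructor.
static_assert(std::is_trivially_copyable<TrivialForPurposesOfCalls>::value,
              "plain struct is trivially copyable");
static_assert(!std::is_trivially_copyable<NontrivialForPurposesOfCalls>::value,
              "non-trivial destructor rules this out");
```

A static_assert like these next to a small value type is one cheap way to notice if someone later adds a destructor body.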
For another example of worse generated code for a non-trivial type, check out Arthur O’Dwyer’s [[trivial_abi]] 101 article.
std::unique_ptr
One standard library type that gets affected by this is std::unique_ptr. [4] std::unique_ptr clearly has a non-trivial destructor, so it is non-trivial for purposes of calls. Let’s compare some simple unique_ptr code to a version using int *:
#include <cstdio>
#include <memory>
#include <utility>
// Suppose these functions are not inlined because
// they're defined in another file.
__attribute__((noinline))
void printAndFreeHeapAllocatedInt(std::unique_ptr<int> x) {
printf("%d\n", *x);
}
__attribute__((noinline))
void printAndFreeHeapAllocatedInt(int *x) {
printf("%d\n", *x);
delete x;
}
void consumeHeapAllocatedInt(std::unique_ptr<int> x) {
printAndFreeHeapAllocatedInt(std::move(x));
}
void consumeHeapAllocatedInt(int *x) {
printAndFreeHeapAllocatedInt(x);
}
From reading the generated assembly, we can see that the std::unique_ptr version has the following extra costs compared to the int * version:
- It has to write a null pointer (nullptr) to the moved-from std::unique_ptr<int> x.
- It has to move x to a temporary std::unique_ptr<int> on the stack.
- After printAndFreeHeapAllocatedInt returns, it checks to see if the temporary is nullptr (it doesn’t know if printAndFreeHeapAllocatedInt wrote a nullptr to it or not!).
- It requires printAndFreeHeapAllocatedInt to dereference its implicit std::unique_ptr<int>*.
Given the current Itanium C++ ABI, these costs are not avoidable when std::unique_ptr<int> is passed by value. [5]
const std::shared_ptr<int>& isn’t an extra indirection compared to std::shared_ptr<int>
If you were familiar with the way parameter passing works for C types but didn’t know about the “non-trivial for the purposes of calls” rules, you might have thought that passing const std::shared_ptr<T>& implied a double indirection that shared_ptr<T> does not. (In other words, the former would pass a pointer to pointer to T, whereas the latter would pass a pointer to T.) This is not the case:
#include <memory>
void print(int);
void byConstRef(const std::shared_ptr<int>& x) {
print(*x);
}
void byValue(std::shared_ptr<int> x) {
print(*x);
}
As we can see, both of these functions actually have the same assembly. [6] There is no need to worry about an “extra” layer of indirection: it is already there in the by-value case! (Of course, it would be better to just pass an int directly and have callers dereference any std::shared_ptr they may or may not have.)
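A minimal sketch of that parenthetical advice follows. The print function here is a made-up stand-in that records its argument instead of printing, and printValue and exampleCallers are hypothetical names; the point is only that a function taking int serves callers holding any kind of pointer, or none.

```cpp
#include <cassert>
#include <memory>

int lastPrinted = 0;

// Hypothetical stand-in for print(int): records the value instead of printing.
void print(int value) { lastPrinted = value; }

// Takes the int itself, so it does not care how the caller owns it.
void printValue(int x) { print(x); }

void exampleCallers() {
    std::shared_ptr<int> shared = std::make_shared<int>(42);
    std::unique_ptr<int> unique(new int(7));

    printValue(*shared);  // caller dereferences; the int goes in %edi
    assert(lastPrinted == 42);

    printValue(*unique);
    assert(lastPrinted == 7);

    printValue(3);  // no smart pointer needed at all
    assert(lastPrinted == 3);
}
```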
[1] To be fair, register-to-register moves are often unusually cheap instructions due to move elimination by the CPU, but they’re still not free and they certainly increase code size.
[2] Not all architectures’ ABIs have this nice symmetry between result return and parameter passing. On 32-bit ARM, parameters go in r0 through r3, similarly to x86_64, and results are returned in r0. 64-bit primitives like int64_t and double are returned in r0 and r1, again similarly to x86_64. However, a struct containing 2 ints has to be returned in memory, not in r0 and r1! (See Procedure Call Standard for the ARM Architecture, Section 5.4. ARM64 fixed this; the corresponding section of Procedure Call Standard for the ARM 64-Bit Architecture defines result return simply by saying that result return for a type goes in the same registers that would be used if the type were passed as an argument.)
[3] The full definitions of trivial for copy constructors, move constructors, and destructors are slightly more complicated.
[4] Chandler Carruth presented these concerns as part of his excellent CppCon 2019 talk, “There Are No Zero-Cost Abstractions”.
[5] For more discussion on how these costs could be avoided in both the short term and the long term, see Chandler Carruth’s talk mentioned in the previous footnote.
[6] For byValue, the caller is responsible for destroying x.