UPDATE: This article now has an ARM64 port.
Why, in 2021, does anyone need to learn about assembly language? First, reading assembly language is the way to know exactly what your program is doing. Why, exactly, is that C++ program 1 MiB (say) instead of 100 KiB? Is it possible to squeeze some more performance out of that function that gets called all the time?
For C++ in particular, it is easy to forget or just not notice some operation (e.g., an implicit conversion or a call to a copy constructor or destructor) that is implied by the source code and language semantics, but not spelled out explicitly. Looking at the assembly generated by the compiler puts everything in plain sight.
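As a concrete (hypothetical) illustration of the kind of hidden operation meant here — mine, not one of the examples below — consider an innocuous-looking loop that quietly copies a std::string on every iteration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The loop variable is declared by value instead of by const
// reference, so each iteration copy-constructs (and then destroys)
// a std::string. Nothing in the source spells out "copy", but the
// generated assembly will show the copy constructor and destructor
// calls plainly.
size_t totalLength(const std::vector<std::string>& names) {
    size_t total = 0;
    for (std::string s : names) {  // hidden copy per element
        total += s.size();
    }
    return total;
}
```

Writing `for (const std::string& s : names)` instead makes the copies disappear, and the difference is obvious in the assembly even when it is easy to miss in the source.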
Second, the more practical reason: so far, posts on this blog haven’t required an understanding of assembly language, despite constant links to Compiler Explorer. By popular demand, however, our next topic will be parameter passing, and for that, we will need a basic understanding of assembly language. We will focus only on reading assembly language, not writing it.
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. (The x86 architecture has lots of not-so-small instructions as well. Some of these are legacy cruft built up over the 40-odd years of the architecture’s existence, and others are newfangled additions.)
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
int64_t normSquared(Vec2 v) {
return v.x * v.x + v.y * v.y;
}
and here is the resulting x86-64 assembly from clang 11, via Compiler Explorer:1
imulq %rdi, %rdi
imulq %rsi, %rsi
leaq (%rsi,%rdi), %rax
retq
Let’s talk about that first instruction: imulq %rdi, %rdi. This instruction performs signed integer multiplication. The q suffix tells us that it is operating on 64-bit quantities. (In contrast, l, w, and b would denote 32-bit, 16-bit, and 8-bit, respectively.) It multiplies the value in the first given register (rdi; register names are prefixed with a % sign) by the value in the second register and stores the result in that second register. This is squaring v.x in our example C++ code. The second instruction does the same with the value in %rsi, which squares v.y.
Next, we have an odd instruction: leaq (%rsi,%rdi), %rax. lea stands for “load effective address”, and it stores the address of the first operand into the second operand. (%rsi,%rdi) means “the memory location pointed to by %rsi + %rdi”, so this is just adding %rsi and %rdi and storing the result in %rax. lea is a quirky x86-specific instruction; on a more RISC-y architecture like ARM64, we would expect to see a plain old add instruction.2
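To see lea doing its intended job of address computation rather than plain arithmetic, here is a small sketch of my own (not from the examples above):

```cpp
#include <cstdint>

// Computing the address of an array element. On x86-64, clang and
// gcc typically compile this to a single instruction along the lines
// of leaq (%rdi,%rsi,8), %rax -- "load the address base + 8*i" --
// without ever touching the pointed-to memory.
int64_t* elementAddress(int64_t* base, int64_t i) {
    return base + i;  // pointer arithmetic scales i by sizeof(int64_t)
}
```

The base-plus-scaled-index addressing mode is why lea can also moonlight as a cheap add-and-multiply, as in our normSquared output.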
Finally, retq returns from the normSquared function.
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. Some of them have specific uses that we’ll see later. I wouldn’t be able to rattle this off from memory, but per Wikipedia, the full list3 of 16 registers on x86_64 is rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi, r8, r9, r10, r11, r12, r13, r14, and r15.
Now, let’s extend our example to debug print the Vec2 in normSquared:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
void debugPrint() const;
};
int64_t normSquared(Vec2 v) {
v.debugPrint();
return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
In addition to the obvious call to Vec2::debugPrint() const, we have some other new instructions and registers! %rsp is special: it is the “stack pointer”, used to maintain the function call stack. It points to the bottom of the stack, which grows “down” (toward lower addresses) on x86. So, our subq $24, %rsp instruction is making space for three 64-bit integers on the stack. (In general, setting up the stack and registers at the start of your function is called the function prologue.) Then, the following two mov instructions store the first and second arguments to normSquared, which are v.x and v.y (more about how parameter passing works in the next blog post!) to the stack, effectively creating a copy of v in memory at the address %rsp + 8. Next, we load the address of our copy of v into %rdi with leaq 8(%rsp), %rdi and then call Vec2::debugPrint() const.
After debugPrint has returned, we load v.x and v.y back into %rcx and %rax. We have the same imulq and addq instructions as before. Finally, we addq $24, %rsp to clean up the 24 bytes4 of stack space we allocated at the start of our function (called the function epilogue), and then return to our caller with retq.
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings.5 We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>
void copyUppercase(char *dest, const char *src);
constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;
void printUpperCase(const char *s) {
auto sSize = strlen(s);
if (sSize <= MAX_STACK_ARRAY_SIZE) {
char temp[sSize + 1];
copyUppercase(temp, s);
puts(temp);
} else {
// std::make_unique_for_overwrite is missing on Godbolt.
std::unique_ptr<char[]> temp(new char[sSize + 1]);
copyUppercase(temp.get(), s);
puts(temp.get());
}
}
Here is the generated assembly:6
printUpperCase(char const*): # @printUpperCase(char const*)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
callq strlen
leaq 1(%rax), %rdi
cmpq $1024, %rax # imm = 0x400
ja .LBB0_2
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.LBB0_2:
callq operator new[](unsigned long)
movq %rax, %rbx
movq %rax, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
The pushq %rbp; movq %rsp, %rbp sequence is very common: it pushes the frame pointer stored in %rbp to the stack and saves the old stack pointer (which is the new frame pointer) in %rbp. The following four pushq instructions store registers that we need to save before using.7
On to the function body. We save our first argument (%rdi) in %r14, because we are about to call strlen and it, like any other function, may overwrite %rdi (we say %rdi is “caller-saved”) but not %r14 (we say %r14 is “callee-saved”). We call strlen(s) with callq strlen and store sSize + 1 in %rdi with leaq 1(%rax), %rdi.
Next, we finally see our first if statement! cmpq $1024, %rax sets the flags register according to the result of %rax - $1024, and then ja .LBB0_2 (“jump if above”) transfers control to the location labeled .LBB0_2 if the flags indicate that %rax > 1024. In general, higher-level control-flow primitives like if/else statements and loops are implemented in assembly using conditional jump instructions.
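As a minimal sketch of that lowering (my own example, mirroring the size check in printUpperCase):

```cpp
#include <cstddef>

// A size check like the one in printUpperCase. Compilers typically
// lower this to a cmpq against 1024 followed by a conditional jump
// (ja / jbe) to one arm or the other -- there is no structured
// if/else at the machine level, only compare-and-branch.
bool fitsOnStack(size_t n) {
    if (n <= 1024) {
        return true;   // fall-through path when the branch is not taken
    }
    return false;      // reached via the conditional jump
}
```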
Let’s first look at the path where %rax <= 1024 and thus the branch to .LBB0_2 was not taken. We have a blob of instructions to create char temp[sSize + 1] on the stack:
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
We save %rsp to %r15 and %rbx for later use.8 Then, we add 15 to %rdi (which, remember, contains the size of our array), mask off the lower 4 bits with andq $-16, %rdi, and subtract the result from %rbx, which we then put back into %rsp. In short, this rounds the array size up to the next multiple of 16 bytes and makes space for it on the stack.
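The add-15-then-mask trick is worth seeing in C++ form (my own restatement of the assembly above):

```cpp
#include <cstddef>

// (n + 15) & -16 rounds n up to a multiple of 16: -16 in two's
// complement is ...11110000, so the andq clears the low 4 bits,
// and adding 15 first ensures we round up rather than down.
size_t roundUpTo16(size_t n) {
    return (n + 15) & ~static_cast<size_t>(15);
}
```

The same pattern with a different mask works for any power-of-two alignment, which is why it shows up constantly in compiler-generated stack code.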
The following block simply calls copyUppercase and puts as written in the code:
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
Finally, we have our function epilogue:
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
We restore the stack pointer to deallocate our variable-length array using leaq. Then, we popq the registers we saved during the function prologue and return control to our caller, and we are done.
Next, let’s look at the path when %rax > 1024 and we branch to .LBB0_2. This path is more straightforward. We call operator new[], save the result (returned in %rax) to %rbx, and call copyUppercase and puts as before. We have a separate function epilogue for this case, and it looks a bit different:
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
The first mov sets up %rdi with a pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer and pop our saved registers. Finally, we have a new instruction: jmp operator delete[](void*). jmp is just like goto: it transfers control to the given label or function. Unlike callq, it does not push the return address onto the stack. So, when operator delete[] returns, it will instead transfer control to printUpperCase’s caller. In essence, we’ve combined a callq to operator delete[] with our own retq. This is called tail call optimization, hence the # TAILCALL comment helpfully emitted by the compiler.
I said in the introduction that reading assembly makes implicit copy and destroy operations abundantly clear. We saw some of that in our previous example, but I want to close by looking at a common C++ move semantics debate. Is it OK to take parameters by value in order to avoid having one overload for lvalue references and another overload for rvalue references? There is a school of thought that says “yes, because in the lvalue case you will make a copy anyway, and in the rvalue case it’s fine as long as your type is cheap to move”. If we take a look at an example for the rvalue case, we will see that “cheap to move” does not mean “free to move”, as much as we might prefer otherwise. If we want maximum performance, we can demonstrate that the overload solution will get us there and the by-value solution will not. (Of course, if we aren’t willing to write extra code to improve performance, then “cheap to move” is probably cheap enough.)
#include <string>
class MyString {
std::string str;
public:
explicit MyString(const std::string& s);
explicit MyString(std::string&& s);
};
class MyOtherString {
std::string str;
public:
explicit MyOtherString(std::string s);
};
void createRvalue1(std::string&& s) {
MyString s2(std::move(s));
};
void createRvalue2(std::string&& s) {
MyOtherString s2(std::move(s));
};
If we look at the generated assembly9 (which is too long to include even though I’ve intentionally outlined10 the constructors in question), we can see that createRvalue1 does 1 move operation (inside the body of MyString::MyString(std::string&&)) and 1 std::string::~string() call (the operator delete before returning). In contrast, createRvalue2 is much longer: it does a total of 2 move operations (1 inline, into the s parameter for MyOtherString::MyOtherString(std::string s), and 1 in the body of that same constructor) and 2 std::string::~string calls (1 for the aforementioned s parameter and 1 for the MyOtherString::str member). To be fair, moving std::string is cheap and so is destroying a moved-from std::string, but it is not free in terms of either CPU time or code size.
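You can also check the move counts without reading any assembly; here is a rough sketch of my own (requires C++17) using a counting type in place of std::string:

```cpp
#include <utility>

// A stand-in for std::string that counts its move constructions.
// Note the argument below must be an xvalue (a std::move'd lvalue),
// matching the createRvalue examples; a prvalue argument would be
// elided directly into the by-value parameter and hide one move.
struct Counter {
    static inline int moves = 0;
    Counter() = default;
    Counter(Counter&&) noexcept { ++moves; }
};

struct ByRvalueRef {  // like MyString: rvalue-reference overload
    Counter c;
    explicit ByRvalueRef(Counter&& in) : c(std::move(in)) {}
};

struct ByValue {      // like MyOtherString: by-value parameter
    Counter c;
    explicit ByValue(Counter in) : c(std::move(in)) {}
};
```

Constructing ByRvalueRef from a moved-from lvalue performs 1 move; ByValue performs 2 (one into the parameter, one into the member), matching the counts we read out of the assembly.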
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to x86 assembly in particular was in the context of security exploits, or learning to become a “l33t h4x0r”, as the kids used to say. If this strikes you as a more entertaining reason to learn about assembly, great! The classic work in the space is Smashing the Stack for Fun and Profit. Unfortunately, modern security mitigations complicate running the examples in that article on your own, so I recommend finding a more modern practice environment. Microcorruption is an industry-created example, or you could try finding an application security project from a college security course to follow along with (e.g., Project 1 from Berkeley’s CS 161, which seems to be publicly available currently).
Finally, there is always Google and Hacker News. Pat Shaughnessy’s “Learning to Read x86 Assembly Language” from 2016 covers the topic from the perspective of Ruby and Crystal, and there was also a recent (2020) discussion on how to learn x86_64 assembly.
Good luck, and happy hacking!
I use AT&T syntax because it’s the default syntax in Linux tools. If you prefer Intel syntax, the toggle is on Compiler Explorer under “Output”. Compiler Explorer links in this article will show both, with AT&T on the left and Intel on the right. Guides to the differences are short and readily available; briefly, Intel syntax is more explicit about memory references, drops the b/w/l/q suffixes, and puts the destination operand first instead of last. ↩︎
If you actually look at the ARM64 assembly for this example, you’ll see an madd instruction get used instead: madd x0, x0, x0, x8. This is a multiply+add in one instruction: it’s doing x0 = x0 * x0 + x8. ↩︎
These are just the 64-bit registers used by most integer instructions. There are actually a lot more registers that came with floating point and instruction set extensions. ↩︎
You may have noticed that we only used 16 bytes of stack space despite allocating 24. As far as I can tell, the extra 8 bytes are left over from the code to set up and restore the frame pointer, which was optimized out. Clang, gcc, and icc all seem to leave the extra 8 bytes in, and msvc seems to waste 16 bytes instead of 8. If we build with -fno-omit-frame-pointer, we can see that the other 8 bytes are used to pushq %rbp at the start of the function and later popq %rbp at the end. Compilers aren’t perfect; you will see this sort of small missed optimization from time to time if you read assembly a lot. Sometimes things really are missed optimization opportunities, but there are also lots of unfortunate ABI constraints that force suboptimal code generation for reasons of compatibility between pieces of code built with different compilers (or even different versions of the same compiler). UPDATE: The extra 8 bytes of stack space are because Section 3.2.2 of the System V x86_64 ABI requires that the stack frame must be aligned to a 16-byte boundary when calling a function. In other words, every compiler made this “mistake” because it’s required! ↩︎
Also suppose that we don’t have something like absl::FixedArray available. I didn’t want to complicate the example any further. ↩︎
I built with -fno-exceptions to simplify the example by removing the exception cleanup path. It appears right after a tail call, which I think might be confusing. ↩︎
Another possible missed optimization: I don’t see a need to pushq %rax here; it’s not callee-saved and we don’t care about the value on entry to printUpperCase. Get in touch if you know whether this is a missed optimization or there’s actually a reason to do it! UPDATE: this is likely because pushing a register is smaller and/or faster than issuing a sub 8, %rsp instruction. ↩︎
Yet again, I think that this movq %rsp, %r15 is not needed. %r15 is not used again until we movq %r15, %rsp, but that instruction is immediately followed by leaq -24(%rbp), %rsp, which overwrites %rsp immediately. I think that we could improve the code by removing the two movq %rsp, %r15 and movq %r15, %rsp instructions. On the other hand, Intel’s icc compiler also does seemingly silly things to restore %rsp given this code, so either there is a good reason to do them, or cleaning up stack pointer manipulations in the presence of variable-length arrays is just a hard or neglected problem in compilers. Again, feel free to reach out if you know which it is! ↩︎
Again, I built with -fno-exceptions to avoid complicating things with exception cleanup paths. ↩︎
If we inline the constructors for MyString and MyOtherString, we do get some savings in createRvalue2: we call operator delete at most once. However, we still do 2 move operations and we require 32 extra bytes of stack space. ↩︎