UPDATE: This article now has an ARM64 port.
Why, in 2021, does anyone need to learn about assembly language? First, reading assembly language is the way to know exactly what your program is doing. Why, exactly, is that C++ program 1 MiB (say) instead of 100 KiB? Is it possible to squeeze some more performance out of that function that gets called all the time?
For C++ in particular, it is easy to forget or just not notice some operation (e.g., an implicit conversion or a call to a copy constructor or destructor) that is implied by the source code and language semantics, but not spelled out explicitly. Looking at the assembly generated by the compiler puts everything in plain sight.
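As a concrete (hypothetical) illustration of the kind of hidden operation meant here — mine, not one of the examples below — consider an innocuous-looking loop that quietly copies a std::string on every iteration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The loop variable is declared by value instead of by const
// reference, so each iteration copy-constructs (and then destroys)
// a std::string. Nothing in the source spells out "copy", but the
// generated assembly will show the copy constructor and destructor
// calls plainly.
size_t totalLength(const std::vector<std::string>& names) {
    size_t total = 0;
    for (std::string s : names) {  // hidden copy per element
        total += s.size();
    }
    return total;
}
```

Writing `for (const std::string& s : names)` instead makes the copies disappear, and the difference is obvious in the assembly even when it is easy to miss in the source.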
Second, the more practical reason: so far, posts on this blog haven’t required an understanding of assembly language, despite constant links to Compiler Explorer. By popular demand, however, our next topic will be parameter passing, and for that, we will need a basic understanding of assembly language. We will focus only on reading assembly language, not writing it.
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. (The x86 architecture has lots of not-so-small instructions as well. Some of these are legacy cruft built up over the 40-odd years of the architecture’s existence, and others are newfangled additions.)
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
};
int64_t normSquared(Vec2 v) {
return v.x * v.x + v.y * v.y;
}
and here is the resulting x86-64 assembly from clang 11, via Compiler Explorer:1
imulq %rdi, %rdi
imulq %rsi, %rsi
leaq (%rsi,%rdi), %rax
retq
Let’s talk about that first instruction: imulq %rdi, %rdi. This instruction performs signed integer multiplication. The q suffix tells us that it is operating on 64-bit quantities. (In contrast, l, w, and b would denote 32-bit, 16-bit, and 8-bit, respectively.) It multiplies the value in the first given register (rdi; register names are prefixed with a % sign) by the value in the second register and stores the result in that second register. This is squaring v.x in our example C++ code. The second instruction does the same with the value in %rsi, which squares v.y.
Next, we have an odd instruction: leaq (%rsi,%rdi), %rax. lea stands for “load effective address”, and it stores the address of the first operand into the second operand. (%rsi,%rdi) means “the memory location pointed to by %rsi + %rdi”, so this is just adding %rsi and %rdi and storing the result in %rax. lea is a quirky x86-specific instruction; on a more RISC-y architecture like ARM64, we would expect to see a plain old add instruction.2
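To see lea doing its intended job of address computation rather than plain arithmetic, here is a small sketch of my own (not from the examples above):

```cpp
#include <cstdint>

// Computing the address of an array element. On x86-64, clang and
// gcc typically compile this to a single instruction along the lines
// of leaq (%rdi,%rsi,8), %rax -- "load the address base + 8*i" --
// without ever touching the pointed-to memory.
int64_t* elementAddress(int64_t* base, int64_t i) {
    return base + i;  // pointer arithmetic scales i by sizeof(int64_t)
}
```

The base-plus-scaled-index addressing mode is why lea can also moonlight as a cheap add-and-multiply, as in our normSquared output.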
Finally, retq returns from the normSquared function.
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. Some of them have specific uses that we’ll see later. I wouldn’t be able to rattle this off from memory, but per Wikipedia, the full list3 of 16 registers on x86_64 is rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi, r8, r9, r10, r11, r12, r13, r14, and r15.
Now, let’s extend our example to debug print the Vec2 in normSquared:
#include <cstdint>
struct Vec2 {
int64_t x;
int64_t y;
void debugPrint() const;
};
int64_t normSquared(Vec2 v) {
v.debugPrint();
return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
subq $24, %rsp
movq %rdi, 8(%rsp)
movq %rsi, 16(%rsp)
leaq 8(%rsp), %rdi
callq Vec2::debugPrint() const
movq 8(%rsp), %rcx
movq 16(%rsp), %rax
imulq %rcx, %rcx
imulq %rax, %rax
addq %rcx, %rax
addq $24, %rsp
retq
In addition to the obvious call to Vec2::debugPrint() const, we have some other new instructions and registers! %rsp is special: it is the “stack pointer”, used to maintain the function call stack. It points to the bottom of the stack, which grows “down” (toward lower addresses) on x86. So, our subq $24, %rsp instruction is making space for three 64-bit integers on the stack. (In general, setting up the stack and registers at the start of your function is called the function prologue.) Then, the following two mov instructions store the first and second arguments to normSquared, which are v.x and v.y (more about how parameter passing works in the next blog post!) to the stack, effectively creating a copy of v in memory at the address %rsp + 8. Next, we load the address of our copy of v into %rdi with leaq 8(%rsp), %rdi and then call Vec2::debugPrint() const.
After debugPrint has returned, we load v.x and v.y back into %rcx and %rax. We have the same imulq and addq instructions as before. Finally, we addq $24, %rsp to clean up the 24 bytes4 of stack space we allocated at the start of our function (called the function epilogue), and then return to our caller with retq.
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings.5 We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>
void copyUppercase(char *dest, const char *src);
constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;
void printUpperCase(const char *s) {
auto sSize = strlen(s);
if (sSize <= MAX_STACK_ARRAY_SIZE) {
char temp[sSize + 1];
copyUppercase(temp, s);
puts(temp);
} else {
// std::make_unique_for_overwrite is missing on Godbolt.
std::unique_ptr<char[]> temp(new char[sSize + 1]);
copyUppercase(temp.get(), s);
puts(temp.get());
}
}
Here is the generated assembly:6
printUpperCase(char const*): # @printUpperCase(char const*)
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
callq strlen
leaq 1(%rax), %rdi
cmpq $1024, %rax # imm = 0x400
ja .LBB0_2
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.LBB0_2:
callq operator new[](unsigned long)
movq %rax, %rbx
movq %rax, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
movq %rdi, %r14
The pushq %rbp; movq %rsp, %rbp sequence is very common: it pushes the frame pointer stored in %rbp to the stack and saves the old stack pointer (which is the new frame pointer) in %rbp. The following four pushq instructions store registers that we need to save before using.7
On to the function body. We save our first argument (%rdi) in %r14, because we are about to call strlen and it, like any other function, may overwrite %rdi (we say %rdi is “caller-saved”) but not %r14 (we say %r14 is “callee-saved”). We call strlen(s) with callq strlen and store sSize + 1 in %rdi with leaq 1(%rax), %rdi.
Next, we finally see our first if statement! cmpq $1024, %rax sets the flags register according to the result of %rax - $1024, and then ja .LBB0_2 (“jump if above”) transfers control to the location labeled .LBB0_2 if the flags indicate that %rax > 1024. In general, higher-level control-flow primitives like if/else statements and loops are implemented in assembly using conditional jump instructions.
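As a minimal sketch of that lowering (my own example, mirroring the size check in printUpperCase):

```cpp
#include <cstddef>

// A size check like the one in printUpperCase. Compilers typically
// lower this to a cmpq against 1024 followed by a conditional jump
// (ja / jbe) to one arm or the other -- there is no structured
// if/else at the machine level, only compare-and-branch.
bool fitsOnStack(size_t n) {
    if (n <= 1024) {
        return true;   // fall-through path when the branch is not taken
    }
    return false;      // reached via the conditional jump
}
```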
Let’s first look at the path where %rax <= 1024 and thus the branch to .LBB0_2 was not taken. We have a blob of instructions to create char temp[sSize + 1] on the stack:
movq %rsp, %r15
movq %rsp, %rbx
addq $15, %rdi
andq $-16, %rdi
subq %rdi, %rbx
movq %rbx, %rsp
We save %rsp to %r15 and %rbx for later use.8 Then, we add 15 to %rdi (which, remember, contains the size of our array), mask off the lower 4 bits with andq $-16, %rdi, and subtract the result from %rbx, which we then put back into %rsp. In short, this rounds the array size up to the next multiple of 16 bytes and makes space for it on the stack.
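The add-15-then-mask trick is worth seeing in C++ form (my own restatement of the assembly above):

```cpp
#include <cstddef>

// (n + 15) & -16 rounds n up to a multiple of 16: -16 in two's
// complement is ...11110000, so the andq clears the low 4 bits,
// and adding 15 first ensures we round up rather than down.
size_t roundUpTo16(size_t n) {
    return (n + 15) & ~static_cast<size_t>(15);
}
```

The same pattern with a different mask works for any power-of-two alignment, which is why it shows up constantly in compiler-generated stack code.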
The following block simply calls copyUppercase and puts as written in the code:
movq %rbx, %rdi
movq %r14, %rsi
callq copyUppercase(char*, char const*)
movq %rbx, %rdi
callq puts
Finally, we have our function epilogue:
movq %r15, %rsp
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
We restore the stack pointer to deallocate our variable-length array using leaq. Then, we popq the registers we saved during the function prologue and return control to our caller, and we are done.
Next, let’s look at the path when %rax > 1024 and we branch to .LBB0_2. This path is more straightforward. We call operator new[], save the result (returned in %rax) to %rbx, and call copyUppercase and puts as before. We have a separate function epilogue for this case, and it looks a bit different:
movq %rbx, %rdi
leaq -24(%rbp), %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp operator delete[](void*) # TAILCALL
The first mov sets up %rdi with a pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer and pop our saved registers. Finally, we have a new instruction: jmp operator delete[](void*). jmp is just like goto: it transfers control to the given label or function. Unlike callq, it does not push the return address onto the stack. So, when operator delete[] returns, it will instead transfer control to printUpperCase’s caller. In essence, we’ve combined a callq to operator delete[] with our own retq. This is called tail call optimization, hence the # TAILCALL comment helpfully emitted by the compiler.
I said in the introduction that reading assembly makes implicit copy and destroy operations abundantly clear. We saw some of that in our previous example, but I want to close by looking at a common C++ move semantics debate. Is it OK to take parameters by value in order to avoid having one overload for lvalue references and another overload for rvalue references? There is a school of thought that says “yes, because in the lvalue case you will make a copy anyway, and in the rvalue case it’s fine as long as your type is cheap to move”. If we take a look at an example for the rvalue case, we will see that “cheap to move” does not mean “free to move”, as much as we might prefer otherwise. If we want maximum performance, we can demonstrate that the overload solution will get us there and the by-value solution will not. (Of course, if we aren’t willing to write extra code to improve performance, then “cheap to move” is probably cheap enough.)
#include <string>
class MyString {
std::string str;
public:
explicit MyString(const std::string& s);
explicit MyString(std::string&& s);
};
class MyOtherString {
std::string str;
public:
explicit MyOtherString(std::string s);
};
void createRvalue1(std::string&& s) {
MyString s2(std::move(s));
};
void createRvalue2(std::string&& s) {
MyOtherString s2(std::move(s));
};
If we look at the generated assembly9 (which is too long to include even though I’ve intentionally outlined10 the constructors in question), we can see that createRvalue1 does 1 move operation (inside the body of MyString::MyString(std::string&&)) and 1 std::string::~string() call (the operator delete before returning). In contrast, createRvalue2 is much longer: it does a total of 2 move operations (1 inline, into the s parameter for MyOtherString::MyOtherString(std::string s), and 1 in the body of that same constructor) and 2 std::string::~string calls (1 for the aforementioned s parameter and 1 for the MyOtherString::str member). To be fair, moving std::string is cheap and so is destroying a moved-from std::string, but it is not free in terms of either CPU time or code size.
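You can also check the move counts without reading any assembly; here is a rough sketch of my own (requires C++17) using a counting type in place of std::string:

```cpp
#include <utility>

// A stand-in for std::string that counts its move constructions.
// Note the argument below must be an xvalue (a std::move'd lvalue),
// matching the createRvalue examples; a prvalue argument would be
// elided directly into the by-value parameter and hide one move.
struct Counter {
    static inline int moves = 0;
    Counter() = default;
    Counter(Counter&&) noexcept { ++moves; }
};

struct ByRvalueRef {  // like MyString: rvalue-reference overload
    Counter c;
    explicit ByRvalueRef(Counter&& in) : c(std::move(in)) {}
};

struct ByValue {      // like MyOtherString: by-value parameter
    Counter c;
    explicit ByValue(Counter in) : c(std::move(in)) {}
};
```

Constructing ByRvalueRef from a moved-from lvalue performs 1 move; ByValue performs 2 (one into the parameter, one into the member), matching the counts we read out of the assembly.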
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to x86 assembly in particular was in the context of security exploits, or learning to become a “l33t h4x0r”, as the kids used to say. If this strikes you as a more entertaining reason to learn about assembly, great! The classic work in the space is Smashing the Stack for Fun and Profit. Unfortunately, modern security mitigations complicate running the examples in that article on your own, so I recommend finding a more modern practice environment. Microcorruption is an industry-created example, or you could try finding an application security project from a college security course to follow along with (e.g., Project 1 from Berkeley’s CS 161, which seems to be publicly available currently).
Finally, there is always Google and Hacker News. Pat Shaughnessy’s “Learning to Read x86 Assembly Language” from 2016 covers the topic from the perspective of Ruby and Crystal, and there was also a recent (2020) discussion on how to learn x86_64 assembly.
Good luck, and happy hacking!
I use AT&T syntax because it’s the default syntax in Linux tools. If you prefer Intel syntax, the toggle is on Compiler Explorer under “Output”. Compiler Explorer links in this article will show both, with AT&T on the left and Intel on the right. Guides to the differences are short and readily available; briefly, Intel syntax is more explicit about memory references, drops the b/w/l/q suffixes, and puts the destination operand first instead of last. ↩︎
If you actually look at the ARM64 assembly for this example, you’ll see an madd instruction get used instead: madd x0, x0, x0, x8. This is a multiply+add in one instruction: it’s doing x0 = x0 * x0 + x8. ↩︎
These are just the 64-bit registers used by most integer instructions. There are actually a lot more registers that came with floating point and instruction set extensions. ↩︎
You may have noticed that we only used 16 bytes of stack space despite allocating 24. As far as I can tell, the extra 8 bytes are left over from the code to set up and restore the frame pointer, which was optimized out. Clang, gcc, and icc all seem to leave the extra 8 bytes in, and msvc seems to waste 16 bytes instead of 8. If we build with -fno-omit-frame-pointer, we can see that the other 8 bytes are used to pushq %rbp at the start of the function and later popq %rbp at the end. Compilers aren’t perfect; you will see this sort of small missed optimization from time to time if you read assembly a lot. Sometimes things really are missed optimization opportunities, but there are also lots of unfortunate ABI constraints that force suboptimal code generation for reasons of compatibility between pieces of code built with different compilers (or even different versions of the same compiler). UPDATE: The extra 8 bytes of stack space are because Section 3.2.2 of the System V x86_64 ABI requires that the stack frame must be aligned to a 16-byte boundary when calling a function. In other words, every compiler made this “mistake” because it’s required! ↩︎
Also suppose that we don’t have something like absl::FixedArray available. I didn’t want to complicate the example any further. ↩︎
I built with -fno-exceptions to simplify the example by removing the exception cleanup path. It appears right after a tail call, which I think might be confusing. ↩︎
Another possible missed optimization: I don’t see a need to pushq %rax here; it’s not callee-saved and we don’t care about the value on entry to printUpperCase. Get in touch if you know whether this is a missed optimization or there’s actually a reason to do it! UPDATE: this is likely because pushing a register is smaller and/or faster than issuing a sub 8, %rsp instruction. ↩︎
Yet again, I think that this movq %rsp, %r15 is not needed. %r15 is not used again until we movq %r15, %rsp, but that instruction is immediately followed by leaq -24(%rbp), %rsp, which overwrites %rsp immediately. I think that we could improve the code by removing the two movq %rsp, %r15 and movq %r15, %rsp instructions. On the other hand, Intel’s icc compiler also does seemingly silly things to restore %rsp given this code, so either there is a good reason to do them, or cleaning up stack pointer manipulations in the presence of variable-length arrays is just a hard or neglected problem in compilers. Again, feel free to reach out if you know which it is! ↩︎
Again, I built with -fno-exceptions to avoid complicating things with exception cleanup paths. ↩︎
If we inline the constructors for MyString and MyOtherString, we do get some savings in createRvalue2: we call operator delete at most once. However, we still do 2 move operations and we require 32 extra bytes of stack space. ↩︎