Scott Wolchok

How to Read ARM64 Assembly Language

Posted at — Mar 14, 2021

ARM64 is a computer architecture that competes with the popular Intel x86-64 architecture used for the CPUs in desktops, laptops, and so on. ARM64 is common in mobile phones[1], as well as Graviton-based Amazon EC2 instances, the Raspberry Pi 3 and 4, and the much-ballyhooed Apple M1 chips, so knowing about it might be useful! In fact, I have almost certainly spent more time with ARM64 than x86-64 because of the iPhone.

This post is an alternate version of my previous post on How to Read Assembly Language. It walks through the same examples, showing ARM64 assembly instead. Background content like explanations of instructions and registers is also rehashed for your reading convenience.

Instructions

The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. Unlike x86-64, each ARM64 instruction is exactly 4 bytes long, so you can tell how much memory a piece of ARM64 code takes up just by counting instructions.

Example 1: Vector norm

Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v) {
    return v.x * v.x + v.y * v.y;
}

and here is the resulting ARM64 assembly from clang 11:

        mul     x8, x1, x1
        madd    x0, x0, x0, x8
        ret

The first instruction, mul x8, x1, x1, performs multiplication. Unlike the x86-64 assembly syntax we used previously, the destination operand is on the left. This mul instruction squares the contents of x1 and stores the result into x8.

Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.

Finally, ret returns from normSquared.
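If it helps, here is a rough C++ sketch of what those three instructions do, with one local variable standing in for each register. The name normSquaredModel is made up for illustration; it assumes v.x arrives in x0 and v.y in x1, and that the result is handed back in x0.

```cpp
#include <cstdint>

// Hypothetical model: each variable plays the role of one ARM64 register.
int64_t normSquaredModel(int64_t x0, int64_t x1) {
    int64_t x8 = x1 * x1;  // mul  x8, x1, x1
    x0 = x0 * x0 + x8;     // madd x0, x0, x0, x8
    return x0;             // ret (the caller reads the result from x0)
}
```

For example, normSquaredModel(3, 4) walks through 4 * 4 = 16 and then 3 * 3 + 16 = 25.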

Registers

Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike variables in your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. ARM64 has 31 general-purpose registers named x0 through x30. To refer to their lower 32 bits instead of the full 64 bits, we can write w0 through w30. There is also a dedicated sp (stack pointer) register. Full documentation for core register names is on ARM’s website.
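One way to picture the relationship between an x register and its w name is as a 64-bit integer and a 32-bit view of it. The sketch below is only a model; readW and writeW are hypothetical helper names. It also relies on a detail not covered above: writing a w register clears the upper 32 bits of the corresponding x register.

```cpp
#include <cstdint>

// Reading wN: keep only the low 32 bits of xN.
constexpr uint32_t readW(uint64_t x) {
    return static_cast<uint32_t>(x);
}

// Writing wN: the new value of xN is the 32-bit value with the upper half zeroed.
constexpr uint64_t writeW(uint32_t w) {
    return static_cast<uint64_t>(w);
}
```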

Example 2: The stack

Now, let’s extend our example to debug print the Vec2 in normSquared:

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
    v.debugPrint();
    return v.x * v.x + v.y * v.y;
}

and, again, let’s see the generated assembly:

        sub     sp, sp, #32
        stp     x29, x30, [sp, #16]
        add     x29, sp, #16
        stp     x0, x1, [sp]
        mov     x0, sp
        bl      Vec2::debugPrint() const
        ldp     x8, x9, [sp]
        ldp     x29, x30, [sp, #16]
        mul     x8, x8, x8
        madd    x0, x9, x9, x8
        add     sp, sp, #32
        ret

We start off with a new register: sp. Like %rsp on x86-64, it is the “stack pointer”, used to maintain the function call stack. It points to the bottom of the stack, which grows “down” (toward lower addresses) on ARM64. So, our sub sp, sp, #32 instruction is making space for four 64-bit integers on the stack by SUBtracting from the stack pointer. Next, stp x29, x30, [sp, #16] is SToring a Pair of registers: it is saving the old frame pointer (x29) and link register (x30, which contains the return address, as we’ll see below) on the stack starting at the address sp + 16. (The square brackets denote a memory access.) We calculate the new frame pointer with add x29, sp, #16; it is required to point to the previously-saved frame pointer and link register. This concludes the 3-instruction function prologue.
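To make the layout concrete, here is a small simulation of that prologue. It is a sketch under some assumptions: Machine and prologue are hypothetical names, the “stack” is just a 256-byte buffer, and addresses are byte offsets into that buffer rather than real pointers.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy machine: a byte-addressed stack plus the registers we need.
struct Machine {
    std::vector<uint8_t> stack = std::vector<uint8_t>(256);
    uint64_t sp = 256;  // byte offset into `stack`; the stack grows toward 0
    uint64_t x29 = 0;   // frame pointer
    uint64_t x30 = 0;   // link register

    void store64(uint64_t addr, uint64_t value) {
        std::memcpy(&stack[addr], &value, sizeof value);
    }
    uint64_t load64(uint64_t addr) const {
        uint64_t value;
        std::memcpy(&value, &stack[addr], sizeof value);
        return value;
    }
};

void prologue(Machine& m) {
    m.sp -= 32;                   // sub sp, sp, #32
    m.store64(m.sp + 16, m.x29);  // stp x29, x30, [sp, #16]
    m.store64(m.sp + 24, m.x30);  //   (second register goes 8 bytes higher)
    m.x29 = m.sp + 16;            // add x29, sp, #16
}
```

After prologue runs, the new x29 holds the address where the old x29 was saved, and the old x30 sits 8 bytes above it. The 16 bytes starting at sp are still free, which is where the copy of v goes next.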

Then, the following stp x0, x1, [sp] instruction stores the first and second arguments to normSquared, which are v.x and v.y, to the stack, effectively creating a copy of v in memory at the address in sp. Next, we put a pointer to that copy of v in x0 with mov x0, sp and call Vec2::debugPrint() const with bl. bl is a mnemonic for “branch with link”, and it works slightly differently from the x86-64 call instruction: rather than pushing the return address onto the stack, it saves it in register x30, also known as the link register or lr.
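The difference between bl and the x86-64 call instruction is easy to model. In this sketch, Cpu, bl, and ret are hypothetical names for a toy machine with just a program counter and a link register; nothing is pushed to the stack.

```cpp
#include <cstdint>

// Hypothetical toy CPU state.
struct Cpu {
    uint64_t pc = 0;   // address of the instruction being executed
    uint64_t x30 = 0;  // link register (lr)
};

// bl target: remember where to come back to, then jump.
void bl(Cpu& cpu, uint64_t target) {
    cpu.x30 = cpu.pc + 4;  // next instruction; every ARM64 instruction is 4 bytes
    cpu.pc = target;
}

// ret: jump to whatever address is in the link register.
void ret(Cpu& cpu) {
    cpu.pc = cpu.x30;
}
```

This also shows why normSquared had to save x30 in its prologue: its own bl to debugPrint overwrites x30, so the original return address would otherwise be lost.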

After debugPrint has returned, we LoaD the Pair of registers x8 and x9 with v.x and v.y from the stack. We also restore the old values of the frame pointer and link register. Then, we have the same mul and madd instructions as in the previous example. Finally, we add sp, sp, #32 to clean up the 32 bytes of stack space we allocated at the start of our function and then return to our caller with ret. (This cleanup is called the function epilogue; I would include the load of the old frame pointer and link register in it even though it happened to come before the mul & madd.)

Example 3: Control flow

Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings.[2] We might write something like the following:

#include <cstdio>
#include <cstring>
#include <memory>

void copyUppercase(char *dest, const char *src);

constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;

void printUpperCase(const char *s) {
    auto sSize = strlen(s);
    if (sSize <= MAX_STACK_ARRAY_SIZE) {
        char temp[sSize + 1];
        copyUppercase(temp, s);
        puts(temp);
    } else {
        // std::make_unique_for_overwrite is missing on Godbolt.
        std::unique_ptr<char[]> temp(new char[sSize + 1]);
        copyUppercase(temp.get(), s);
        puts(temp.get());
    }
}

Here is the generated assembly:[3]

        stp     x29, x30, [sp, #-48]!           // 16-byte Folded Spill
        str     x21, [sp, #16]                  // 8-byte Folded Spill
        stp     x20, x19, [sp, #32]             // 16-byte Folded Spill
        mov     x29, sp
        mov     x19, x0
        bl      strlen
        cmp     x0, #1024                       // =1024
        add     x0, x0, #1                      // =1
        b.hi    .LBB0_2
        add     x9, x0, #15                     // =15
        mov     x8, sp
        and     x9, x9, #0xfffffffffffffff0
        sub     x20, x8, x9
        mov     x21, sp
        mov     sp, x20
        mov     x0, x20
        mov     x1, x19
        bl      copyUppercase(char*, char const*)
        mov     x0, x20
        bl      puts
        mov     sp, x21
        mov     sp, x29
        ldp     x20, x19, [sp, #32]             // 16-byte Folded Reload
        ldr     x21, [sp, #16]                  // 8-byte Folded Reload
        ldp     x29, x30, [sp], #48             // 16-byte Folded Reload
        ret
.LBB0_2:
        bl      operator new[](unsigned long)
        mov     x1, x19
        mov     x20, x0
        bl      copyUppercase(char*, char const*)
        mov     x0, x20
        bl      puts
        mov     x0, x20
        mov     sp, x29
        ldp     x20, x19, [sp, #32]             // 16-byte Folded Reload
        ldr     x21, [sp, #16]                  // 8-byte Folded Reload
        ldp     x29, x30, [sp], #48             // 16-byte Folded Reload
        b       operator delete[](void*)

Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:

        stp     x29, x30, [sp, #-48]!           // 16-byte Folded Spill
        str     x21, [sp, #16]                  // 8-byte Folded Spill
        stp     x20, x19, [sp, #32]             // 16-byte Folded Spill
        mov     x29, sp

As we saw before, we are saving the old frame pointer and link register to the stack. However, we are doing it using a more complicated store instruction: stp x29, x30, [sp, #-48]! does two things. First, it stores x29 and x30 to the address sp - 48. Second, it updates the stack pointer with that same sp - 48 value (that’s what the exclamation point is for; it’s the “pre-index addressing mode” described in ARM’s documentation).

Next, we save x21, x20, and x19 to the stack; we will use them later and we are required to preserve their current values (in other words, they are “callee-saved” registers). Finally, we set up the new frame pointer in x29.

(By the way, the term “spill” in the compiler-generated comments just means that we are saving registers to the stack.)
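Going back to that first stp for a moment, the pre-index behavior can be sketched in C++ like this. Machine and stpPreIndex are hypothetical names, and memory is modeled as a simple map from address to 64-bit value rather than real RAM.

```cpp
#include <cstdint>
#include <map>

// Hypothetical toy machine: sparse memory plus a stack pointer.
struct Machine {
    std::map<uint64_t, uint64_t> memory;  // address -> 64-bit value
    uint64_t sp = 0;
};

// stp a, b, [sp, #offset]!
// Pre-index: compute sp + offset, store the pair there, and leave sp updated.
void stpPreIndex(Machine& m, uint64_t a, uint64_t b, int64_t offset) {
    m.sp += static_cast<uint64_t>(offset);  // wraps correctly for negative offsets
    m.memory[m.sp] = a;                     // first register at the new sp
    m.memory[m.sp + 8] = b;                 // second register 8 bytes above it
}
```

With offset = -48, this one instruction both allocates 48 bytes of stack and saves the two registers at the bottom of that space, which is why there is no separate sub sp, sp, #48 in this prologue.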

On to the function body:

        mov     x19, x0
        bl      strlen
        cmp     x0, #1024                       // =1024
        add     x0, x0, #1                      // =1
        b.hi    .LBB0_2

We save our argument, s (stored in x0), in x19 and call strlen with bl, as we saw before. When strlen returns, we CoMPare its result against 1024 as the first step in our if statement. This sets the NZCV register according to the result of the comparison, and then b.hi .LBB0_2 Branches to .LBB0_2 if it turns out that x0 was in fact more than 1024. Because both branches of our if statement care about sSize + 1 and not sSize, we add 1 to x0 (which stores sSize) before the branch. (Plain add, unlike adds, leaves the flags alone, so the branch still sees the result of the cmp.) In general, higher-level control-flow primitives like if/else statements and loops are implemented in assembly using conditional branch instructions.
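For the curious, here is a sketch of how cmp and b.hi fit together. Flags, cmp, and conditionHi are hypothetical names. The model treats cmp as a subtraction whose result is thrown away, and it uses the fact that the “hi” condition means “unsigned higher”, which holds when the carry flag is set and the zero flag is clear.

```cpp
#include <cstdint>

// The four condition flags held in NZCV.
struct Flags {
    bool n;  // negative: top bit of the result is set
    bool z;  // zero: the result is zero
    bool c;  // carry: for a subtraction, set when no borrow occurred
    bool v;  // overflow: signed overflow occurred
};

// cmp a, b: compute a - b, discard the result, keep the flags.
Flags cmp(uint64_t a, uint64_t b) {
    uint64_t diff = a - b;
    Flags f;
    f.n = (diff >> 63) != 0;
    f.z = diff == 0;
    f.c = a >= b;
    // Signed overflow: the operands differ in sign and the result's sign differs from a's.
    f.v = (((a ^ b) & (a ^ diff)) >> 63) != 0;
    return f;
}

// b.hi takes the branch when this returns true.
bool conditionHi(const Flags& f) {
    return f.c && !f.z;
}
```

So conditionHi(cmp(sSize, 1024)) is true exactly when sSize > 1024 as unsigned numbers, matching the else branch of the C++ code.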

Let’s first look at the path where sSize <= 1024 and thus the branch to .LBB0_2 was not taken. We have a blob of instructions to create char temp[sSize + 1] on the stack:

        add     x9, x0, #15                     // =15
        mov     x8, sp
        and     x9, x9, #0xfffffffffffffff0
        sub     x20, x8, x9
        mov     x21, sp
        mov     sp, x20

We add 15 to x0 (which now holds sSize + 1) and put the result in x9. Then, we mask off the lower 4 bits of x9. Together, these two operations put the target array size rounded up to the next multiple of 16 into x9. Then, we subtract the array size from the stack pointer, save the old stack pointer value into x21[4], and set the new stack pointer value.
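The add-then-mask trick is worth seeing in isolation. Here is a sketch, with hypothetical function names roundUpTo16 and newStackPointer, that mirrors the instructions one for one. My understanding is that the rounding is there because ARM64 expects sp to stay 16-byte aligned when it is used to access memory.

```cpp
#include <cstdint>

// Round size up to the next multiple of 16.
constexpr uint64_t roundUpTo16(uint64_t size) {
    uint64_t x9 = size + 15;            // add x9, x0, #15
    return x9 & 0xfffffffffffffff0ULL;  // and x9, x9, #0xfffffffffffffff0
}

// Where the array starts, and what sp becomes: sub x20, x8, x9
constexpr uint64_t newStackPointer(uint64_t sp, uint64_t arraySize) {
    return sp - roundUpTo16(arraySize);
}
```

For a 5-character string, sSize + 1 is 6, which rounds up to 16, so the stack pointer moves down by 16 bytes even though the array only needs 6.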

The following block simply calls copyUppercase and puts as written in the code:

        mov     x0, x20
        mov     x1, x19
        bl      copyUppercase(char*, char const*)
        mov     x0, x20
        bl      puts

Finally, we have our function epilogue:

        mov     sp, x21
        mov     sp, x29
        ldp     x20, x19, [sp, #32]             // 16-byte Folded Reload
        ldr     x21, [sp, #16]                  // 8-byte Folded Reload
        ldp     x29, x30, [sp], #48             // 16-byte Folded Reload
        ret

We restore the stack pointer using the value of the frame pointer. Then, we load the registers we previously saved to the stack. Here we see a new “post-index” addressing mode: ldp x29, x30, [sp], #48 means to load x29 and x30 from the current value of the stack pointer, and then add 48 to it afterwards. Finally, we return control to our caller, and we are done.
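Post-index addressing is the mirror image of the pre-index store in the prologue: the memory access uses the old sp, and the update happens afterwards. A sketch, again with a hypothetical Machine type that models memory as a map from address to 64-bit value:

```cpp
#include <cstdint>
#include <map>

// Hypothetical toy machine: sparse memory plus a stack pointer.
struct Machine {
    std::map<uint64_t, uint64_t> memory;  // address -> 64-bit value
    uint64_t sp = 0;
};

// ldp a, b, [sp], #offset
// Post-index: load the pair from the current sp, then add offset to sp.
void ldpPostIndex(Machine& m, uint64_t& a, uint64_t& b, int64_t offset) {
    a = m.memory.at(m.sp);      // first register from the current sp
    b = m.memory.at(m.sp + 8);  // second register from 8 bytes above it
    m.sp += static_cast<uint64_t>(offset);
}
```

With offset = 48, this restores x29 and x30 and frees the 48 bytes that the prologue allocated, all in one instruction.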

Next, let’s take a look at the path where sSize > 1024 and we branch to .LBB0_2 to allocate our array on the heap. This path is more straightforward. We call operator new[], save the result (returned in x0) into x20, and call copyUppercase and puts as before. We have a separate function epilogue for this case, and it looks a bit different:

        mov     x0, x20
        mov     sp, x29
        ldp     x20, x19, [sp, #32]             // 16-byte Folded Reload
        ldr     x21, [sp, #16]                  // 8-byte Folded Reload
        ldp     x29, x30, [sp], #48             // 16-byte Folded Reload
        b       operator delete[](void*)

The first mov sets up x0 with a pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer, load our saved registers, and add 48 bytes back to the stack pointer. Finally, we have a new instruction: b operator delete[](void*). b (for “branch”) is just like goto: it transfers control to the given label or function. Unlike bl, it does not save the return address for a future ret. So, when operator delete[] returns, it will instead transfer control to printUpperCase’s caller. In essence, we’ve combined a bl to operator delete[] with our own ret. This is called tail call optimization.
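Here is a toy model of why the tail call works. Cpu, bl, b, and ret are hypothetical names, the addresses are invented, and the model skips over the fact that printUpperCase’s own bl instructions clobber x30 in the middle (the epilogue’s ldp puts the caller’s return address back before the b).

```cpp
#include <cstdint>

// Hypothetical toy CPU state.
struct Cpu {
    uint64_t pc = 0;   // address of the instruction being executed
    uint64_t x30 = 0;  // link register
};

// bl: save the return address in x30, then jump.
void bl(Cpu& cpu, uint64_t target) {
    cpu.x30 = cpu.pc + 4;
    cpu.pc = target;
}

// b: just jump; x30 is left untouched.
void b(Cpu& cpu, uint64_t target) {
    cpu.pc = target;
}

// ret: jump to the address in x30.
void ret(Cpu& cpu) {
    cpu.pc = cpu.x30;
}
```

If the caller at address 0x1000 does bl to printUpperCase, x30 becomes 0x1004. When printUpperCase later does b to operator delete[], x30 still holds 0x1004, so the ret inside operator delete[] lands directly back in the original caller.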

Further reading

Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.

My first practical exposure to ARM64 assembly in particular was through iPhone development. I already knew the general way assembly works from previous exposure in college, so I got started by just googling “ARM64 ldp instruction” (or whatever other instruction) each time and reading what it did. Over time, I remembered what I had learned and didn’t have to Google again.

If you would like a more technical walkthrough of ARM64 assembly language, there is also a “learn the architecture” guide on ARM’s website. It may help you to know that the official name for the architecture is actually AArch64, but “ARM64” seems to be much more common.


  1. Specifically, iPhones since the iPhone 5S have used ARM64, and apparently a huge majority of Android phones do too.

  2. Also suppose that we don’t have something like absl::FixedArray available. I didn’t want to complicate the example any further.

  3. I built with -fno-exceptions to simplify the example by removing the exception cleanup path. It appears right after a tail call, which I think might be confusing.

  4. Just like we saw in the x86-64 version of this article, I think that this mov x21, sp is not needed. x21 is not used again until we mov sp, x21, but that instruction is immediately followed by mov sp, x29, which overwrites sp. I think that we could improve the code by removing the move to and from x21.