Scott Wolchok
ARM64 is a computer architecture that competes with the popular Intel x86-64 architecture used for the CPUs in desktops, laptops, and so on. ARM64 is common in mobile phones[1], as well as Graviton-based Amazon EC2 instances, the Raspberry Pi 3 and 4, and the much ballyhooed Apple M1 chips, so knowing about it might be useful! In fact, I have almost certainly spent more time with ARM64 than x86-64 because of the iPhone.
This post is an alternate version of my previous post on How to Read Assembly Language. It walks through the same examples, showing ARM64 assembly instead. Background content like explanations of instructions and registers is also rehashed for your reading convenience.
The basic unit of assembly language is the instruction. Each machine instruction is a small operation, like adding two numbers, loading some data from memory, jumping to another memory location (like the dreaded goto statement), or calling or returning from a function. Unlike x86-64, each ARM64 instruction is exactly 4 bytes long, so you can tell how much memory a piece of ARM64 code takes up just by counting instructions.
Our first toy example will get us acquainted with simple instructions. It just calculates the square of the norm of a 2D vector:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v) {
    return v.x * v.x + v.y * v.y;
}
and here is the resulting ARM64 assembly from clang 11:
mul x8, x1, x1
madd x0, x0, x0, x8
ret
The first instruction, mul x8, x1, x1, performs multiplication. Unlike the x86-64 assembly syntax we used previously, the destination operand is on the left. This mul instruction squares the contents of x1 and stores the result into x8.
Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.
Finally, ret returns from normSquared.
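To make the data flow concrete, here is a rough C++ sketch that mirrors the three instructions one at a time. The function name normSquaredStepwise is made up for illustration, and each local variable simply stands in for the register of the same name (x0 holds v.x and x1 holds v.y on entry).

```cpp
#include <cstdint>

// Hypothetical helper: each variable stands in for one ARM64 register.
// On entry, x0 holds v.x and x1 holds v.y.
constexpr int64_t normSquaredStepwise(int64_t x0, int64_t x1) {
    int64_t x8 = x1 * x1;  // mul  x8, x1, x1      -> x8 = v.y * v.y
    x0 = x0 * x0 + x8;     // madd x0, x0, x0, x8  -> x0 = v.x * v.x + x8
    return x0;             // ret (the result is handed back in x0)
}
```

For example, normSquaredStepwise(3, 4) evaluates to 25, the same value normSquared would return for a Vec2 with x = 3 and y = 4.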
Let’s take a brief detour to explain what the registers we saw in our example are. Registers are the “variables” of assembly language. Unlike variables in your favorite programming language (probably), there are a finite number of them, they have standardized names, and the ones we’ll be talking about are at most 64 bits in size. ARM64 has 31 general-purpose registers named x0 through x30. To refer to their lower 32 bits instead of the full 64 bits, we can write w0 through w30. There is also a dedicated sp (stack pointer) register. Full documentation for core register names is on ARM’s website.
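As a small illustration of how the w names relate to the x names, here is a C++ sketch that models a register as a 64-bit integer. The helpers readW and writeW are invented for this example. One detail not covered above: on ARM64, writing to a w register clears the upper 32 bits of the corresponding x register.

```cpp
#include <cstdint>

// Hypothetical model: an x register is a 64-bit value, and the matching
// w register is a view of its lower 32 bits.

// Reading w0 yields the low 32 bits of x0.
constexpr uint32_t readW(uint64_t x) {
    return static_cast<uint32_t>(x);
}

// Writing w0 replaces the whole register: the upper 32 bits of x0
// become zero, so the old contents are ignored entirely.
constexpr uint64_t writeW(uint64_t /*oldX*/, uint32_t value) {
    return static_cast<uint64_t>(value);
}
```

So if x0 holds 0x1122334455667788, reading w0 gives 0x55667788, and writing 1 to w0 leaves x0 equal to 1.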
Now, let’s extend our example to debug print the Vec2 in normSquared:
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
    void debugPrint() const;
};

int64_t normSquared(Vec2 v) {
    v.debugPrint();
    return v.x * v.x + v.y * v.y;
}
and, again, let’s see the generated assembly:
sub sp, sp, #32
stp x29, x30, [sp, #16]
add x29, sp, #16
stp x0, x1, [sp]
mov x0, sp
bl Vec2::debugPrint() const
ldp x8, x9, [sp]
ldp x29, x30, [sp, #16]
mul x8, x8, x8
madd x0, x9, x9, x8
add sp, sp, #32
ret
We start off with a new register: sp. Like %rsp on x86-64, it is the “stack pointer”, used to maintain the function call stack. It points to the bottom of the stack, which grows “down” (toward lower addresses) on ARM64. So, our sub sp, sp, #32 instruction is making space for four 64-bit integers on the stack by SUBtracting from the stack pointer. Next, stp x29, x30, [sp, #16] is SToring a Pair of registers: it is saving the old frame pointer (x29) and link register (x30; it contains the return address, as we’ll see below) on the stack starting at the address sp + 16. (The square brackets denote a memory access.) We calculate the new frame pointer with add x29, sp, #16; it is required to point to the previously-saved frame pointer and link register. This concludes the 3-instruction function prologue.
Then, the following stp x0, x1, [sp] instruction stores the first and second arguments to normSquared, which are v.x and v.y, to the stack, effectively creating a copy of v in memory at the address in sp. Next, we put a pointer to that copy of v in x0 with mov x0, sp and call Vec2::debugPrint() const with bl. bl is a mnemonic for “branch with link”, and it works slightly differently from the x86-64 call instruction: rather than pushing the return address onto the stack, it saves it in register x30, also known as the link register or lr.
After debugPrint has returned, we LoaD the Pair of registers x8 and x9 with v.x and v.y from the stack. We also restore the old values of the frame pointer and link register. Then, we have the same mul and madd instructions as in the previous example. Finally, we add sp, sp, #32 to clean up the 32 bytes of stack space we allocated at the start of our function (called the function epilogue; I would include the load of the old frame pointer and link register even though it happened to come before the mul & madd) and then return to our caller with ret.
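It can help to picture the 32 bytes of stack space as a struct. The sketch below is only a mental model: NormSquaredFrame and its field names are invented here, and the offsets are read straight off the stp instructions in the listing.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical picture of the 32 bytes reserved by `sub sp, sp, #32`.
// Offsets are relative to sp after the subtraction.
struct NormSquaredFrame {
    int64_t vX;        // [sp, #0]   x0, stored by stp x0, x1, [sp]
    int64_t vY;        // [sp, #8]   x1
    uint64_t savedFp;  // [sp, #16]  old x29, stored by stp x29, x30, [sp, #16]
    uint64_t savedLr;  // [sp, #24]  old x30 (the return address)
};

static_assert(sizeof(NormSquaredFrame) == 32, "four 64-bit slots");
static_assert(offsetof(NormSquaredFrame, savedFp) == 16,
              "the new frame pointer is sp + 16");
```

Seen this way, mov x0, sp passes a pointer to the vX/vY pair, which is exactly the copy of v that debugPrint receives as its this pointer.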
Now, let’s look at a different example. Suppose that we want to print an uppercased C string and we’d like to avoid heap allocations for smallish strings.[2] We might write something like the following:
#include <cstdio>
#include <cstring>
#include <memory>

void copyUppercase(char *dest, const char *src);

constexpr size_t MAX_STACK_ARRAY_SIZE = 1024;

void printUpperCase(const char *s) {
    auto sSize = strlen(s);
    if (sSize <= MAX_STACK_ARRAY_SIZE) {
        char temp[sSize + 1];
        copyUppercase(temp, s);
        puts(temp);
    } else {
        // std::make_unique_for_overwrite is missing on Godbolt.
        std::unique_ptr<char[]> temp(new char[sSize + 1]);
        copyUppercase(temp.get(), s);
        puts(temp.get());
    }
}
Here is the generated assembly:[3]
stp x29, x30, [sp, #-48]! // 16-byte Folded Spill
str x21, [sp, #16] // 8-byte Folded Spill
stp x20, x19, [sp, #32] // 16-byte Folded Spill
mov x29, sp
mov x19, x0
bl strlen
cmp x0, #1024 // =1024
add x0, x0, #1 // =1
b.hi .LBB0_2
add x9, x0, #15 // =15
mov x8, sp
and x9, x9, #0xfffffffffffffff0
sub x20, x8, x9
mov x21, sp
mov sp, x20
mov x0, x20
mov x1, x19
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
mov sp, x21
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
ret
.LBB0_2:
bl operator new[](unsigned long)
mov x1, x19
mov x20, x0
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
mov x0, x20
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
b operator delete[](void*)
Our function prologue has gotten a lot longer, and we have some new control flow instructions as well. Let’s take a closer look at the prologue:
stp x29, x30, [sp, #-48]! // 16-byte Folded Spill
str x21, [sp, #16] // 8-byte Folded Spill
stp x20, x19, [sp, #32] // 16-byte Folded Spill
mov x29, sp
As we saw before, we are saving the old frame pointer and link register to the stack. However, we are doing it using a more complicated store instruction: stp x29, x30, [sp, #-48]! does two things. First, it stores x29 and x30 to the address sp - 48. Second, it updates the stack pointer with that same sp - 48 value (that’s what the exclamation point is for; it’s the “pre-index addressing mode” described in ARM’s documentation).
Next, we save x21, x20, and x19 to the stack; we will use them later and we are required to preserve their current values (in other words, they are “callee-saved” registers). Finally, we set up the new frame pointer in x29.
(By the way, the term “spill” in the compiler-generated comments just means that we are saving registers to the stack.)
On to the function body:
mov x19, x0
bl strlen
cmp x0, #1024 // =1024
add x0, x0, #1 // =1
b.hi .LBB0_2
We save our argument, s (stored in x0), in x19 and call strlen with bl, as we saw before. When strlen returns, we CoMPare its result against 1024 as the first step in our if statement. This sets the NZCV register according to the result of the comparison, and then b.hi .LBB0_2 Branches to .LBB0_2 if it turns out that x0 was in fact more than 1024. Because both branches of our if statement care about sSize + 1 and not sSize, we add 1 to x0 (which stores sSize) before the branch; this add does not touch the flags, so the branch still acts on the result of the cmp. In general, higher-level control-flow primitives like if/else statements and loops are implemented in assembly using conditional jump instructions.
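Here is a rough C++ model of that cmp / add / b.hi sequence. The names Path, BranchResult, and compareAndBranch are made up for this sketch. It relies on two facts about the instructions: hi is the unsigned “higher” condition, and a plain add leaves the flags set by cmp alone.

```cpp
#include <cstdint>

// Hypothetical labels for the two sides of the if statement.
enum class Path { StackArray, HeapArray };

struct BranchResult {
    Path path;    // which side of the branch we end up on
    uint64_t x0;  // sSize + 1, shared by both sides
};

constexpr BranchResult compareAndBranch(uint64_t x0) {
    const bool higher = x0 > 1024;  // cmp x0, #1024 (unsigned compare, sets flags)
    x0 = x0 + 1;                    // add x0, x0, #1 (flags are left untouched)
    // b.hi .LBB0_2: taken only when the compare found x0 > 1024
    return {higher ? Path::HeapArray : Path::StackArray, x0};
}
```

A length of exactly 1024 stays on the stack path, 1025 goes to the heap path, and in both cases x0 already holds sSize + 1 when the chosen path starts.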
Let’s first look at the path where x0 <= 1024 and thus the branch to .LBB0_2 was not taken. We have a blob of instructions to create char temp[sSize + 1] on the stack:
add x9, x0, #15 // =15
mov x8, sp
and x9, x9, #0xfffffffffffffff0
sub x20, x8, x9
mov x21, sp
mov sp, x20
We add 15 to x0 and put the result in x9. Then, we mask off the lower 4 bits of x9. Together, these two operations put the target array size rounded up to the next multiple of 16 into x9. Then, we subtract the array size from the stack pointer, save the old stack pointer value into x21,[4] and set the new stack pointer value.
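The add-then-mask trick is easier to see in C++. This is a minimal sketch; roundUpTo16 and allocateOnStack are names invented here, and the stack pointer is modeled as a plain integer.

```cpp
#include <cstdint>

// add x9, x0, #15  followed by  and x9, x9, #0xfffffffffffffff0:
// adding 15 pushes any non-multiple of 16 past the next multiple,
// and clearing the low 4 bits then snaps back down onto it.
constexpr uint64_t roundUpTo16(uint64_t size) {
    return (size + 15) & 0xfffffffffffffff0ULL;
}

// sub x20, x8, x9: the array starts at the old sp minus the rounded size.
// The returned value is what `mov sp, x20` installs as the new sp.
constexpr uint64_t allocateOnStack(uint64_t sp, uint64_t size) {
    return sp - roundUpTo16(size);
}
```

For instance, a 17-byte array rounds up to 32 bytes, while a 16-byte array stays at 16. With a made-up starting sp of 0x1000, allocating 17 bytes moves sp to 0xfe0.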
The following block simply calls copyUppercase and puts as written in the code:
mov x0, x20
mov x1, x19
bl copyUppercase(char*, char const*)
mov x0, x20
bl puts
Finally, we have our function epilogue:
mov sp, x21
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
ret
We restore the stack pointer using the value of the frame pointer. Then, we load the registers we previously saved to the stack. Here we see a new “post-index” addressing mode: ldp x29, x30, [sp], #48 means to load x29 and x30 from the current value of the stack pointer, and then add 48 to it afterwards. Finally, we return control to our caller, and we are done.
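To compare the pre-index store from the prologue with the post-index load here, we can model just the address arithmetic in C++. This is a sketch under the assumption that sp is a plain integer; Access, preIndex, and postIndex are hypothetical names.

```cpp
#include <cstdint>

struct Access {
    uint64_t address;  // where the register pair is stored or loaded
    uint64_t newSp;    // value of sp once the instruction has finished
};

// stp x29, x30, [sp, #-48]!  (pre-index: adjust sp first, then access there)
constexpr Access preIndex(uint64_t sp, int64_t offset) {
    return {sp + static_cast<uint64_t>(offset), sp + static_cast<uint64_t>(offset)};
}

// ldp x29, x30, [sp], #48  (post-index: access at the current sp, then adjust)
constexpr Access postIndex(uint64_t sp, int64_t offset) {
    return {sp, sp + static_cast<uint64_t>(offset)};
}
```

Starting from a made-up sp of 0x1000, preIndex(0x1000, -48) stores at 0xfd0 and leaves sp at 0xfd0; postIndex(0xfd0, 48) then loads from 0xfd0 and puts sp back at 0x1000, so the two instructions undo each other.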
Next, let’s take a look at the path when x0 > 1024 and we branch to .LBB0_2 to allocate our array on the heap. This path is more straightforward. We call operator new[], save the result (returned in x0) into x20, and call copyUppercase and puts as before. We have a separate function epilogue for this case, and it looks a bit different:
mov x0, x20
mov sp, x29
ldp x20, x19, [sp, #32] // 16-byte Folded Reload
ldr x21, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #48 // 16-byte Folded Reload
b operator delete[](void*)
The first mov sets up x0 with a pointer to our heap-allocated array that we saved earlier. As with the other function epilogue, we then restore the stack pointer, load our saved registers, and add 48 bytes back to the stack pointer. Finally, we have a new instruction: b operator delete[](void*). b (for “branch”) is just like goto: it transfers control to the given label or function. Unlike bl, it does not save the return address for a future ret. So, when operator delete[] returns, it will instead transfer control to printUpperCase’s caller. In essence, we’ve combined a bl to operator delete[] with our own ret. This is called tail call optimization.
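One way to convince yourself that the tail call lands in the right place is to track only the program counter and the link register. The toy model below is an illustration, not how a real CPU is structured: the Cpu struct, the helper names, and all of the addresses are invented. It does use the fact that every ARM64 instruction is 4 bytes, so the return address saved by bl is the address of the bl plus 4.

```cpp
#include <cstdint>

// Toy model: just the program counter and the link register (x30).
struct Cpu {
    uint64_t pc;
    uint64_t lr;
};

// bl target: remember the address of the next instruction in lr, then jump.
constexpr Cpu branchWithLink(Cpu cpu, uint64_t target) {
    return {target, cpu.pc + 4};
}

// b target: jump without touching lr.
constexpr Cpu branch(Cpu cpu, uint64_t target) {
    return {target, cpu.lr};
}

// ret: jump to whatever address lr holds.
constexpr Cpu returnToLr(Cpu cpu) {
    return {cpu.lr, cpu.lr};
}

// Hypothetical addresses: the caller's bl sits at 0x1000, printUpperCase
// starts at 0x2000, and operator delete[] starts at 0x3000.
constexpr uint64_t tailCallReturnAddress() {
    Cpu cpu{0x1000, 0};
    cpu = branchWithLink(cpu, 0x2000);  // caller: bl printUpperCase
    // ... the body of printUpperCase saves x30 and restores it in its epilogue ...
    cpu = branch(cpu, 0x3000);          // b operator delete[](void*)
    cpu = returnToLr(cpu);              // operator delete[] executes ret
    return cpu.pc;
}
```

tailCallReturnAddress() comes out as 0x1004, the instruction right after the caller’s bl, even though printUpperCase never executed a ret of its own.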
Assembly language dates back to the late 1940s, so there are plenty of resources for learning about it. Personally, my first introduction to assembly language was in the EECS 370: Introduction to Computer Organization junior-level course at my alma mater, the University of Michigan. Unfortunately, most of the course materials linked on that website are not public. Here are what appear to be the corresponding “how computers really work” courses at Berkeley (CS 61C), Carnegie Mellon (15-213), Stanford (CS107), and MIT (6.004). (Please let me know if I’ve suggested the wrong course for any of these schools!) Nand to Tetris also appears to cover similar material, and the projects and book chapters are freely available.
My first practical exposure to ARM64 assembly in particular was through iPhone development. I already knew the general way assembly works from previous exposure in college, so I got started by just googling “ARM64 ldp instruction” (or whatever other instruction) each time and reading what it did. Over time, I remembered what I had learned and didn’t have to Google again.
If you would like a more technical walkthrough of ARM64 assembly language, there is also a “learn the architecture” guide on ARM’s website. It may help you to know that the official name for the architecture is actually AArch64, but “ARM64” seems to be much more common.
[1] Specifically, iPhones since the iPhone 5S have used ARM64, and apparently a huge majority of Android phones do too.
[2] Also suppose that we don’t have something like absl::FixedArray available. I didn’t want to complicate the example any further.
[3] I built with -fno-exceptions to simplify the example by removing the exception cleanup path. It appears right after a tail call, which I think might be confusing.
[4] Just like we saw in the x86-64 version of this article, I think that this mov x21, sp is not needed. x21 is not used again until we mov sp, x21, but that instruction is immediately followed by mov sp, x29, which overwrites sp. I think that we could improve the code by removing the move to and from x21.