Learn to Read Disassembly for Fun and Profit

Learning to read disassembly can be daunting for the uninitiated. The output is foreign, barely annotated, and has little meaning without some background. This series isn’t so much about learning assembly from start to finish as it is about learning enough to get by, debug a few things, and possibly optimize a few things. More importantly, it should be enough that you’ll be able to reverse engineer, learn, or otherwise figure out whatever you might come across when reading disassembly. Note that this series of posts will only cover the C family of languages, since there are particular conventions that C-based runtimes follow; even so, the skills should carry over when you look at other languages. This series is geared toward the novice or self-taught programmer, but may serve as a suitable review, or hold something interesting, for those already experienced with the material. The proposed articles are the following:

  1. Part I: Whirlwind Tour
  2. Part II: Virtual Function Dispatch
  3. Part III: JIT Compilers (and how to write your own)
  4. Part IV: Alternative Runtimes (Haskell and Rust)
  5. Part V: Assembly Optimizations

So without further ado…

Part I: Whirlwind Tour

Preliminaries: The empty main function.

Make a file called main.cpp with the following contents:

int main() {
    return 0;
}

Then, run g++ -std=c++11 main.cpp -S -c. The -c flag tells the compiler not to run the linker, and -S tells it to stop before assembling and emit the assembly source it would otherwise feed to the assembler, in this case the GNU assembler (GAS). You are encouraged to try this with Clang or MSVC, but do note that your results will differ. Also, I am running this against g++ 4.9.1 on Intel 64-bit hardware. The result should be a file named main.s that looks like the following:

    .file    "main.cpp"
    .text
    .globl   main
    .type    main, @function
main:
.LFB0:
    .cfi_startproc
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size    main, .-main
    .ident   "GCC: (GNU) 4.9.1 20140903 (prerelease)"
    .section .note.GNU-stack,"",@progbits

In the future, I won’t reproduce all the lines here for brevity, but this gives us a good baseline to start with. All the lines ending in colons (like main: and .LFB0:) are labels; they do not contain instructions themselves, although instructions can refer to them, as we shall see later on. The lines beginning with .cfi are “call frame information” directives with significance to stack unwinding and exception handling. The other lines that start with a dot are assembler directives that annotate how the file should be assembled. For now, if we ignore the assembler directives and look at the code following the label main, we see the important bits.

main:
    pushq    %rbp
    movq     %rsp, %rbp
    movl     $0, %eax
    popq     %rbp
    ret

Any values that look like %rbp or %eax are registers: small, fixed-size storage locations that live on the CPU. Registers whose names start with an r are 64 bits wide, and registers whose names start with an e are 32 bits wide (each e register is the lower half of the corresponding r register). The q suffix on an instruction stands for “quad-word” and indicates a 64-bit operation; the l suffix denotes a 32-bit operation. Here’s the code again with some inline annotation.

main:
    pushq    %rbp       # save %rbp on the stack
    movq     %rsp, %rbp # store the value of %rsp in %rbp
    movl     $0, %eax   # store the value 0 in %eax
    popq     %rbp       # restore %rbp with the value saved on the stack
    ret                 # return from this function

So there are a few mysteries here. What is the point of the juggling done with %rbp and %rsp? What does ret actually do? To answer these, let’s modify our program to have main call a different function.

Function Calling

void foo(int, int) {}

int main() {
    foo(4, 6);
    return 0;
}

If you look at the disassembly now, you should see a new global called _Z3fooii, which should give you some insight into how C++ function names are mangled (the two is correspond to the two integer arguments). The function foo in assembly looks like the following (again, without assembler directives):

_Z3fooii:
.LFB0:
  pushq   %rbp
  movq    %rsp, %rbp
  movl    %edi, -4(%rbp)
  movl    %esi, -8(%rbp)
  popq    %rbp
  ret
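
As an aside, you can poke at the mangling scheme directly. The small sketch below is my own, not part of the running example: compiling it and inspecting the symbols shows the difference mangling makes, and c++filt will turn _Z3fooii back into foo(int, int).

void foo(int, int) {}              // emitted under the mangled name _Z3fooii
extern "C" void foo_c(int, int) {} // extern "C" suppresses C++ mangling: plain foo_c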

In our main function, we have an additional 3 lines to call foo:

movl  $6, %esi
movl  $4, %edi
call  _Z3fooii

As you can see, the same charade with %rbp and %rsp happens again, so we can infer that these registers have special significance when it comes to program flow. In fact, %rbp and %rsp are the base pointer and stack pointer respectively: together they delimit the current stack frame, which can be thought of as the temporary working space of a procedure or function. On most architectures the stack grows downward, so lower memory addresses are “higher up” on the stack. We can see this in the lines:

movl  %edi, -4(%rbp)
movl  %esi, -8(%rbp)

where the parameters passed into foo are put on the stack, 4 and 8 bytes below the base pointer, ready for consumption. They aren’t ever used, however. Let’s change that by modifying foo to look like the following:

int foo(int a, int b) {
    return a + b;
}

Now, the assembly for the function will look like:

movl   %edi, -4(%rbp)
movl   %esi, -8(%rbp)
movl   -4(%rbp), %edx
movl   -8(%rbp), %eax
addl   %edx, %eax

So we move the contents of %edi and %esi onto the stack, move them from the stack into %edx and %eax, then add the contents of %edx into %eax. So our answer is in %eax! This was a fairly long-winded way of getting at the answer, but the important thing here is to get used to the syntax. Also, note that %eax was also used for the return value of main. In the event that the returned value occupies 64 bits (like a pointer on a 64-bit system), %rax is used instead. Returning a function’s result in %eax or %rax is part of the calling convention that C and C++ code follows on this platform. We can see this if we try to access the value. Change main to look like the following:

int main() {
    int x = foo(4, 6);
    return 0;
}

You should now see assembly instructions in main that look like the following:

main:
.LFB1:
  pushq   %rbp
  movq    %rsp, %rbp
  subq    $16, %rsp      # This is new
  movl    $6, %esi
  movl    $4, %edi
  call    _Z3fooii
  movl    %eax, -4(%rbp) # Getting our answer from foo(4, 6)
  movl    $0, %eax
  leave                  # This is new too
  ret

Sure enough, the compiler knew to look for the output of function foo in the %eax register, and copied it onto the stack for later consumption (although it is never used). There are some new lines too, though. Why are we subtracting 16 from the stack pointer and calling leave at the end? The short answer is that the stack pointer is moved (allocating space on the stack) when a function needs to reserve room for local data. An alternative, of course, would be to keep all local data in registers, but that runs into the immediate problem of stomping on somebody else’s work when one function calls another. The leave instruction is shorthand for movq %rbp, %rsp followed by popq %rbp. This effectively releases the whole frame (all memory between %rsp and %rbp is gone) and resets the base pointer to the previous value saved on the stack. If this function was called from another one, this restores the caller’s stack frame. Intuitively, it is the inverse of what you see at the start of the function (the pushq and movq instructions).
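
If you want to see the stack reservation more plainly, here is a small experiment of my own (not from the original program): a function whose locals clearly need room. Compiled without optimizations, I would expect its prologue to contain a noticeably larger subq and its epilogue the familiar leave.

// Locals that can't all live in registers force the compiler to carve out
// stack space up front (subq $..., %rsp) and release it on the way out.
int sum_locals() {
    int values[16] = {0};            // 64 bytes of local storage
    int total = 0;
    for (int i = 0; i < 16; ++i)
        total += values[i];
    return total;                    // the result travels back in %eax, as before
}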

Heap Memory Access

Let’s shift gears a bit. We’ve seen that we can store data in two ways so far. We can move raw data directly into a register (or, presumably, to a place on the stack). We can also move data onto the stack by decrementing the stack pointer (akin to allocating stack space) and referencing that memory via offsets from the base pointer (e.g. -4(%rbp) for the first 4 bytes below it). Of course, this can’t be all there is, since there are only a few megabytes of stack space by default and a very limited number of registers to use. Let’s try interacting with the heap.

#include <cstdlib>
int main() {
    int *x = (int *) malloc(sizeof(int));
    int y = *x;
    free(x);
    return 0;
}

The relevant generated instructions are:

movl  $4, %edi
call  malloc
movq  %rax, -8(%rbp)
movq  -8(%rbp), %rax
movl  (%rax), %eax
movl  %eax, -12(%rbp)
movq  -8(%rbp), %rax
movq  %rax, %rdi
call  free

We see that malloc is called with a parameter of 4 in register %edi: the number of bytes to be allocated. It returns an 8-byte pointer in %rax, which we immediately move onto the stack. We then move the pointer from the stack right back into %rax (super optimized, right?). Note that because we are moving 8 bytes at a time in these instructions, we use movq as opposed to movl.

The next bit is somewhat interesting. We move (%rax) into %eax. The parentheses indicate that we are interested in what %rax points to, not the value of %rax itself. If, instead, we had copied the pointer (i.e. int *y = x;), the generated assembly would have been movq %rax, -16(%rbp). We then immediately move %eax onto the stack.
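
To make the contrast concrete, here is a small variation of my own on the program above with both forms side by side (the exact stack offsets in your output may differ):

#include <cstdlib>

int main() {
    int *x = (int *) malloc(sizeof(int));
    int  y = *x;  // dereference: load the 4 bytes x points to (movl (%rax), %eax)
    int *z = x;   // pointer copy: move the 8-byte address itself (movq %rax, -16(%rbp))
    free(x);
    return 0;
}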

Finally, we restore our pointer x into %rax, copy it into %rdi as the parameter to free, and call free to reclaim our memory. Already, you should start to appreciate the importance of having optimizations enabled in production code.
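
Incidentally, malloc handing its result back in %rax is the same convention our own pointer-returning functions follow. Here is a minimal sketch of my own you can compile and inspect; with optimizations on, I would expect it to reduce to little more than movq %rdi, %rax followed by ret.

// Returns a 64-bit pointer, so the result goes back in %rax rather than %eax.
int *first(int *a, int * /*unused*/) {
    return a;
}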

Let’s turn on the optimizer

If you compile the above code with optimizations on (i.e. passing -O3 to your compiler), you’ll find that essentially no code was generated. The optimizer knows that none of the data is actually used, so it throws away all the instructions used to manipulate them. To get the compiler to generate meaningful code with optimizations on, we’ll need to modify our program a bit.

#include <cstdlib>
int main(int argc, char* argv[]) {
    int *x = (int *) malloc(sizeof(int));
    *x = atoi(argv[1]);
    int y = *x;
    free(x);
    return y;
}

This program accepts input and returns it, so the compiler can’t possibly optimize the work away entirely. Below, I’ll reproduce both the unoptimized and the optimized code.

# unoptimized
main:
.LFB6:
  .cfi_startproc
  pushq   %rbp
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq    %rsp, %rbp
  .cfi_def_cfa_register 6
  subq    $32, %rsp
  movl    %edi, -20(%rbp)
  movq    %rsi, -32(%rbp)
  movl    $4, %edi
  call    malloc
  movq    %rax, -8(%rbp)
  movq    -32(%rbp), %rax
  addq    $8, %rax
  movq    (%rax), %rax
  movq    %rax, %rdi
  call    atoi
  movl    %eax, %edx
  movq    -8(%rbp), %rax
  movl    %edx, (%rax)
  movq    -8(%rbp), %rax
  movl    (%rax), %eax
  movl    %eax, -12(%rbp)
  movq    -8(%rbp), %rax
  movq    %rax, %rdi
  call    free
  movl    -12(%rbp), %eax
  leave
  .cfi_def_cfa 7, 8
  ret

# optimized
main:
.LFB14:
  .cfi_startproc
  subq    $8, %rsp
  .cfi_def_cfa_offset 16
  movq    8(%rsi), %rdi
  movl    $10, %edx
  xorl    %esi, %esi
  call    strtol
  addq    $8, %rsp
  .cfi_def_cfa_offset 8
  ret
  .cfi_endproc

The optimized code is considerably shorter, clearly uses fewer temporaries, and throws in a few instructions we haven’t seen before. You’ll note that it barely touches the stack as well. In addition, the calls to malloc and free disappear entirely. We also xor %esi with itself (effectively clearing it; an xor instruction is smaller and cheaper than a movl of $0), and our atoi call turns into a call to strtol (which atoi itself calls). All in all, not too shabby! The next example will demonstrate what is known as the return value optimization (RVO), and again, we’ll see the optimizer in action.
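
The atoi-to-strtol substitution is less magical than it looks: atoi is specified to behave like strtol with base 10 (error handling aside), so the optimizer is free to call strtol directly. A rough sketch of my own showing the relationship (not glibc’s actual implementation):

#include <cstdlib>

// atoi(s) behaves like strtol(s, nullptr, 10) narrowed to int, which is why
// the optimized build above can jump straight to strtol.
int atoi_like(const char *s) {
    return static_cast<int>(std::strtol(s, nullptr, 10));
}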

Investigating the RVO

Compile the following code with optimizations off, then on:

struct Foo {
    int foo;
    int bar;
    int zap[20];
};

Foo make_foo(int x, int y) {
    Foo f;
    f.foo = x + y;
    f.bar = x - y;
    f.zap[0] = x;
    return f;
}

The uninitiated might believe that this code will allocate a Foo on the stack, and then copy it out (somehow). Well, let’s see. You should get something akin to the following:

_Z8make_fooii:
.LFB0:
  .cfi_startproc
  pushq   %rbp              # store original base pointer
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq    %rsp, %rbp        # set start of new stack frame
  .cfi_def_cfa_register 6
  movq    %rdi, -8(%rbp)    # What is this??
  movl    %esi, -12(%rbp)   # int x
  movl    %edx, -16(%rbp)   # int y
  movl    -12(%rbp), %edx
  movl    -16(%rbp), %eax
  addl    %eax, %edx        # adding x and y
  movq    -8(%rbp), %rax
  movl    %edx, (%rax)      # moving x+y into memory %rax points to
  movl    -12(%rbp), %eax
  subl    -16(%rbp), %eax   # x - y
  movl    %eax, %edx
  movq    -8(%rbp), %rax
  movl    %edx, 4(%rax)     # store x-y at offset 4 from the address in %rax
  movq    -8(%rbp), %rax
  movl    -12(%rbp), %edx
  movl    %edx, 8(%rax)     # store x at offset 8 from the address in %rax (zap[0])
  nop
  movq    -8(%rbp), %rax    # return the destination pointer in %rax
  popq    %rbp              # unwind the stack
  .cfi_def_cfa 7, 8
  ret
  .cfi_endproc

Clearly, even in the unoptimized version, this function expects a Foo to have been allocated by the calling site, with its address supplied in %rdi. So don’t be afraid to write code that does this. As an exercise, rewrite the function to take a reference to a Foo as a parameter and look at the assembly it generates.
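
If you want a head start on that exercise, the variant might look something like the sketch below (make_foo_into is a name of my own choosing, and it reuses the Foo struct defined above); compare the assembly it generates with the listing you just saw.

// Hypothetical rewrite for the exercise: the destination is now an explicit
// parameter, so the caller's pointer arrives in %rdi much like the hidden
// return slot did above.
void make_foo_into(Foo &f, int x, int y) {
    f.foo = x + y;
    f.bar = x - y;
    f.zap[0] = x;
}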

Back to make_foo as written above, the optimized assembly looks like the following:

_Z8make_fooii:
.LFB0:
  .cfi_startproc
  leal    (%rsi,%rdx), %ecx # compute %rsi + %rdx and store the result in %ecx
  movq    %rdi, %rax
  movl    %esi, 8(%rax)     # store x in the right spot
  movl    %ecx, (%rdi)      # store x + y in the right spot
  movl    %esi, %edi
  subl    %edx, %edi        # compute x - y
  movl    %edi, 4(%rax)     # store x - y in the right spot
  ret
  .cfi_endproc

There are a few more tricks to be sure, but we can sort of imagine how the unoptimized assembly could be “fused” into this one. The best way to understand how an optimizer works is to realize that, by preserving invariants, the compiler can coalesce many instructions into one and elide needless register usage. These fusion operations are only performed when they are provably correct, so the semantics of the program won’t change.

Conclusion

At this point, hopefully, you are more comfortable moving around in assembly and understanding what the compiler does. Also, the methodology used here to essentially reverse engineer the compiler is a perfectly valid way to learn how things work on your own. You should have some intuition on how optimizations work, and maybe even have an inkling for what to look for when things go wrong.

Future installments of this series will look at the C++ runtime more closely, and at other languages to compare and contrast. We’ll also look at how JIT compilers work. Finally, we’ll try to go a bit more in depth into how these seemingly magical optimizations occur. As always, please leave any feedback, criticisms, corrections, or comments below.
