Learn to Read Disassembly for Fun and Profit

Learning to read disassembly can be daunting for the uninitiated. The output is foreign, barely annotated, and has little meaning without some background. This series isn’t so much about learning assembly from start to finish as about learning enough to get by, debug some things, and possibly optimize some things. More importantly, it should be enough that you’ll be able to reverse engineer, learn, or otherwise figure out whatever you might come across when reading disassembly. Note that this series of posts will only cover the C family of languages, since there are particular conventions that C-based runtimes follow; even so, much of it should transfer. This series is geared toward the novice or self-taught programmer, but may be a suitable review or hold something interesting for those already experienced with the material. The proposed articles are the following:

  1. Part I: Whirlwind Tour
  2. Part II: Virtual Function Dispatch
  3. Part III: JIT Compilers (and how to write your own)
  4. Part IV: Alternative Runtimes (Haskell and Rust)
  5. Part V: Assembly Optimizations

So without further ado…

Part I: Whirlwind Tour

Preliminaries: The empty main function.

Make a file called main.cpp with the following contents:

int main() {
    return 0;
}

Then, run g++ -std=c++11 main.cpp -S -c. The -c flag tells the compiler not to run the linker, and -S tells it to stop after compilation proper and emit assembly (here in the syntax of the GNU assembler, GAS) instead of an object file. You are encouraged to try this with Clang or MSVC, but do note that your results will differ. Also, I am using g++ 4.9.1 on Intel 64-bit hardware. The result should be a file that looks like the following:

    .file    "main.cpp"
    .text
    .globl   main
    .type    main, @function
main:
.LFB0:
    .cfi_startproc
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size    main, .-main
    .ident   "GCC: (GNU) 4.9.1 20140903 (prerelease)"
    .section .note.GNU-stack,"",@progbits

In the future, I won’t reproduce all the lines here for brevity, but this gives us a good baseline to start with. The lines ending in colons are labels; they do not contain instructions (although instructions can refer to them, as we shall see later on). The lines beginning with .cfi are “call frame information” directives with significance to stack unwinding and exception handling. The other lines that start with a dot are assembler directives that annotate how the file should be assembled. For now, if we ignore the assembler directives and look at the code following the label main, we see the important bits.

main:
    pushq    %rbp
    movq     %rsp, %rbp
    movl     $0, %eax
    popq     %rbp
    ret

Any values that look like %rbp or %eax are registers: small, fixed-size storage locations that live on the CPU. Registers whose names start with an r are 64 bits in width, and registers whose names start with an e are 32 bits in width (each e register is the lower half of the corresponding r register). The q suffix on an instruction stands for “quad word,” indicating a 64-bit operation; the l suffix stands for “long word,” denoting a 32-bit operation. Here’s the code again with some inline annotation.

main:
    pushq    %rbp       # save %rbp on the stack
    movq     %rsp, %rbp # store the value of %rsp in %rbp
    movl     $0, %eax   # store the value 0 in %eax
    popq     %rbp       # restore %rbp with the value saved on the stack
    ret                 # return from this function

So there are a few mysteries here. What is the point of the juggling done with %rbp and %rsp? What does ret actually do? To answer these, let’s modify our program so that main calls a different function.

Function Calling

void foo(int, int) {}

int main() {
    foo(4, 6);
    return 0;
}

If you look at the disassembly now, you should see a new global called _Z3fooii, which should give you some insight into how function names are mangled (the two is correspond to the two integer arguments). The function foo in assembly looks like the following (again, without assembler directives):

_Z3fooii:
.LFB0:
  pushq   %rbp
  movq    %rsp, %rbp
  movl    %edi, -4(%rbp)
  movl    %esi, -8(%rbp)
  popq    %rbp
  ret

In our main function, we have an additional 3 lines to call foo:

movl  $6, %esi
movl  $4, %edi
call  _Z3fooii

As you can see, the same charade with %rbp and %rsp happens again, so we can infer that these registers have special significance when it comes to program flow. In fact, %rbp and %rsp are special registers that refer to the base pointer and stack pointer respectively. A stack frame can be thought of as the temporary working space of a procedure or function. On most architectures, the stack grows downward, so lower memory addresses are “higher up” on the stack. We can see this in the lines:

movl  %edi, -4(%rbp)
movl  %esi, -8(%rbp)

where the parameters passed into foo are put on the stack 4 and 8 bytes below the base pointer, ready for consumption. They aren’t ever used, however. Let’s change that by modifying foo to look like the following:

int foo(int a, int b) {
    return a + b;
}

Now, the assembly for the function will look like:

movl   %edi, -4(%rbp)
movl   %esi, -8(%rbp)
movl   -4(%rbp), %edx
movl   -8(%rbp), %eax
addl   %edx, %eax

So we move the contents of %edi and %esi onto the stack, move them from the stack into %edx and %eax, then add the contents of %edx into %eax. So our answer is in %eax! This was a fairly long-winded way of getting at the answer, but the important thing here is to get used to the syntax. Also, note that %eax was also used for the return value from main. In the event that the returned value occupies 64 bits (like a pointer on a 64-bit system), %rax is used instead. Storing the result of a function in the %eax or %rax register is a convention of the C calling convention (on this platform, the System V AMD64 ABI). We’ll see this in action if we try to access the value. Change main to look like the following:

int main() {
    int x = foo(4, 6);
    return 0;
}

You should now see assembly instructions in main that look like:

main:
.LFB1:
  pushq   %rbp
  movq    %rsp, %rbp
  subq    $16, %rsp      # This is new
  movl    $6, %esi
  movl    $4, %edi
  call    _Z3fooii
  movl    %eax, -4(%rbp) # Getting our answer from foo(4, 6)
  movl    $0, %eax
  leave                  # This is new too
  ret

Sure enough, the compiler knew to look for the output of function foo in the %eax register, and copied it onto the stack for later consumption (although it is never used). There are some new lines, though. Why are we subtracting 16 from the stack pointer and calling leave at the end? The short answer is that the stack pointer is moved (allocating space on the stack) when a function needs to reserve room for local data. The alternative, of course, is keeping track of local data entirely in the available registers, which runs into the immediate problem of stomping on somebody else’s work when one function calls another. The leave instruction is shorthand for movq %rbp, %rsp followed by popq %rbp. This effectively releases the whole frame (all memory between %rsp and %rbp is gone) and resets the base pointer to the value previously saved on the stack. If this function was called from another one, the caller’s stack frame is thereby restored. Intuitively, this is the inverse of what you see at the start of the function (the pushq and movq instructions).

Heap Memory Access

Let’s shift gears a bit. We’ve seen that we can store data in two ways so far. We can move raw data directly into a register (or, presumably, to a place on the stack). We can also move data onto the stack by decrementing the stack pointer (akin to allocating stack space) and referencing the memory therein via offsets from the base pointer (e.g. -4(%rbp) for the 4 bytes just below it). Of course, this can’t be all there is, since there are only a few megabytes of stack space at typical defaults and a very limited number of registers to use. Let’s try interacting with the heap.

#include <cstdlib>
int main() {
    int *x = (int *) malloc(sizeof(int));
    int y = *x;
    free(x);
    return 0;
}

The relevant generated instructions are:

movl  $4, %edi
call  malloc
movq  %rax, -8(%rbp)
movq  -8(%rbp), %rax
movl  (%rax), %eax
movl  %eax, -12(%rbp)
movq  -8(%rbp), %rax
movq  %rax, %rdi
call  free

We see that malloc is called with a parameter of 4 in register %edi: the number of bytes to be allocated. It returns an 8-byte pointer in %rax, which we immediately move onto the stack. We then move the pointer from the stack back into %rax (super optimized, right?). Note that because we are moving 8 bytes at a time in these instructions, we are using movq as opposed to movl.

The next bit is somewhat interesting. We move (%rax) into %eax. The parentheses are used to say we are interested in what %rax points to, not %rax itself. If instead, we had tried to copy the pointer (i.e. int *y = x;), the generated assembly would have been movq %rax, -16(%rbp). We then immediately move %eax into the stack.

Finally, we restore our pointer x to %rax, copy it into %rdi as the parameter to free, and call free to reclaim our memory. Already, you should start to appreciate the importance of having optimizations on in production code.

Let’s turn on the optimizer

If you compile the above code with optimizations on (i.e. passing -O3 to your compiler), you’ll find that essentially no code was generated. The optimizer knows that none of the data is actually used, so it throws away all the instructions used to manipulate them. To get the compiler to generate meaningful code with optimizations on, we’ll need to modify our program a bit.

#include <cstdlib>
int main(int argc, char* argv[]) {
    int *x = (int *) malloc(sizeof(int));
    *x = atoi(argv[1]);
    int y = *x;
    free(x);
    return y;
}

This program accepts input and returns it, so the compiler can’t simply throw the instructions away. Below, I’ll reproduce the unoptimized and optimized code.

# unoptimized
main:
.LFB6:
  .cfi_startproc
  pushq   %rbp
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq    %rsp, %rbp
  .cfi_def_cfa_register 6
  subq    $32, %rsp
  movl    %edi, -20(%rbp)
  movq    %rsi, -32(%rbp)
  movl    $4, %edi
  call    malloc
  movq    %rax, -8(%rbp)
  movq    -32(%rbp), %rax
  addq    $8, %rax
  movq    (%rax), %rax
  movq    %rax, %rdi
  call    atoi
  movl    %eax, %edx
  movq    -8(%rbp), %rax
  movl    %edx, (%rax)
  movq    -8(%rbp), %rax
  movl    (%rax), %eax
  movl    %eax, -12(%rbp)
  movq    -8(%rbp), %rax
  movq    %rax, %rdi
  call    free
  movl    -12(%rbp), %eax
  leave
  .cfi_def_cfa 7, 8
  ret
# optimized
main:
.LFB14:
  .cfi_startproc
  subq    $8, %rsp
  .cfi_def_cfa_offset 16
  movq    8(%rsi), %rdi
  movl    $10, %edx
  xorl    %esi, %esi
  call    strtol
  addq    $8, %rsp
  .cfi_def_cfa_offset 8
  ret
  .cfi_endproc

The optimized code is considerably shorter, clearly uses fewer temporaries, and throws in a few new tricks. You’ll note that it barely touches the stack as well. In addition, the calls to malloc and free disappear. We also xor %esi with itself (effectively clearing it; an xor instruction is smaller and cheaper than a movl $0), and our atoi call turns into a call to strtol (which atoi itself calls). All in all, not too shabby! The next example will demonstrate what is known as the return value optimization (RVO), and again, we’ll see the optimizer in action.

Investigating the RVO

Compile the following code with optimizations off, then on:

struct Foo {
    int foo;
    int bar;
    int zap[20];
};

Foo make_foo(int x, int y) {
    Foo f;
    f.foo = x + y;
    f.bar = x - y;
    f.zap[0] = x;
    return f;
}

The uninitiated might believe that this code will allocate a Foo on the stack and then copy it out (somehow). Well, let’s see. You should get something akin to the following:

_Z8make_fooii:
.LFB0:
  .cfi_startproc
  pushq   %rbp              # store original base pointer
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq    %rsp, %rbp        # set start of new stack frame
  .cfi_def_cfa_register 6
  movq    %rdi, -8(%rbp)    # What is this??
  movl    %esi, -12(%rbp)   # int x
  movl    %edx, -16(%rbp)   # int y
  movl    -12(%rbp), %edx
  movl    -16(%rbp), %eax
  addl    %eax, %edx        # adding x and y
  movq    -8(%rbp), %rax
  movl    %edx, (%rax)      # moving x+y into memory %rax points to
  movl    -12(%rbp), %eax
  subl    -16(%rbp), %eax   # x - y
  movl    %eax, %edx
  movq    -8(%rbp), %rax
  movl    %edx, 4(%rax)     # move x-y into %rax offset by 4 bytes
  movq    -8(%rbp), %rax
  movl    -12(%rbp), %edx
  movl    %edx, 8(%rax)     # move x into %rax offset by 8 bytes
  nop
  movq    -8(%rbp), %rax    # return the Foo's address in %rax
  popq    %rbp              # unwind the stack
  .cfi_def_cfa 7, 8
  ret
  .cfi_endproc

Clearly, even in the unoptimized version, this function expects a Foo to have been allocated by the call site, with its address supplied in %rdi. So don’t be afraid to write code that returns structs by value. As an exercise, rewrite the function to take a reference to a Foo as a parameter and look at the assembly it generates.

The optimized assembly looks like the following:

_Z8make_fooii:
.LFB0:
  .cfi_startproc
  leal    (%rsi,%rdx), %ecx # compute %rsi + %rdx and store the result in %ecx
  movq    %rdi, %rax
  movl    %esi, 8(%rax)     # store x in the right spot
  movl    %ecx, (%rdi)      # store x + y in the right spot
  movl    %esi, %edi
  subl    %edx, %edi        # compute x - y
  movl    %edi, 4(%rax)     # store x - y in the right spot
  ret
  .cfi_endproc

There are a few more tricks here to be sure, but we can sort of imagine how the unoptimized assembly could be “fused” into this one. The best way to understand how an optimizer works is to realize that, by preserving invariants, the compiler can coalesce many instructions into one and also elide needless register usage. All fusion operations are provably correct, so the semantics of the program won’t change.

Conclusion

At this point, hopefully, you are more comfortable moving around in assembly and understanding what the compiler does. Also, the methodology used here to essentially reverse engineer the compiler is a perfectly valid way to learn how things work on your own. You should have some intuition on how optimizations work, and maybe even have an inkling for what to look for when things go wrong.

Future installments of this series will look at the C++ runtime more closely, in addition to other languages to compare and contrast. In addition, we’ll look at how JIT compilers work. Finally, we’ll try to go a bit more in depth into how these seemingly magical optimizations occur. As always, please leave any feedback, criticisms, corrections, or comments below.

Using Ref-qualifiers

Most C++ programmers who have drunk the C++11 koolaid are already familiar with rvalue references: namely, specifying that a particular function overload acts on rvalues. For example:

#include <iostream>

struct foo {
    void bar(int &) {
        std::cout << "l";
    }
    void bar(int &&) {
        std::cout << "r";
    }
};

int main() {
    int i = 1;
    foo f;
    f.bar(i);
    f.bar(3);
}

will output “lr”. There is a variation of this overload that applies to the object a member function is invoked on, known as the ref-qualifier.

#include <iostream>

struct foo {
    void bar() & {
        std::cout << "l";
    }
    void bar() && {
        std::cout << "r";
    }
};

int main() {
    foo f;
    f.bar();
    foo{}.bar();
}

This will also output “lr” because the compiler recognizes that f is an lvalue but foo{} is an rvalue (specifically, an unnamed temporary, or prvalue). So now that we know what ref-qualifiers are, why should we use them?

Traversing a DOM like structure

The example I will be discussing is a Selector object that can traverse the DOM. I chose the DOM because it is a fairly well-known construct in the programming world. For the uninitiated, the DOM is essentially a tree of named nodes. For the sake of this example, we’ll assume that nodes of a particular name are unique (only ids exist; no classes). Also, we’ll assume that everything is a div for simplicity, although this can easily be extended to handle different types of elements. For example:

<div id="1">
  <div id="2a">
    <div id="3"></div>
  </div>
  <div id="2b"></div>
</div>

The structure of this document looks like

    1
   / \
  2a  2b
 /
3

One way of allowing a user to traverse this is by chaining names. For example:

Selector s(document);
s["1"]["2a"]["3"] = "hello"; // access div with id "3" in div "2a"
s["1"]["2b"] = "world";

Of course, we’d also want to allow the user to cache a selector for faster access.

auto root = s["1"];
root["2a"] = "foo";

This interface is friendly and adheres to the principle of least astonishment (especially because it resembles common JavaScript APIs). An important constraint for the problem here is that we are not allowed to store pointers into the tree. This may seem like an artificial constraint; however, this sort of thing can happen if the DOM (or whatever it is) lives in some other black-box module that doesn’t permit such references (due to mutating state and what have you).

A basic implementation might look like

class Selector {
private:
    DOM &_doc;
    std::vector<std::string> _traversal;

    void _traverse() {
        for (const auto &elem : _traversal) {
            _doc.Get(elem);
        }
    }

public:
    Selector(DOM &doc) : _doc(doc) {}
    Selector operator[](const std::string &id) const {
        Selector s(*this);
        s._traversal.push_back(id);
        return s;
    }
    void operator=(const std::string &value) {
        _traverse();
        _doc.Set(value);
    }
};

This implementation assumes methods DOM::Get(const std::string &id) and DOM::Set(const std::string &value). To traverse the DOM, we store a vector of ids that we walk in Selector::_traverse(). Note that the way we allow the operator to be chained is by returning new copies of Selector. If we mutated the selector in place instead, the user wouldn’t be able to cache a selector chain and would have to rebuild the vector of ids each time.

The problem

Obviously, the implementation as is isn’t ideal. To see this, consider the following chain:

Selector one = s["1"];
one["2"]["3"]["4"] = "hi";

Here, the user caches a selector to the div with id “1”. Afterwards, the user attempts to traverse from “1” to “4”, passing through two intermediate nodes. The intermediate selections will generate unnecessary copies of Selector that get thrown away immediately! Of course, we could decide to have operator[] modify the Selector in place and forbid caching. But this could create weird side effects that would be exposed to the user.

Selector s{document};
s["1"]["2a"] = "hi";
s["1"]["2b"] = "there";

Here, it is clear what the user is trying to do, but after the first traversal, s has been mutated so that the _traversal vector now contains “1” and “2a”, and the second invocation will fail. Sure, we could provide the user with a means to “reset” the selector, but this is also error prone and doesn’t make for a good experience.

A better way… with ref-qualifiers!

It turns out that the information we really need is whether the selector we are applying operator[] to is something the user wants to keep, or something that will be thrown away. This is precisely what ref-qualifiers let us express!

class Selector {
private:
    DOM &_doc;
    std::vector<std::string> _traversal;

    void _traverse() {
        for (const auto &elem : _traversal) {
            _doc.Get(elem);
        }
    }

public:
    Selector(DOM &doc) : _doc(doc) {}

    // This is new
    Selector operator[](const std::string &id) const & {
        Selector s(*this);
        s._traversal.push_back(id);
        return s;
    }
    Selector &&operator[](const std::string &id) && {
        _traversal.push_back(id);
        return std::move(*this);
    }
    // End new stuff

    void operator=(const std::string &value) {
        _traverse();
        _doc.Set(value);
    }
};

Here is the new implementation. We have now overloaded operator[] to handle the cases where *this refers to an lvalue and an rvalue respectively, and changed nothing else.

Now, when we revisit the snippet from before…

Selector one = s["1"];
one["2"]["3"]["4"] = "hi";

the evaluation is completely different. The variable one is bound to a name and is an lvalue, so one["2"] will produce a new temporary selector using the operator[](const std::string &) const & overload. The next link in the chain, however, applies to a temporary, so the operator[](const std::string &) && overload is called instead and modifies the temporary Selector in place. This proceeds all the way to the end and we get the desired result.

Use of std::move

My first implementation of this code by hand actually had a bug in it! Namely, the code

Selector &operator[](const std::string &id) && {
    _traversal.push_back(id);
    return *this;
}

doesn’t return an rvalue but an lvalue. This is because even though *this referred to an rvalue, once it entered the scope of this function it got a name, and thus became an lvalue. The same thing happens with function arguments:

#include <iostream>

void foo(int &) {
    std::cout << "l" << std::endl;
}
void foo(int &&i) {
    std::cout << "r" << std::endl;
    foo(i);
}

Calling foo(3) here will output “rl” because the second overload of foo is reached first, but the value 3 is then bound to i and so becomes an lvalue. Thanks to Serge Pavlovsky for pointing out my error!

Conclusion

So now we’ve had our cake and eaten it too. In general, having the ability to distinguish between an unnamed temporary (rvalue) and an lvalue permits the programmer to exploit optimizations such as this one (performing operations in place vs. creating a pristine copy to work on). This is a common theme with value categories in general, so be on the lookout for such optimizations where they make sense.

This technique came into play when implementing something similar to the DOM selector here in a project called Selene. You can see the actual code in the file include/Selector.h.

Interfacing Lua With Templates in C++11 Conclusion

The code for this project is hosted here

This is the final part of a 3-part series: part 1, part 2, part 3.

This is the final writeup on my personal attempt to abstract the interface between Lua and modern C++11. The first part discussed the implementation of calling Lua functions from C++. The second part tackled the inverse problem, binding and calling C++ functions from Lua. To conclude the series, I will discuss the approach I take to binding C++ classes to Lua. Because many of the template metaprogramming methods were already discussed in the other two parts, this will be the shortest one of the series.

Lua as a language does not have objects built-in. Rather, it provides a powerful abstraction via tables and metatables. While it is possible to come up with elaborate schemes for mapping objects to Lua tables to support polymorphism, inheritance, and other OO concepts, this would likely require the Selene library to be bundled with a Lua package which is not the goal of the project (currently). As such, the implementation discussed here and provided by Selene opts for the simplest mapping possible. The hope is that this functionality will be sufficient for any user of Selene to implement whatever abstraction they may need on top of it with relative ease.

Following the pattern of the previous posts, we will first consider what the interface will look like as a guide.

Lua-C++-Object Interface

To keep things simple, we’ll allow the user to register specific instances of C++ objects. If the user wishes to instantiate C++ objects from Lua, it is simple to expose a C++ function that does so. Thus, we will not handle construction/destruction and C++ object lifetime, which may unnecessarily complicate usage of the library. The class we will try to bind to Lua is the following:

struct Foo {
    int x;
    Foo(int x_) : x(x_) {}

    int DoubleAdd(int y) { return 2 * (x + y); }
    void SetX(int x_) { x = x_; }
};

C++ does not support any sort of class reflection, so we have no choice but to have the user specify which member functions are to be exposed. This isn’t too problematic, as the user may want to name the functions differently in the Lua context anyway.

Foo foo(0); // Create an instance of Foo with x initialized to 0

// Create a context and register the instance
sel::State state;
state.Register("foo", foo,
               "double_add", &Foo::DoubleAdd,
               "set_x", &Foo::SetX);

// Change foo.x from 0 to 4 from Lua
state.CallField("foo", "set_x", 4);
assert(foo.x == 4);

// Call Foo::DoubleAdd on instance foo
int result = state.CallField<int>("foo", "double_add", 3);
assert(result == 14);

The Register function is overloaded to accept a name to bind the instance to in Lua, a reference to the instance, and a variadic list of alternating function names and member function pointers.

Implementation

Tackling member functions

The first thing we need is a sel::ClassFun object which allows us to call member functions from Lua. The sel::ClassFun class is actually very similar to the sel::Fun from part 2, the only distinction being that instead of being bound to a global, the sel::ClassFun is bound to a field of our instance table. The syntax for doing this is lua_setfield(l, index, name) where index is the position on the stack where the table resides and name is the name of the field we wish to bind the function to. Recall that the way sel::Fun worked was by binding a closure with a pointer to the sel::Fun instance as an upvalue. The same approach works here.

Next, we’ll need a sel::Class object to contain the instances of sel::ClassFun. The skeleton of the class looks like:

Class.h
namespace sel {
struct BaseClass {
    virtual ~BaseClass() {}
};

template <typename T, typename... Funs>
class Class : public BaseClass {
private:
    std::vector<std::unique_ptr<BaseFun>> _funs;

public:
    // TODO: constructor
};
}

To implement the member function registration, let’s start from the bottom by providing a function which registers only a single function to sel::Class.

template <typename Ret, typename... Args>
void _register_fun(lua_State *state,
                   T *t,
                   const char *fun_name,
                   Ret(T::*fun)(Args...)) {
    std::function<Ret(Args...)> lambda = [t, fun](Args... args) {
        return (t->*fun)(args...);
    };
    constexpr int arity = detail::_arity<Ret>::value;
    _funs.emplace_back(
        new ClassFun<arity, Ret, Args...>
        {state, std::string{fun_name}, lambda});
}

The usage of the lambda allows us to convert a member function into a non-member function by binding it to t, the pointer to the instance. This is what allows us to leverage the existing code used to implement sel::Fun. The detail::_arity struct is simply a type trait that inspects the return type to deduce how many values the function is expected to return (0 in the case of void, 1 in the case of a single value, or N in the case of a size-N tuple). Its definition is as follows:

namespace sel {
namespace detail {

template <typename T>
struct _arity {
    static constexpr int value = 1;
};

template <typename... Vs>
struct _arity<std::tuple<Vs...>> {
    static constexpr int value = sizeof...(Vs);
};

template <>
struct _arity<void> {
    static constexpr int value = 0;
};

}
}

Since the user may opt to bind multiple member functions, we’ll need to provide a function that recursively unpacks the arguments while making calls to sel::Class::_register_fun. Call it sel::Class::_register_funs for simplicity.

// base case
void _register_funs(lua_State *state, T *t) {}

template <typename F, typename... Fs>
void _register_funs(lua_State *state, T *t,
                    const char *name, F fun,
                    Fs... funs) {
    _register_fun(state, t, name, fun);
    _register_funs(state, t, funs...);
}

This is a slightly different pattern from what we’ve seen before. Notice that the function recursively peels off two arguments at a time from the parameter pack, expecting the pack to alternate between const char * names and function pointers. Code that attempts to call this function with an odd number of arguments will fail to compile.

With the _register_funs function, implementing the constructor of sel::Class is fairly trivial, requiring a call to lua_createtable and a call to _register_funs.

Class registration

The only thing left is to actually register the class with our sel::State object. Because the function requires a reference to the object instance, we can overload our existing sel::State::Register method.

template <typename T, typename... Funs>
void Register(const std::string &name, T &t,
              Funs... funs) {
    Unregister(name); // ensures the name is not used elsewhere
    auto tmp = std::unique_ptr<BaseClass>(
        new Class<T, Funs...>{_l, &t, name, funs...});
    _objs.insert(std::make_pair(name, std::move(tmp)));
}

An interesting point to note is that when declaring a new instance of sel::Class, the type arguments T and Funs... need to be passed explicitly, because template arguments cannot be deduced from arguments passed to a constructor.

Conclusion

And that’s a wrap for this series! If you’ve been following along, you now have a reasonably complete understanding of how Selene is implemented, sans a few more abstractions that have made the implementation easier in some cases. The library is still a work in progress, but it is slim enough that you should feel comfortable using it, experimenting with it, and perhaps adding your own functionality. The main takeaway is that clean interfaces can take a lot of work, but they are absolute time-savers in the long run, abstracting away much unnecessary boilerplate. If you come across similar boilerplate in the wild, hopefully some of the tools and patterns presented in this series will help you. As always, feel free to leave a comment or question.

Interfacing Lua With Templates in C++11 Continued

The code for this project is hosted here

This is the second part of a 3-part series: part 1, part 2, part 3.

If you’ve read the previous post on interfacing Lua with templates, you saw how variadic templates allowed us to avoid a lot of boilerplate in dealing with the Lua stack when calling Lua functions from C++. We started by imagining an ideal interface and came up with something like lua_state.Call<int>("add", 3, 5) to call a 2-arity Lua function bound to "add" and return the result. The Call function was variadic in both the return type and the arguments. To continue our development of this library (which I am calling Selene), we need to decide how to bind C++ functions to Lua.

Envisioning the interface

First, let’s see how Lua expects us to register a C function.

without_selene.cpp
/* include necessary Lua headers
    ....
*/

int my_add(lua_State *l) {
    float x = luaL_checknumber(l, 1); // reads a value from the stack,
                                      // verifying it is a number
    float y = luaL_checknumber(l, 2);
    const float result = x + y;
    lua_pushnumber(l, result);
    return 1; // return the number of values pushed as results
}

int main() {
    // open a lua context
    lua_State *l = luaL_newstate();

    // bind the function pointer to global "c_my_add"
    lua_register(l, "c_my_add", &my_add);
}

int main() {
    // open a lua context
    lua_State *l = luaL_newstate();

    // bind the function pointer to global "c_my_add"
    lua_register(l, "c_my_add", &my_add);
}

This doesn’t look too bad, but adhering to this interface is certainly not ideal. Every function we write that can be passed to lua_register must be of the form int (*)(lua_State *l), which is aliased to lua_CFunction. In addition, each function we write to expose to Lua in this way must manually push to and read from the stack, in addition to checking types in both directions. Also, because the Lua API requires us to pass in a C-style function pointer, we can’t use the newer std::function, lambdas, or traditional functor objects.

What this amounts to is boilerplate city! What could we envision this code looking like? Personally, I fancy something resembling the following:

with_selene.cpp
float my_add(float x, float y) {
    return x + y;
}

int main() {
    sel::State state;

    // Register function pointer
    state.Register("my_add", &my_add);
    float answer = state.Call<float>("my_add", 20, 22);
    assert(answer == 42);

    // Register lambda
    const std::string greeting = "hello";
    std::function<std::string(void)> fun = [greeting]() {
        return greeting;
    };
    state.Register("greet", fun);
    assert(greeting == state.Call<std::string>("greet"));
}

Much better! With an interface like this, we don’t have to worry about interacting with the Lua stack at all. This allows our function signatures to have actual return types and argument types as well. Also, we can register functions with state, like the lambda in the above example or with an std::function created from a functor object. So now we just need to implement it. Hold on tight!

Design Discussion

Because we want our registered functions to possibly capture state as closures or even be temporaries, we’ll want to create a new class that encapsulates them. Let’s call it sel::Fun. For the implementation, we note that using lua_register, which in turn calls lua_pushcfunction, is a bit tricky because we would need to pass a function pointer with signature int (*)(lua_State *). If we use a lot of template magic, we could create such a function if the actual worker function was known at compile time. However, this would not allow us to use functor objects, lambdas, and the like. Note that std::function is not compatible with ANSI C function pointers and attempting to cast them as such leads to madness (I actually tried this. I’m telling you, don’t try it. But if you figure it out please talk to me).

Fortunately, Lua allows us to pass what is called a C closure. That is, in addition to passing the function, we specify and push to the stack a number of upvalues which each invocation of the function will have access to later. Here is an example of this:

upvalues.cpp
int my_counter(lua_State *l) {
    int val = lua_tointeger(l, lua_upvalueindex(1)); // gets the first upvalue
    lua_pushinteger(l, ++val); // return the incremented value
    lua_copy(l, -1, lua_upvalueindex(1)); // update the upvalue
    return 1; // return the number of values pushed as results
}

int main() {
    // open a lua context
    lua_State *l = luaL_newstate();

    lua_pushinteger(l, 0); // initialize counter to 0
    // push the closure with a single upvalue
    lua_pushcclosure(l, &my_counter, 1);
    // name the closure
    lua_setglobal(l, "my_counter");
}

With this, we come up with the following strategy:

  1. Accept anything that is a function pointer or instance of std::function.
  2. Create an instance of sel::Fun to house the callable entity, templatized by the return type and arguments of the passed function (call it _fun).
  3. In the constructor of sel::Fun, create a C closure with the address of this object as the upvalue and a static dispatch function as the C closure (call it _lua_dispatcher).
  4. Implement in sel::Fun an Apply method which will be called from _lua_dispatcher
  5. Within the Apply method use variadic templates and template metaprogramming to retrieve arguments from the stack, call _fun, the original function, and push the returned values back onto the stack.

There are a number of other implementation details regarding sel::Fun ownership, the unregistration of functions, properly handling destruction of the parent sel::State object, and more but this should keep us occupied for now.

The dispatcher

The implementation of the dispatcher is very simple.

namespace sel {
namespace detail {
int _lua_dispatcher(lua_State *l) {
    BaseFun *fun = (BaseFun *)lua_touserdata(l, lua_upvalueindex(1));
    return fun->Apply(l);
}
}
}

The dispatcher retrieves the upvalue, which is a pointer to sel::BaseFun (the base class of sel::Fun), and invokes the Apply method. Note that the lua_State pointer is supplied anew each time this function is called, so it must be threaded through to Apply.

The sel::Fun object

Because we expect sel::Fun to be a template class, we should create a base class to allow us to create containers of sel::Funs and for _lua_dispatcher to work. Assume that everything that follows is part of the sel namespace.

Fun.h
struct BaseFun {
    virtual ~BaseFun() {}
    virtual int Apply(lua_State *l) = 0;
};

This is our barebones abstract class. Now for the main star:

Fun.h
template <int N, typename Ret, typename... Args>
class Fun : public BaseFun {
private:
    using _fun_type = std::function<Ret(Args...)>;
    _fun_type _fun;
    std::string _name;
    lua_State **_l; // Used for destruction

public:
    Fun(lua_State *&l,
        const std::string &name,
        Ret(*fun)(Args...)) : Fun(l, name, _fun_type{fun}) {}
    Fun(lua_State *&l,
        const std::string &name,
        _fun_type fun) : _fun(fun), _name(name), _l(&l) {

        // Add pointer to this object to the closure
        lua_pushlightuserdata(l, (void *)static_cast<BaseFun *>(this));

        // Push our dispatcher with the above upvalue
        lua_pushcclosure(l, &detail::_lua_dispatcher, 1);

        // Bind it to the specified name
        lua_setglobal(l, name.c_str());
    }
    Fun(const Fun &other) = delete;
    Fun(Fun &&other)
        : _fun(other._fun),
          _name(other._name),
          _l(other._l) {
        other._l = nullptr; // mark the moved-from object so its
                            // destructor won't unbind the name
    }
    ~Fun() {
        // This destructor will unbind the name in lua, provided that
        // this object has not been moved and the parent context is
        // still alive (hence the pointer to the pointer)
        if (_l != nullptr && *_l != nullptr) {
            lua_pushnil(*_l);
            lua_setglobal(*_l, _name.c_str());
        }
    }

    // TODO implement Apply
};

This is a big chunk of code to take in all at once but hopefully the inlined comments alleviate some of the potential confusion. The important bit is in the constructor where we pass our generic dispatcher along with the pointer as a closure to lua. The remaining bits are implementation details that take care of what happens if the parent state is destroyed or if this function object is itself moved or destroyed. All that’s left is the Apply method.

Implementing sel::Fun::Apply

The order of operations we’ll need to handle are:

  1. Read all arguments from the stack into a tuple of type std::tuple<Args...>
  2. Call the member function _fun with those arguments by converting the tuple back into a parameter pack.
  3. Push the returned value(s) onto the stack.
  4. Return the number of values returned so Lua knows how many values to pop on its end.

First, let’s implement a helper function called _get_args. We could use recursion like I did in the last post, but I’ll use a different technique here since it’s arguably faster (both at compile and runtime). My implementation of Pop<std::tuple<T...>> has also changed but I will leave the other post unchanged since the type recursion technique is a good one to know.

The technique we will use is a common pattern called tag dispatching, where we pass an empty struct whose type encodes important information. In this case, we want to bind the nth element of the tuple to the nth element on the stack (counting from the bottom). Thus, we’d love to expose to our function something with a type containing a parameter pack of the form <std::size_t... N>. There are many ways to do this, so what follows is just what I find easiest.

namespace detail {
// Our "tag"
template <std::size_t... Is>
struct _indices {};

// Recursively inherits from itself until...
template <std::size_t N, std::size_t... Is>
struct _indices_builder : _indices_builder<N-1, N-1, Is...> {};

// The base case where we define the type tag
template <std::size_t... Is>
struct _indices_builder<0, Is...> {
    using type = _indices<Is...>;
};
}

We can create a dummy struct of type _indices<0, 1, 2, 3>, for example, by writing typename _indices_builder<4>::type(). The typename keyword is required because dependent names are assumed to be values rather than types by default. Note also that our index tag is 0-indexed. Now, we can implement our _get_args function.

namespace detail {
template <typename... T, std::size_t... N>
std::tuple<T...> _get_args(lua_State *l, _indices<N...>) {
    // Assume existence of function _check_get<T>(l, i) which will
    // read the ith element in the stack, ensure it is of type T, and
    // return it
    return std::make_tuple(_check_get<T>(l, N+1)...);
}

template <typename... T>
std::tuple<T...> _get_args(lua_State *l) {
    constexpr std::size_t num_args = sizeof...(T);
    return _get_args<T...>(l, typename _indices_builder<num_args>::type());
}
}

Notice the N+1 in the make_tuple expansion to account for the fact that Lua stacks are 1-indexed from the bottom. Now that we have a tuple with all the correct arguments read from the stack, we can call the function. Unfortunately, we have a tuple instead of the parameter pack we need to call the function. Let’s write a function that can do that conversion for us.

namespace detail {
template <typename Ret, typename... Args, std::size_t... N>
Ret _lift(std::function<Ret(Args...)> fun,
          std::tuple<Args...> args,
          _indices<N...>) {
    return fun(std::get<N>(args)...);
}

template <typename Ret, typename... Args>
Ret _lift(std::function<Ret(Args...)> fun,
          std::tuple<Args...> args) {
    return _lift(fun, args, typename _indices_builder<sizeof...(Args)>::type());
}
}

The function std::get<I> can be used to extract the Ith element of a tuple, so fun(std::get<N>(args)...) will correctly call the function with the contents of the tuple as a comma-separated list. Also, as before, we provide an overload which will automatically dispatch the necessary index tagging. Neat!

The last step of our apply function is to push the returned arguments back to the stack. We can use the techniques at our disposal to write a function of the form detail::_push(lua_State *l, std::tuple<T...> values) and overload it for singular values but I will leave this as an exercise to the reader. Finally, we can piece everything together and write sel::Fun::Apply:

template <int N, typename Ret, typename... Args>
class Fun : public BaseFun {
// ... as before
int Apply(lua_State *l) {
    std::tuple<Args...> args = detail::_get_args<Args...>(l);
    Ret value = detail::_lift(_fun, args);
    detail::_push(l, std::forward<Ret>(value));
    return N;
}
};

Registering the function

We still haven’t exposed our sel::Fun class to the wild. For now, we’ll allow the user to call Register, a new public method in sel::State, with a name and a callable entity. The callable entity can be a function pointer or an std::function. I provide the implementation for the std::function case below:

template <typename Ret, typename... Args>
void Register(const std::string &name,
              std::function<Ret(Args...)> fun) {
    auto tmp = std::unique_ptr<BaseFun>{
        new Fun<1, Ret, Args...>{_l, name, fun}};
    // insert Fun object in a member std::map<std::string,
    // std::unique_ptr<BaseFun>>
    _funs.insert(std::make_pair(name, std::move(tmp)));
}

We’d also need to handle the case when Ret is a tuple type so we can properly pass the first template argument to Fun (the number of return values).

Wrapping up

With this, we have almost everything we need to pass arbitrary functions to Lua, and we can do things like call a function from Lua which calls a C function which calls a Lua function, etc. See the unit tests for some concrete examples. From here, we need to add support for C modules, coroutines, and continuations. We are also still missing an abstraction for dealing with table types. However, a lot is already possible because std::function is such a powerful abstraction. Feel free to follow the project and let me know if you have any comments, suggestions, or if you found any of this helpful.

Interfacing Lua With Templates in C++11

The code for this project is hosted here

This is the first part of a 3-part series: part 1 – part 2 – part 3

Why do I need to know this?

random C++ programmer

Template programming in C++11 is often met with furrowed brows. Techniques associated with template programming are often written off as something for the experts to worry about. While integrating Lua with the game engine I’ve been writing, I came across a nifty and useful way to use some of the new template features in C++11. Other bindings to Lua exist but I opted to write my own because the others didn’t quite do what I wanted them to (mine is still a WIP, hosted on Github here). The goal of this post is to provide a use case for template metaprogramming in C++ and showcase some of the new C++11 syntax for variadic templates (templates with a variable number of arguments). Hopefully, it will motivate you to delve even deeper into metaprogramming if you haven’t already. To read this, I assume you already know what a template is and you are somewhat comfortable with C++ syntax.

The Lua C API

To understand the problem, we need to first understand the Lua C API. C (or C++) and Lua communicate through a stack. Invoking a Lua-function from C involves the following steps:

  1. Push the function name to the stack
  2. Push each function argument to the stack
  3. Invoke lua_pcall, supplying the number of arguments and the number of return values
  4. Pop the returned values from the stack

Because the Lua API is C based, it provides a different function for each type that can be pushed or popped from the stack (for example, lua_tointeger and lua_pushboolean). Performing the above steps can be really cumbersome and result in a lot of code. Suppose we had a simple Lua file that looked like:

function subtract_and_hello(a, b)
    return a - b, "hello"
end

This is a lua function that takes two arguments, subtracts them, and returns the difference along with a string. Calling this function with the basic syntax might look like this:

#include <string>
#include <cassert>

extern "C" {
#include "lua.h"
#include "lauxlib.h"
#include "lualib.h"
}

int main() {
    lua_State *l = luaL_newstate();
    luaL_dofile(l, "example.lua");

    // Push function name to the stack
    lua_getglobal(l, "subtract_and_hello");
    // Push two numbers to the stack
    lua_pushinteger(l, 1);
    lua_pushinteger(l, 3);

    // Call the function
    const int num_args = 2;
    const int num_return_values = 2;
    lua_call(l, num_args, num_return_values);

    // Note that the stack is 1-indexed from the bottom
    const int diff = lua_tointeger(l, 1);
    const std::string greeting = lua_tostring(l, 2);

    // Pop the returned values from the stack
    lua_pop(l, 2);

    // Check that things worked correctly
    assert(diff == -2 && greeting == "hello");

    // Clean up the lua context
    lua_close(l);
}

This is a lot of code to call one function! Not only that, but there are plenty of ways it can go wrong. We could have mixed up the types, or popped or pushed values in the wrong order. We may have also forgotten to pop the correct number of values at the end.

Imagining the perfect interface

Here is how I’d want to call a function in Lua. First, I’d want a C++ object to encapsulate the lua_State pointer and automatically free it when it goes out of scope. Call it State for simplicity. The improved code with our hypothetical interface might look like this:

#include <string>
#include <cassert>
#include "selene.h"

int main() {
    sel::State state;
    state.Load("example.lua");

    int difference;
    std::string greeting;
    std::tie(difference, greeting) = state.Call<int, std::string>("subtract_and_hello", 1, 3);

    assert(difference == -2 && greeting == "hello");
}

Cleaner, huh? All the magic happens in the line

std::tie(difference, greeting) = state.Call<int, std::string>("subtract_and_hello", 1, 3);

This binds a returned tuple of type std::tuple<int, std::string> to the int and std::string on the left. Note that nowhere did we need to specify the number or types of arguments or returned values. All of that was inferred from the function call itself.

Implementing sel::State

So now that we’ve envisioned a cleaner, more usable, and less error-prone interface, we need to set about implementing it. Let’s start with a shell of the class definition.

#pragma once

namespace sel {
class State {
private:
    lua_State *_l;
public:
    State() : _l(nullptr) { _l = luaL_newstate(); }
    State(const State &other) = delete;
    State &operator=(const State &other) = delete;
    State(State &&other) : _l(other._l) { other._l = nullptr; }
    ~State() { if (_l == nullptr) return; lua_close(_l); }
};
}

This gives us an object that when instantiated, will create a Lua context and automatically free it when its destructor is invoked. It also prohibits copying while permitting moving. If you aren’t familiar with move semantics, you don’t need to understand it to continue to follow along, but I highly recommend you learn that topic.

Push and Read

Now we need to be able to interface with the stack. To push to the stack, we opt for the simple approach of functions overloaded by argument.

void Push(const int value);
void Push(const bool value);
...

A sample implementation for one of these might look like:

void State::Push(int value) {
    lua_pushinteger(_l, value);
}

For reading, we can’t use the same approach because C++ does not allow function overloading based on return type (only a few languages do). Thus, we rely on a template method with signature:

template <typename T> T Read(int index) const;

To use this function, we’ll need to provide explicit specializations for all the basic types supported by the API. I provide two examples below.

template<>
float State::Read(int index) const {
    return lua_tonumber(_l, index);
}
template<>
bool State::Read(int index) const {
    return lua_toboolean(_l, index);
}

With these functions, we can now invoke Push(4) and Read<int>(1) to push 4 to the stack and read it back.

Handling multiple values at a time

Our next step is to make it convenient to push and pop multiple values at a time. Let’s handle pushing multiple values first. To allow something like Push(1, "hi", true), we add in addition to the other push functions above, the following template function:

template <typename T, typename... Ts>
void Push(const T value, const Ts... values) {
    Push(value);
    Push(values...);
}

The ellipses (...), if you haven’t seen them before, signify a parameter pack, making this a variadic template function. This particular function can take a variable number of arguments but must take at least one. The compiler will pattern match the first argument to this function against const T value, and the function recursively calls itself until it encounters one of the specific overloads from the previous section.

Now, we can do something like

int main() {
    sel::State state;
    state.Push(1, "hi", true);
}

and the stack will contain an integer, a string, and a boolean. Now, we need to implement multi-value reads. Since we usually want to pop these values afterwards, we’ll just implement it as a multi-value pop. This is a bit trickier because, as before, we can’t rely on argument overloading since the varying types are embedded in the return value. In addition, we’d like the function to return a boxed tuple if multiple values are requested, and an unboxed normal value if a single value is requested.

To do this specialization, we’ll use a sort of intermediate struct to differentiate between the two cases. This is called a type-traits class.

// General type trait struct
template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;
    static type apply(State &s) {
        // todo
    }
};

// Specialization for singular type
template <typename T>
struct _pop<T> {
    typedef T type;
    static type apply(State &s) {
        // todo
    }
};

template <typename... T>
typename _pop<T...>::type Pop() {
    // Store the elements requested
    return _pop<T...>::apply(*this);
}

Notice that the return type of Pop is embedded in the _pop helper struct, which the compiler chooses based on the number of template arguments passed. The T... in the invocation of apply will expand out to all the template arguments that were passed in originally.

Implementing the singular case of _pop<T> is relatively straightforward:

template <typename T>
struct _pop<T> {
    typedef T type;
    static type apply(State &s) {
        // Read the top element (negative indices count from the top)
        T ret = s.Read<T>(-1);

        // Remove it from the stack
        lua_pop(s._l, 1);
        return ret;
    }
};

For the multi-valued _pop, we’ll need to use recursion to build a tuple. A first try at an implementation might look like:

template <typename T>
T PopBottom(State &s) {
    // Read the bottom element
    T ret = s.Read<T>(1);

    // Remove the bottom element
    lua_remove(s._l, 1);

    return ret;
}

template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;
    static type apply(State &s) {
        // PopBottom already removes each element from the stack
        return std::make_tuple(PopBottom<Ts>(s)...);
    }
};

However, this doesn’t work reliably. To see why it might appear to work, we note that std::make_tuple(PopBottom<Ts>(l)...) will expand to

std::make_tuple(PopBottom<T1>(l), PopBottom<T2>(l), ..., PopBottom<Tn>(l));

if there are n template arguments passed. While this would work if the functions were guaranteed to evaluate from left to right, the C++ standard makes no guarantees about such an evaluation order. In fact, g++ v4.8.2 will invoke each argument to make_tuple from right to left, resulting in a tuple that is the reverse of what we want! Thus, we need to use recursion to enforce the order of execution.

template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;

    // base case: creates a tuple containing one element
    template <typename T>
    static std::tuple<T> worker(const State &s, const int index) {
        return std::make_tuple(s.Read<T>(index));
    }

    // inductive case
    template <typename T1, typename T2, typename... Rest>
    static std::tuple<T1, T2, Rest...> worker(const State &s,
                                              const int index) {
        std::tuple<T1> head = std::make_tuple(s.Read<T1>(index));
        return std::tuple_cat(head, worker<T2, Rest...>(s, index + 1));
    }

    static type apply(State &s) {
        auto ret = worker<Ts...>(s, 1);
        lua_pop(s._l, sizeof...(Ts));
        return ret;
    }
};

Recursion is a must here since there is no way to iterate through a parameter pack. If you’ve programmed in Lisp, Haskell, Erlang, or something similar, the code above will look pretty familiar. Note that the template type signature <typename T1, typename T2, typename... Rest> is important to signify a function that has at least 2 template arguments since a parameter pack can have zero or more elements. If we used <typename T1, typename... Rest> instead, the code would fail to compile since it would have an ambiguous overload with the previous function with template type signature <typename T>.

With this, we can now pop multiple values from the stack!

int main() {
    sel::State state;
    state.Push(1, "hi", true);

    int a;
    std::string b;
    bool c;
    std::tie(a, b, c) = state.Pop<int, std::string, bool>();

    assert(a == 1 && b == "hi" && c);
}

Implementing Call

Finally, we can implement the Call member function to do what we initially intended.

template <typename... Ret, typename... Args>
typename _pop<Ret...>::type Call(const std::string &fun,
                                 const Args&... args) {
    // Push function to stack
    lua_getglobal(_l, fun.c_str());

    const int num_args = sizeof...(Args);
    const int num_ret = sizeof...(Ret);

    // Push all arguments to the stack using our variadic Push
    Push(args...);

    // Call the function
    lua_call(_l, num_args, num_ret);

    // Return all the results and remove them from the stack
    return Pop<Ret...>();
}

This time, we have a mixture of variadic template arguments and function arguments. All the types and values passed after the function name will be bound to the type parameter pack Args... and value parameter pack args.... Again, we use the built-in sizeof... operator to determine the number of arguments and return values (its behavior is easy to replicate as an exercise, by the way).

This gets us most of the way there, but Call will fail in a critical way if we try to call a Lua function with no arguments. To see why, we recognize that args... will be an empty parameter pack so

Push(args...);

will be invoked as

Push();

which we never implemented (we implemented Push taking at least one argument). To circumvent this, we can simply implement Push():

void Push() {}

Returning zero arguments

Astute readers will note that our implementation will fail if we try to call a function returning no results. This is because our Pop function expects to do some work and will return either an unboxed value or a tuple. Thus, we’ll need to modify the Pop function and the _pop type-trait struct to accommodate this.

There are several ways to do this, but this way felt the most natural to me (feel free to comment if you think otherwise). First, we add a non-type template parameter to _pop that denotes the number of values we expect to pop. The base struct signature becomes

template <size_t, typename... Ts>
struct _pop {
    ...
};

Our existing partial specialization occurs when this size_t is 1.

template <typename T>
struct _pop<1, T> {
    ...
};

Now, we need to add a new partial specialization for when the size argument is 0:

template <typename... Ts>
struct _pop<0, Ts...> {
    typedef void type;
    static type apply(State &s) {}
};

Now, our Pop function can return nothing if no types are required with a slight implementation tweak:

template <typename... T>
typename _pop<sizeof...(T), T...>::type Pop() {
    return _pop<sizeof...(T), T...>::apply(*this);
}

Perfect forwarding

There is one slight defect with the implementation that I will discuss briefly but not fully outline. In this implementation, when parameter packs are forwarded from one function to another, the constness is preserved but not the lvalue/rvalue property. This isn’t too critical here since most of the types we are passing are basic types, with the exception of std::string. However, later on, we may wish to pass C++ objects that represent tables, and preserving lvalue/rvalue properties will be important for performance, lest we create unnecessary copies everywhere.

The syntax to perfectly forward a parameter pack looks like

template <typename... T>
void foo(T&&... args) {
    bar(std::forward<T>(args)...);
}

This way, if an lvalue is passed in, it will remain an lvalue when threaded through to bar. Similarly, if an rvalue is passed in, it will be passed as an rvalue as well.

Wrapping up

After all the type-fu above, we finally have a class that provides us the clean type-aware Call method we were looking for. As mentioned in the beginning, the code for this set of Lua bindings is still in progress but can be found on my github here. I don’t expect this overview to be at all comprehensive, but I hope at the very least that it has demonstrated to some degree what the C++11 type system is capable of. As a final exercise, I highly recommend looking at the disassembly of the executable created by the compiler whenever you metaprogram, to understand the runtime characteristics of the code.

The Difficulty of Writing Fast Programs

Item 1: Writing fast programs is hard

Writing fast programs is hard because your semantics are generally useless now. Gone are niceties like dynamically managed memory, reference counted pointers, garbage collection, and any other constructs to hold your hand. Your program is going to look ugly. Polymorphism isn’t your friend. Every feature hates you. N.B. it’s not actually this bad usually, but it can certainly feel this way sometimes.

Item 2: Designing fast programs is hard

Even harder than writing the fast program is conceiving it in the first place. You’ve just come up with this brilliant encoding scheme for your data? What do you do when your ideal word size is actually 65 bits? Came up with the perfect GPU kernel code? Sorry, it actually requires one more SIMD lane than you have, or one fewer.

Designing fast programs is hard because you need to think at the hardware level, and you know what? Those hardware designers definitely didn’t have your problem in mind when they made their chip.

Item 3: Testing fast programs is hard

Ugh. So you’ve spent all that time tuning this stallion of a war horse program? It only works sometimes but SEGFAULTS randomly? Well why don’t you unit test better?

Oh, right. If your code is structured to be unit-testable, it’s unlikely that it’ll be blazingly fast too. Unless you’re some sort of demigod that can write testable fast code. Your code will refuse to orient itself in any unit testable fashion, and sure enough, when you refactor the world because of some cache miss in some inner loop, all your tests will commit seppuku.

Item 4: Making programs fast is hard

Your profiling can be as useful as a boy scout compass on a steel ship. Your code runs great on your machine but flows like molasses on your friend’s machine (with newer hardware). You manage to make your program fast, but the inclusion of one extra element in your data set causes some anomalous branching in your code that slows everything to a halt again. Worst of all, your program is slow because you failed at the design, and the thing you optimized to death turned out to be the least significant factor.

The future of writing fast programs

My hope lies in something similar to Haskell. Give me a smart compiler that can optimize the code for me, decide how to execute it, and do the right thing. That lets me manage my own memory in a type safe way and will complain loudly if there is a possibility of leakage. That issues a warning if I may incur a cache miss because my data isn’t aligned properly. Better yet, it aligns the data for me. Give me a compiler that is largely functional but that inlines aggressively so I can have unit testable code without needing to restructure the entire program around my tests. Best of all, give me a compiler that automagically vectorizes my code and compiles the relevant pieces to GPU kernel code, all while keeping memory bandwidth in mind and suggesting changes to allow me to push more code to the GPU if I so choose.

Until I get this magical compiler, I suppose I will get cozy with C++ in the meantime.

Goodbye Quark, Hello World

I gave my notice to Quark Games Inc almost two Fridays ago with mixed feelings. While there, I had many opportunities to achieve, learn, contribute, and teach. I am unquestionably a better engineer now than I was a year ago, and I have Quark to thank for that. However, I am also optimistic about what the future holds. This post seeks to document what I’ve done in the recent past and discuss where I want to go in the future.


Cassandra Hadoop Programming

My first task as an intern was to work on a BI system built on Cassandra (using DataStax as the provider). I wrote a number of Pig/Hive map reduce jobs and became familiar with the map reduce framework (thanks Alex Tang for spearheading this and showing me the MR ropes). I also leveraged preexisting knowledge to write a number of R scripts to generate cool visuals. Unfortunately, the system proved a bit expensive to use due to the overwhelming volume of data and the inability to sample the data effectively.

Valor guild request/invite system

This was my first server feature on the Rails web server. Many thanks to Albert Wang (a product manager at the time) for helping me with my questions and familiarizing me with the codebase. Also thanks to Eric Liaw, Tyler Margison, and Juan Delgado for answering my questions when I was pestering Albert too much.

Implement a Git Rebase Workflow

Code reviews were happening at a snail’s pace due to over the shoulder reviewing. I introduced the Pull Request system (actually leveraging the distributed portion of Git as a DVCS) and the git rebase workflow. The team was very receptive and quick on the uptake.

Write Deploy Scripts for Valor

I was a fledgling at automation at the time but I knew that things should be more automated than they were. Because Valor existed across many deployed environments, we needed a way to manage and know, a priori, what code was deployed where. I began to learn more about the deployment philosophies of Engine Yard and Chef. Again, the team was quick to adopt the new methodology and there was a bit more sanity in the QA process as a result.

Valor Competitions

I implemented Valor competitions shortly afterward. This was more or less my last task as an intern before I was converted to full time as a Senior Engineer. As a feature, competitions were not very difficult but taught me a lot about the product feature cycle. Many things were pushed to last minute and compromised the timeline as a result. I also learned (again) how hard automation was. Scoring a competition in particular, was a batch job requiring tens of thousands of database queries. Writing this in a way that it could finish without f*cking everything up was quite challenging.

Map Data Encoding

Everything was JSON, and considering that the bulk of our data was small single digit numbers, you can imagine the size of our request and response bodies. Ginormous! Our company hosted an awesome hackathon (thanks Carol Liaw!) at which time I had an idea to encode our map data as binary to allow players to potentially zoom out and in to maps. The most elegant way I could think of to encode the data was in a lossless PNG, mainly because the (x, y) coordinates of the world mapped well to each pixel, and I calculated that I could store all the data I needed in four 8-bit channels. Plus, you could easily drop in an existing PNG library to query pixel colors (a fast operation). The prototype was completed thanks to Calvin Hsu, my partner in crime and it eventually found its way into production several months later (I had moved on to a different team by that time).

Prototype Valor in Erlang

Valor was plagued by scalability issues and one of my thoughts was to rewrite some of the core game logic. At the time (and still), the core logic used lazy evaluation in Ruby and required a number of lookups to succeed. This was awful for user response time, and also server costs. My main thought was to write the heavy logic in a more performant language that made it easy to evaluate things eagerly. This was my first exposure to Erlang, although I had dealt with functional languages before (Haskell, R, Lisp, ML, etc). I actually evaluated a number of other solutions for the task. The top contenders were Clojure, Scala, Java, and C++. Of these, Erlang seemed like the one that could produce results, even for a fast prototype. I managed to build a system that would easily handle the load, with no exceptions or errors for a prolonged period of time (8 hours for my first test) on a single machine!

Alas, I never got to implement this system in the wild, but the experience was valuable nonetheless. Instead of optimizing the game logic, a decision was made to optimize the database layer instead (replacing memcached instances with Couchbase). Couchbase would prove to be a horrible choice; more on this later.

Port Valor from Rails 2.3.5 (Ruby 1.8.7) to Rails 3.2.12 (Ruby 1.9.3)

This was done from a Friday morning to Monday morning. I saw the sunrise on Sunday. The change was over 10000 lines of pure hell, filled with gem incompatibilities, library changes, API changes, the works. I couldn’t even manage to get the application to boot until Saturday evening (this is after 22 hours of work). I had also worked fast on Thursday to finish things early so I could work on this “side project” on Friday without anybody missing a beat. Fortunately, things worked and the code itself was deployed a week and a half later. I was fairly pleased with this accomplishment because it is one of the few things I started that I had no idea if it would work out or not. I was fully prepared however, to utterly fail by Monday in exchange for what would hopefully be an interesting learning experience.

What did I learn? Well, first and foremost, keep up to date with the latest and greatest if you can. Having your stuff break early and often is definitely the way to go. Also, tests are great. Why? Because the codebase at the time had ZERO tests, which made sweeping changes a nightmare. Incidentally, tests were something I began pushing for around this time. Finally, I learned to read the Rails source code. I’ve more or less sworn off from using Rails again, but understanding how a large framework was built would prove useful when it came time for me to build it myself.

Many thanks to Eric Liaw for helping me see this upgrade through. Also, thanks to Tyler Margison for helping me catch basically the only bug that happened on production (old Memcached keys were not evicted, and since the namespacing rules on the client had changed, the Memcached instance was perpetually OOM).

Prototype New Games in Unity

The Rails 2 –> 3 conversion was the last thing I did for Valor. Afterwards, I was part of a 3 person team (with Calvin Hsu and Timmie Wong) to prototype our next game. Coming from server land, it was fun to program on the client for once. I learned C#, and also why people love it so much. More importantly, I learned Unity and experienced using a game engine in a professional setting. Sure I had dabbled with Unreal, Flixel, Flashpunk, and more before, but it was different when I had other people to work with and an actual concept I was going for. Most importantly, it really solidified in my head things that I liked in a game engine, things I disliked, and things I really wanted (that didn’t exist). When I build a game engine in the future (not a question of if, but when), this will definitely come in handy.

Architect Champs

Finally, I architected Champs. Taking everything that I had learned from Valor, I changed almost everything. Champs was built to have:

  1. Stateful requests/responses
  2. In-application caching (easy invalidation)
  3. Socket based communication (made presence/chat and more easy drop ins)
  4. Compact protocols (initially JSON for prototyping purpose but changed to Protocol Buffers)
  5. Full test coverage (integration and unit tests, albeit not comprehensive)
  6. Low response times (application response time means are measured in tens of microseconds)
  7. Security (SSL based encryption, bcrypt encoded passwords and tokens, session invalidation, obscured API, single session logins, email validation, etc.)
  8. Smart modularity and well organized code
  9. A CAP theorem-first approach to mitigating race conditions and ensuring consistency where consistency was important, and availability when availability was important.
  10. Live configuration updates
  11. Code hot swapping
  12. Efficient and inexpensive to run (part of this is just avoiding EC2 like the plague)

On the deployment side, I learned Chef and created a working deployment in about a week’s time with:

  1. Aggressive Linux tuning for security and performance
  2. Comprehensive firewall rules
  3. A sound failover and load balancing strategy

In the course of writing and designing Champs, I got to author a few open source libraries as well (a websocket client, and an Erlang protocol buffer library). All of this was done over a period of 9.5 months. Many thanks to John O’ Connor who joined me for the last 4.5 of those months and picked things up quickly. Also, shoutouts to Kevin Burg who made pivotal contributions in the summer alone. Many thanks also to the rest of the Champs Engineering team (Harrison Chow, Kevin Xiao, Calvin Hsu, and Timmie Wong) for being awesome to work with and putting up with my mild tantrums when some open source library I found was broken.

Conferences

While at Quark, I was allowed to attend Ricon West 2012, where I was exposed to many of the fundamentals of the CAP theorem and logical monotonicity. This conference has shaped my approach to distributed systems a great deal, and I am grateful for the opportunity. I am fortunate to be speaking at this year’s Ricon West, although I will not be doing it on behalf of Quark. I expect I will learn a lot as before, and hope to contribute some back myself.

I also gave my first conference talk at Erlang Factory SF in March. This was a rather rattling experience, but I found it valuable to bounce ideas off the other speakers. I learned a lot at this conference as well and look forward to seeing the inevitable spike in Erlang usage in the future (or Elixir).


Lessons learned

This is a sort of glop of stuff I think I learned over the last year, although it’s impossible to really distill it down to one (or even a hundred) blog posts.

  • Identify CAP theorem tradeoffs early
  • Cache invalidation really is one of the hardest problems you’ll face
  • If a database company tells you they solve all of your problems for every use case, they are lying
  • If the database company still tries to convince you, move on
  • HTTP request bodies are humongous and expensive
  • Text protocols are useful ONLY if you intend on viewing them a lot and typing them by hand (e.g. a configuration file).
  • You aren’t done when the code is checked in. You’re done when live users are using your code and it works
  • Logging and monitoring are always mandatory
  • If you automate everything you think you need to automate, you haven’t automated enough
  • Putting off automation is almost always the wrong decision and inefficient in the long term (and usually even the short term)
  • Having a layer of separation between Engineering and Product is good when time is critical
  • Crunching long hours makes boring work unbearable. If the work is fun, you’re never crunching.
  • Interviewing is a very very hard problem. Getting the wrong gauge on somebody is the norm.

Bye Quark

To all the individuals I’ve worked with, it’s been a great ride and I’m sorry to go but there are things I want to do. The system is hopefully built to last and I hope you keep in touch.

Where to?

Starting next week (although I’ve gotten a small headstart), I will be working on a rendering engine with Calvin Hsu, an awesome friend + engineer (and ex-Quark employee). There is still a lot I don’t know, and I expect the experience will be extremely humbling. This is what I ultimately want, however. Perpetual challenges that make me question my abilities and strive to learn and grow more as a mathematician, scientist, and programmer. More importantly, the forefront of technology is still just beyond me, and I want to do everything I can to get there and pierce the veil and venture into the void beyond.

Stop Comparing Programming Languages With Benchmarks

If I had one gripe about the programming community, it would be the oft misguided usage of benchmarks to determine what language is THE RIGHT ONE™. Benchmarks are used to justify decisions when they should not be. Benchmarks can lead a person to the wrong decision. So, programming community, this is my humble benchmark-vention to you.

Preliminaries, C pointers

C and C++ are as fast as you’re going to get (and maybe Fortran) for raw computing. To understand philosophically why this is the case, consider the following:

int foo(int *bar) {
    return *bar;
}

All we are doing in this function is accepting a pointer to an integer, then dereferencing it (reading the value that the pointer points to), and returning it. Safe programming would dictate that before dereferencing this pointer, we should (unless we know it to be safe) compare bar with nullptr. The question is, what happens if the pointer bar was not actually pointing to something? The answer is, well, anything! The foo function may return a random number, or crash entirely (and take your program down with it). It will not generate an exception!

Well, why not? Because the mantra of C and C++ is performance; it will not do any more than the user asked it to. Here, when I implemented the foo function, I didn’t ask for a check that the pointer passed in was valid. So, it didn’t do it! Having implicit error and exception handling goes against the language philosophy, which is one of granting complete control to the user with minimal hand-holding (at runtime).

Why will higher level languages never get faster than C and C++? Well, they can’t by definition. Higher level languages are “higher level” because they provide a safety net around primitives like pointers to protect the programmer or make the code semantically easier to understand. An example of this is garbage collection. There is an intrinsic cost to providing anything beyond performing the operation itself, so C and C++ will always be the “speed of light” so to speak.

Understanding Tradeoffs

There are multiple axes at play when formulating a programming language. What types of expressions and constructs is the user restricted to (what is deemed illegal and what is deemed simply undefined)? How is memory handled? How are errors handled? Should the runtime run natively or in a virtual machine? Let’s tackle each of these separately.

Permissiveness

Languages that are more permissive are either less performant, or more dangerous. Consider Haskell. Haskell is very non-permissive in that all functions written must be pure (side-effect free). Side effects are encapsulated very carefully in state monads (e.g. the I/O monad). As a result, Haskell is considered fairly performant and safe. By sacrificing permissiveness, it is not forced to choose between performance and safety.

On the flip side, consider C++ and Ruby. Both languages place virtually no restriction on what the programmer can do, and they both are forced to make tradeoffs as a result. C++ does nothing to protect the user, and so exchanges safety for performance. Ruby, on the other hand, makes the decision to make everything an object and thus makes the opposite tradeoff, exchanging performance for safety.

Another way of thinking about this is that there is a tension between enforcing constraints at compile time versus runtime. By imposing more constraints, the compiler can opt to generate more optimized code. If there are very few constraints, these constraints must either be checked at runtime, or not checked at all (or something in between).

A permissive language is not necessarily better than a restrictive one and vice versa.

Memory management

Reference counting isn’t free. End of story. Any implementation of a reference-counted pointer must maintain a count of the number of references made to the object. This count must be faithfully incremented each time the pointer enters a new scope, and decremented when it leaves. Each time the pointer leaves a scope, the runtime must also check whether the count has hit zero, at which point it either deletes the object immediately or queues the deletion for the garbage collector later on.

A memory managed language is not necessarily better than a language without managed memory and vice versa.

Error handling

Some languages, like Erlang, choose to be completely fault tolerant. An exception raised on any line of code will be captured somewhere in the process hierarchy (even without the developer making this explicit inline). This is also done in a way that isolates the error from other lightweight processes in the VM. Of course, there is a cost associated with this; there is no free lunch. The correct question to ask is: do the benefits outweigh the costs? As we saw with the foo function above, not having this sort of error handling can lead to insidious bugs in code.

A fault-tolerant language is not necessarily superior to a fault-intolerant one and vice versa.

Native vs virtual execution

There are additional tradeoffs to consider in whether the code must run natively or can run in a virtual machine. The main advantages a VM gives are concurrency primitives (that don’t map directly to OS threads) and the ability to load code very quickly (without compilation), potentially even at runtime (the JVM and the Erlang VM are capable of this, for example). The main disadvantage is that code running in a VM will generally be slower.

Choosing a language the right way

First, outline your requirements and answer the following questions:


  • What is the expected load of my system?
  • What can I afford to pay to support this load?
  • What are the uptime requirements?
  • Can I tolerate my application restarting upon a crash?
  • Do I need high-level concurrency?
  • Is latency more important to me, or throughput?
  • Do I need hot reloading?
  • What platforms do I need to support?
  • How large is my engineering team?
  • Do I prefer one paradigm over another (functional vs OO vs procedural)?
  • Am I comfortable taking on certain compile-time restrictions (like limits on side effects or mutability)?
  • What languages do I know already?
  • What languages does my team know already?
  • Does the language have an ecosystem of libraries that supports the domain I work in?
  • What is the timeline of my project?
  • What kind of hardware will I be using (memory, multicores, GPUs, etc.)?
  • Do I need to run on embedded devices?
  • What is my perceived priority and certainty for all of the above points?

If at any point in answering these questions you are tempted to pull up a benchmark of “X language vs Y language,” please stop. That benchmark is unlikely to help you, because it obscures the tradeoffs made by the slower language – tradeoffs that may actually be important to you.

Instead, sort your responses in priority order. If load is in fact the most important thing, rule out all the languages that are not performant enough and continue (don’t just settle for the fastest immediately). Most of the time, you will end up choosing a language that sacrifices some amount of control for something nice, like lightweight threads or some other critical feature. Heck, you might just like working with somebody that happens to love Python and choose that as a result.

The important thing is that there is a language that is the “optimum fit” for the tradeoffs you are willing to make and you will be 100x happier choosing that language in the long run.

Stop the language wars

Just stop. And unless you actually intend on computing Fibonacci sequences in your product, stop showing me those stupid benchmarks. Also, note that no matter what your answers to the questions above are, the right answer will almost never be PHP (forgive me).

OpenCL in Action: Chapter 2

If you are reading OpenCL in Action and get stuck on section 2.3.3 (titled “Code example: testing device extensions”), it’s because there are several bugs in the code that will manifest themselves at runtime.

Line 30 contains the following line:

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, NULL, &num_devices);

The issue is that the third parameter indicates that this function call should read one device into the fourth parameter, which is NULL. As this call is only trying to retrieve the number of devices available, the simple fix is to replace it with:

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

If you run the code now however, you’ll run into a segmentation fault. Using gdb (or your debugger of choice) and stepping through the code, you’ll find that the segfault happens on line 50:

clGetDeviceInfo(devices[i], CL_DEVICE_ADDRESS_BITS,
                sizeof(ext_data), &addr_data, NULL);

The clear bug here is that the third argument is the size of ext_data, which is a 4096-character array, when it should be the size of a cl_uint (the type of addr_data). Replacing this code with:

clGetDeviceInfo(devices[i], CL_DEVICE_ADDRESS_BITS,
                sizeof(addr_data), &addr_data, NULL);

will fix the problem. I will be posting fixes like this to this blog and committing the changes to this repository.

I may also end up porting all the sample code to use the C++ bindings.

Thoughts on Using Chef

After masquerading as a DevOps engineer for the last few months (largely out of necessity), I thought it would be good to write up my thoughts on the experience as a whole. Much of this will largely be about Chef – how I chose to use Chef and my experience working with it.

Framework Complexity

If you’ve been following the ebb and flow of trends in DevOps, you’ll find that things seem to go in and out of fashion very quickly. Initially, system deploys were largely the realm of system administrators maintaining a mixture of shell and Perl scripts. Occasionally, you’d also find instances of CFEngine in the wild. As the startup movement occurred and developers at leaner and smaller companies found themselves doing operations, many other frameworks such as Chef, Puppet, and Ansible started rising in popularity.

Initially, when I was considering the various options for deploying our stack, shell scripts or small frameworks like Babushka or Sprinkle were very appealing. Chef and Puppet in particular seemed to have a lot of dependencies and the ramp time felt long (they also both suffered from what I consider subpar documentation at the time). Fast forward a few months and I’m a happy Chef user. Why the change?

Why the framework?

When I interview candidates for system administrator or developer operations roles, I am often curious whether they understand why these frameworks are superior to plain scripts in the first place. The answer lies in the fact that the dominant configuration management frameworks are declarative. For the uninitiated, declarative programming is a strange thing.

Declarative programming

In procedural programming, the programmer informs the computer of a series of actions that the computer should perform. Most scripts take this form, as do programs written in a systems language such as C (e.g. Do this, then do that).

Conversely, the object oriented programmer informs the computer about a series of objects that act and interact with each other to accomplish the desired effect (e.g. This is a coffee maker which can brew coffee and can be cleaned).

In functional programming, the programmer defines a collection of functions that are composed and operate on inputs to produce a result (e.g. Here are instructions for taking the mean, and also computing the standard deviation).

Declarative programming is distinct from the others in that the programmer is defining not what the computer should do, but what the computer should be (e.g. There should be a file located at this path which reflects the current time).

Declarative programming is a natural fit for configuration management. Suppose, for example, that I want a haproxy.cfg file to exist at /etc/haproxy/haproxy.cfg with some contents $CONTENTS. If I were to do this with shell scripts, I could just do:

echo $CONTENTS > /etc/haproxy/haproxy.cfg

Easy enough. But what if I also needed the permissions of that file to be 0700? I could modify the script:

echo $CONTENTS > /etc/haproxy/haproxy.cfg
chmod 0700 /etc/haproxy/haproxy.cfg

But what if that file is already created and owned by a different user? What if the file is there already and the state is correct? I wouldn’t want to repeat the operation again each time, so I’d have to write some conditional logic to check the diff against the contents I expect. Eventually, this simple script may evolve well beyond simple bash statements into a full blown program.

Imagine instead, if one could just write

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
end

Ignoring the specifics of the syntax, this is a declaration of what some aspect of the system should be. The entire declaration is a declared resource and it is provisioned by the underlying resource provider, in this case, the file. The file resource provider knows how to compute how the current state is inconsistent with the desired state and converge the former to the latter.

Chef

Programming in Chef then, is just a matter of building up a long list of these resources. Most of the resources will be files or packages that are provisioned by the built-in resource providers (file and package are two examples). Chef will then faithfully try to converge the system to the desired declared state, doing nothing if there is nothing to do.

Not understanding how the framework operates can create some gotcha moments. For example, suppose I wanted to have a file at the path /tmp/haproxy_last_update that contains the last time haproxy was updated (contrived I know). Since Chef lets the developer inline Ruby code in the Chef recipe, the neophyte Chef programmer may be tempted to write something like:

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
end

haproxy_update_time = File.mtime("/etc/haproxy/haproxy.cfg")

file "/tmp/haproxy_last_update" do
  content haproxy_update_time.to_s
end

What will actually happen is that an exception will be thrown saying that "/etc/haproxy/haproxy.cfg" does not exist when File.mtime is called. If the file was already created, the contents of /tmp/haproxy_last_update would be the old update time as opposed to the new one with no exception thrown. What gives? The answer lies in how the Chef run is actually performed.

Chef Runs

Chef runs take a machine from one state to another in two phases. First, all the recipes are compiled, and any inline logic is run to produce a queue of resources to be applied; second, the queued resources are applied in order. The issue with the usage of File.mtime above is that it is executed during compilation, before the resource just above it is ever performed. The Chef run will instead do the following:

  1. Push the file "/etc/haproxy/haproxy.cfg" resource onto the queue (does not actually modify the file!)
  2. Call File.mtime and store the result in a local variable
  3. Push the file "/tmp/haproxy_last_update" resource onto the queue (also does not actually modify any files)
  4. Assuming no other resources were declared, run the resource from step 1.
  5. Run the resource from step 3.

Now we see why the recipe behaved so strangely. Ruby logic embedded in a recipe in this manner should assume that the machine has not left its original state, since all resources are merely being pushed onto a queue at the time the logic is invoked. So what’s the right way to write it? In this case, the right thing to do is to move the update-time logic into a resource as well, so that it is invoked after the first resource.

Notifications and Subscriptions

This post is running a bit long, but the last pattern I wanted to discuss is that of recipe events. While it is possible to have tons of Ruby logic in all your Chef recipes to determine which resources should run, or to embed all that logic in ruby_block resources, the correct tool to use is the notification or the subscription.

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
  notifies :run, "ruby_block[/tmp/haproxy_last_update]", :delayed
end

ruby_block "/tmp/haproxy_last_update" do
  block do
    haproxy_update_time = File.mtime("/etc/haproxy/haproxy.cfg")
    File.write("/tmp/haproxy_last_update", haproxy_update_time.to_s)
  end
  action :nothing
end

This snippet will run the second resource only if the first resource actually updated the file during the run. By chaining notifications, Chef runs stay efficient and the recipes themselves stay lean. The best sorts of Chef recipes, in fact, are those with minimal code logic. That is because these recipes are by necessity more idiomatically declarative.

Writing your own resource providers

The last bit to learn once you get comfortable with using existing resource providers is to write your own! This will happen organically over time as you start reusing components across multiple recipes. What will probably happen is that your recipes will start as a simple list of resources. It will then expand as more complicated logic sets in, then contract afterwards as logic is refactored out into custom resource providers.

Conclusion

Having dabbled with Puppet, Ansible, and some other lighter-weight frameworks, I think you can’t go wrong with Chef. There is a hefty upfront learning curve, but I believe it pays dividends in the long run, and a motivated individual can be up to speed in a weekend. I would recommend leveraging Opscode’s free hosted tier heavily for learning purposes. In addition, if you cannot spare a machine to learn Chef on, you can easily set up a virtual machine (I use Vagrant and test-kitchen for this).