Using Ref-qualifiers

Most C++ programmers who have drunk the C++11 Kool-Aid are already familiar with rvalue references. Namely, specifying that a particular function overload acts on rvalues. For example:

struct foo {
    void bar(int &) {
        std::cout << "l";
    }
    void bar(int &&) {
        std::cout << "r";
    }
};

int main() {
    int i = 1;
    foo f;
    f.bar(i);
    f.bar(3);
}

will output “lr”. There is a variation of this overload that applies to the object itself known as the ref-qualifier.

struct foo {
    void bar() & {
        std::cout << "l";
    }
    void bar() && {
        std::cout << "r";
    }
};

int main() {
    foo f;
    f.bar();
    foo{}.bar();
}

This will also output “lr” because the compiler recognizes that f is an lvalue but foo{} is an rvalue (specifically, a prvalue: an unnamed temporary). So now that we know what ref-qualifiers are, why should we use them?

Traversing a DOM-like structure

The example I will be discussing is a Selector object that can traverse the DOM. I chose the DOM because it is a fairly well-known construct in the programming world. For the uninitiated, the DOM is essentially a tree with named nodes. For the sake of this example, we’ll assume that node names are unique (only ids exist; no classes). Also, we’ll assume that everything is a div for simplicity, although this can easily be extended to handle different element types. For example:

<div id="1">
  <div id="2a">
    <div id="3"></div>
  </div>
  <div id="2b"></div>
</div>

The structure of this document looks like

    1
   / \
  2a  2b
 /
3

One way of allowing a user to traverse this is by chaining names. For example:

Selector s(document);
s["1"]["2a"]["3"] = "hello"; // access div with id "3" in div "2a"
s["1"]["2b"] = "world";

Of course, we’d also want to allow the user to cache a selector for faster access.

auto root = s["1"];
root["2a"] = "foo";

This interface is friendly and adheres to the principle of least astonishment (especially because it resembles common JavaScript APIs). An important constraint here is that we are not allowed to store pointers into the tree. This may seem like an artificial constraint; however, this sort of thing can happen if the DOM (or whatever it is) lives in some other black-box module that doesn’t permit such references (due to mutating state and what have you).

A basic implementation might look like

class Selector {
private:
    DOM &_doc;
    std::vector<std::string> _traversal;

    void _traverse() {
        for (const auto &elem : _traversal) {
            _doc.Get(elem);
        }
    }

public:
    Selector(DOM &doc) : _doc(doc) {}
    Selector operator[](const std::string &id) const {
        Selector s(*this);
        s._traversal.push_back(id);
        return s;
    }
    void operator=(const std::string &value) {
        _traverse();
        _doc.Set(value);
    }
};

This implementation assumes methods DOM::Get(const std::string &id) and DOM::Set(const std::string &value). To traverse the DOM, we store a vector of ids that we walk in Selector::_traverse(). Note that the way we allow the operator to be chained is by returning new copies of Selector. If we modified the Selector in place instead, the user wouldn’t be able to cache a selector chain and would have to rebuild the vector of ids each time.

The problem

Obviously, the implementation as is isn’t ideal. To see this, consider the following chain:

Selector one = s["1"];
one["2"]["3"]["4"] = "hi";

Here, the user caches a selector to the “1”-id div. Afterwards, the user attempts to traverse from “1” to “4”, passing through two intermediate nodes. The intermediate selections generate unnecessary copies of Selector that get thrown away immediately! Of course, we could decide to have operator[] modify the Selector in place and forbid caching, but this could create weird side effects that would be exposed to the user.

Selector s{document};
s["1"]["2a"] = "hi";
s["1"]["2b"] = "there";

Here, it is clear what the user is trying to do, but after the first traversal, s has now been mutated so the _traversal vector now contains “1” and “2a”, and the second invocation will fail. Sure, we could provide the user with a means to “reset” the selector but this is also error prone and doesn’t make for a good experience.

A better way… with ref-qualifiers!

It turns out that the information we really need is whether the selector we are applying operator[] to is something the user wants to keep around or something that will be thrown away immediately. And this is precisely what ref-qualifiers let us express!

class Selector {
private:
    DOM &_doc;
    std::vector<std::string> _traversal;

    void _traverse() {
        for (const auto &elem : _traversal) {
            _doc.Get(elem);
        }
    }

public:
    Selector(DOM &doc) : _doc(doc) {}

    // This is new
    Selector operator[](const std::string &id) const & {
        Selector s(*this);
        s._traversal.push_back(id);
        return s;
    }
    Selector &&operator[](const std::string &id) && {
        _traversal.push_back(id);
        return std::move(*this);
    }
    // End new stuff

    void operator=(const std::string &value) {
        _traverse();
        _doc.Set(value);
    }
};

Here is the new implementation. We have now overloaded operator[] to handle the cases where *this is an lvalue and an rvalue respectively, and changed nothing else.

Now, when we revisit the snippet from before…

Selector one = s["1"];
one["2"]["3"]["4"] = "hi";

the evaluation is completely different. The variable one is bound to a name and is an lvalue, so one["2"] produces a new temporary selector via the operator[](const std::string &) const & overload. The next link in the chain, however, applies to that temporary, so the operator[](const std::string &) && overload is called and modifies the temporary Selector in place. This proceeds all the way to the end and we get the desired result.

Use of std::move

My first implementation of this code by hand actually had a bug in it! Namely, the code

Selector &operator[](const std::string &id) && {
    _traversal.push_back(id);
    return *this;
}

doesn’t return an rvalue but an lvalue. This is because even though *this referred to an rvalue, once it entered the scope of this function it got a name, and thus became an lvalue. This happens with function arguments too:

void foo(int &) {
    std::cout << "l" << std::endl;
}
void foo(int &&i) {
    std::cout << "r" << std::endl;
    foo(i);
}

Calling foo(3) here will output “rl” because the second overload of foo is reached first, but the value 3 is then bound to i and so becomes an lvalue. Thanks to Serge Pavlovsky for pointing out my error!

Conclusion

So now we’ve had our cake and eaten it too. In general, the ability to distinguish between an unnamed temporary (rvalue) and an lvalue permits the programmer to exploit optimizations such as this one (performing operations in place vs. creating a pristine copy to work on). This is a common theme with value categories in general, so be on the lookout for such optimizations where they make sense.

This technique came into play when implementing something similar to the DOM selector here in a project called Selene. You can see the actual code in the file include/Selector.h.

Interfacing Lua With Templates in C++11 Conclusion

The code for this project is hosted here

This is the final part of a 3-part series: part 1 – part 2 – part 3

This is the final writeup on my personal attempt to abstract the interface between Lua and modern C++11. The first part discussed the implementation of calling Lua functions from C++. The second part tackled the inverse problem, binding and calling C++ functions from Lua. To conclude the series, I will discuss the approach I take to binding C++ classes to Lua. Because many of the template metaprogramming methods were already discussed in the other two parts, this will be the shortest one of the series.

Lua as a language does not have objects built-in. Rather, it provides a powerful abstraction via tables and metatables. While it is possible to come up with elaborate schemes for mapping objects to Lua tables to support polymorphism, inheritance, and other OO concepts, this would likely require the Selene library to be bundled with a Lua package which is not the goal of the project (currently). As such, the implementation discussed here and provided by Selene opts for the simplest mapping possible. The hope is that this functionality will be sufficient for any user of Selene to implement whatever abstraction they may need on top of it with relative ease.

Following the pattern of the previous posts, we will first consider what the interface will look like as a guide.

Lua-C++-Object Interface

To keep things simple, we’ll allow the user to register specific instances of C++ objects. If the user wishes to instantiate C++ objects from Lua, it is simple to expose a C++ function to do so. Thus, we will not handle construction/destruction and C++ object lifetime, which may unnecessarily complicate usage of the library. The class we will try to bind to Lua is the following:

struct Foo {
    int x;
    Foo(int x_) : x(x_) {}

    int DoubleAdd(int y) { return 2 * (x + y); }
    void SetX(int x_) { x = x_; }
};

C++ does not support any sort of class reflection, so we have no choice but to have the user specify which member functions are to be exposed. This isn’t too problematic, as the user may want to name the functions differently in the Lua context anyway.

Foo foo(0); // Create an instance of Foo with x initialized to 0

// Create a context and register the instance
sel::State state;
state.Register("foo", foo,
               "double_add", &Foo::DoubleAdd,
               "set_x", &Foo::SetX);

// Change foo.x from 0 to 4 from Lua
state.CallField("foo", "set_x", 4);
assert(foo.x == 4);

// Call Foo::DoubleAdd on instance foo
int result = state.CallField<int>("foo", "double_add", 3);
assert(result == 14);

The Register function is overloaded to accept a name to bind an instance to in Lua, a reference to the instance, and a variadic list of alternating function names and pointers.

Implementation

Tackling member functions

The first thing we need is a sel::ClassFun object which allows us to call member functions from Lua. The sel::ClassFun class is actually very similar to the sel::Fun from part 2, the only distinction being that instead of being bound to a global, the sel::ClassFun is bound to a field of our instance table. The syntax for doing this is lua_setfield(l, index, name) where index is the position on the stack where the table resides and name is the name of the field we wish to bind the function to. Recall that the way sel::Fun worked was by binding a closure with a pointer to the sel::Fun instance as an upvalue. The same approach works here.

Next, we’ll need a sel::Class object to contain the instances of sel::ClassFun. The skeleton of the class looks like:

Class.h
namespace sel {
struct BaseClass {
    virtual ~BaseClass() {}
};

template <typename T, typename... Funs>
class Class : public BaseClass {
private:
    std::vector<std::unique_ptr<BaseFun>> _funs;

public:
    // TODO: constructor
};
}

To implement the member function registration, let’s start from the bottom by providing a function which registers only a single function to sel::Class.

template <typename Ret, typename... Args>
void _register_fun(lua_State *state,
                   T *t,
                   const char *fun_name,
                   Ret(T::*fun)(Args...)) {
    std::function<Ret(Args...)> lambda = [t, fun](Args... args) {
        return (t->*fun)(args...);
    };
    constexpr int arity = detail::_arity<Ret>::value;
    _funs.emplace_back(
        new ClassFun<arity, Ret, Args...>
        {state, std::string{fun_name}, lambda});
}

The usage of the lambda allows us to convert a member function to a non-member function by binding it to the pointer to the instance, t. This is what allows us to leverage existing code used to describe sel::Fun. The detail::_arity struct is simply a type trait struct which looks at the return type to deduce how many values the function is expected to return (0 in case of void, 1 in case of a single value, or N in case of a size-N tuple). Its definition is as follows:

namespace sel {
namespace detail {

template <typename T>
struct _arity {
    static constexpr int value = 1;
};

template <typename... Vs>
struct _arity<std::tuple<Vs...>> {
    static constexpr int value = sizeof...(Vs);
};

template <>
struct _arity<void> {
    static constexpr int value = 0;
};

}
}

Since the user may opt to bind multiple member functions, we’ll need to provide a function that recursively unpacks the arguments while making calls to sel::Class::_register_fun. Call it sel::Class::_register_funs for simplicity.

// base case
void _register_funs(lua_State *state, T *t) {}

template <typename F, typename... Fs>
void _register_funs(lua_State *state, T *t,
                    const char *name, F fun,
                    Fs... funs) {
    _register_fun(state, t, name, fun);
    _register_funs(state, t, funs...);
}

This is a slightly different pattern from what we’ve seen before. Notice that the function will recursively peel off two arguments at a time from the parameter pack, expecting the pack to be of the form const char * and F alternating. Code that attempts to call this function with an odd number of arguments will fail to compile.

With the _register_funs function, implementing the constructor of sel::Class is fairly trivial, requiring a call to lua_createtable and a call to _register_funs.

Class registration

The only thing left is to actually register the class with our sel::State object. Because the function requires a reference to the object instance, we can overload our existing sel::State::Register method.

template <typename T, typename... Funs>
void Register(const std::string &name, T &t,
              Funs... funs) {
    Unregister(name); // ensures the name is not used elsewhere
    auto tmp = std::unique_ptr<BaseClass>(
        new Class<T, Funs...>{_l, &t, name, funs...});
    _objs.insert(std::make_pair(name, std::move(tmp)));
}

An interesting point to note is that in declaring a new instance of sel::Class, the type arguments T and Funs... needed to be passed explicitly because template arguments cannot be deduced from arguments passed to the constructor.

Conclusion

And that’s a wrap for this series! If you’ve been following along, you now have a reasonably complete understanding of how Selene is implemented, sans a few more abstractions that have made the implementation easier in some cases. The library is still a work in progress, but it is slim enough that you should feel comfortable using it, experimenting with it, and perhaps adding your own functionality. The main takeaway is that clean interfaces can take a lot of work, but they are absolute time-savers in the long run by abstracting away unnecessary boilerplate. If you come across similar boilerplate in the wild, hopefully some of the tools and patterns presented in this series will help you. As always, feel free to leave a comment or question.

Interfacing Lua With Templates in C++11 Continued

The code for this project is hosted here

This is the second part of a 3-part series: part 1 – part 2 – part 3

If you’ve read the previous post on interfacing Lua with templates, you saw how variadic templates allowed us to avoid a lot of boilerplate in dealing with the Lua stack when calling Lua functions from C++. We started by imagining an ideal interface and came up with something like lua_state.Call<int>("add", 3, 5) to call a 2-arity Lua function bound to "add" and return the result. The Call function was variadic in both the return type and the arguments. To continue our development of this library (which I am calling Selene), we need to decide how to bind C++ functions to Lua.

Envisioning the interface

First, let’s see how Lua expects us to register a C function.

without_selene.cpp
/* include necessary Lua headers
    ....
*/

int my_add(lua_State *l) {
    float x = luaL_checknumber(l, 1); // reads a value from the stack,
                                      // verifying it is a number
    float y = luaL_checknumber(l, 2);
    const float result = x + y;
    lua_pushnumber(l, result);
    return 1; // return the number of values we pushed
}

int main() {
    // open a lua context
    lua_State *l = luaL_newstate();

    // bind the function pointer to global "c_my_add"
    lua_register(l, "c_my_add", &my_add);
}

This doesn’t look too bad, but adhering to this interface is certainly not ideal. Every function we write that can be passed to lua_register must be of the form int (*)(lua_State *l), which is aliased to lua_CFunction. In addition, each function we expose to Lua in this way must manually push to and read from the stack, checking types in both directions. Also, because the Lua API requires us to pass in a C-style function pointer, we can’t use the modern std::function, lambdas, or traditional functor objects.

What this amounts to is boilerplate city! What could we envision this code looking like? Personally, I fancy something resembling the following:

with_selene.cpp
float my_add(float x, float y) {
    return x + y;
}

int main() {
    sel::State state;

    // Register function pointer
    state.Register("my_add", &my_add);
    float answer = state.Call<float>("my_add", 20, 22);
    assert(answer == 42);

    // Register lambda
    const std::string greeting = "hello";
    std::function<std::string(void)> fun = [greeting]() {
        return greeting;
    };
    state.Register("greet", fun);
    assert("hello" == state.Call<std::string>("greet"));
}

Much better! With an interface like this, we don’t have to worry about interacting with the Lua stack at all. This allows our function signatures to have actual return types and argument types as well. Also, we can register functions with state, like the lambda in the above example or with an std::function created from a functor object. So now we just need to implement it. Hold on tight!

Design Discussion

Because we want our registered functions to possibly capture state as closures or be temporaries even, we’ll want to create a new class that encapsulates it. Let’s call it sel::Fun. For the implementation, we note that using lua_register which in turn calls lua_pushcfunction is a bit tricky because we would need to pass a function pointer with signature int (*)(lua_State *). If we use a lot of template magic, we could create such a function if the actual worker function was known at compile time. However, this would not allow us to use functor objects, lambdas, and the like. Note that std::function is not compatible with ANSI C function pointers and attempting to cast them as such leads to madness (I actually tried this. I’m telling you, don’t try it. But if you figure it out please talk to me).

Fortunately, Lua allows us to pass what is called a C closure. That is, in addition to passing the function, we specify and push to the stack a number of upvalues which each invocation of the function will have access to later. Here is an example of this:

upvalues.cpp
int my_counter(lua_State *l) {
    int val = lua_tointeger(l, lua_upvalueindex(1)); // gets the first upvalue
    lua_pushinteger(l, ++val); // return the incremented value
    lua_copy(l, -1, lua_upvalueindex(1)); // update the upvalue
    return 1; // return the number of values we pushed
}

int main() {
    // open a lua context
    lua_State *l = luaL_newstate();

    lua_pushinteger(l, 0); // initialize the counter to 0
    // push the closure with a single upvalue
    lua_pushcclosure(l, &my_counter, 1);
    // name the closure
    lua_setglobal(l, "my_counter");
}

With this, we come up with the following strategy:

  1. Accept anything that is a function pointer or instance of std::function.
  2. Create an instance of sel::Fun to house the callable entity, templatized by the return type and arguments of the passed function (call it _fun).
  3. In the constructor of sel::Fun, create a C closure with the address of this object as the upvalue and a static dispatch function as the C closure (call it _lua_dispatcher).
  4. Implement in sel::Fun an Apply method which will be called from _lua_dispatcher
  5. Within the Apply method use variadic templates and template metaprogramming to retrieve arguments from the stack, call _fun, the original function, and push the returned values back onto the stack.

There are a number of other implementation details regarding sel::Fun ownership, the unregistration of functions, properly handling destruction of the parent sel::State object, and more but this should keep us occupied for now.

The dispatcher

The implementation of the dispatcher is very simple.

namespace sel {
namespace detail {
int _lua_dispatcher(lua_State *l) {
    BaseFun *fun = (BaseFun *)lua_touserdata(l, lua_upvalueindex(1));
    return fun->Apply(l);
}
}
}

The dispatcher retrieves the upvalue, which is a pointer to the sel::BaseFun that sel::Fun inherits from, and invokes the Apply method. Note that the lua_State pointer is supplied fresh on each call, so it must be threaded through.

The sel::Fun object

Because we expect sel::Fun to be a template class, we should create a base class to allow us to create containers of sel::Funs and for _lua_dispatcher to work. Assume that everything that follows is part of the sel namespace.

Fun.h
struct BaseFun {
    virtual ~BaseFun() {}
    virtual int Apply(lua_State *l) = 0;
};

This is our barebones abstract class. Now for the main star:

Fun.h
template <int N, typename Ret, typename... Args>
class Fun : public BaseFun {
private:
    using _fun_type = std::function<Ret(Args...)>;
    _fun_type _fun;
    std::string _name;
    lua_State **_l; // Used for destruction

public:
    Fun(lua_State *&l,
        const std::string &name,
        Ret(*fun)(Args...)) : Fun(l, name, _fun_type{fun}) {}
    Fun(lua_State *&l,
        const std::string &name,
        _fun_type fun) : _fun(fun), _name(name), _l(&l) {

        // Add pointer to this object to the closure
        lua_pushlightuserdata(l, (void *)static_cast<BaseFun *>(this));

        // Push our dispatcher with the above upvalue
        lua_pushcclosure(l, &detail::_lua_dispatcher, 1);

        // Bind it to the specified name
        lua_setglobal(l, name.c_str());
    }
    Fun(const Fun &other) = delete;
    Fun(Fun &&other)
        : _fun(other._fun),
          _name(other._name),
          _l(other._l) {
        other._l = nullptr;
    }
    ~Fun() {
        // This destructor will unbind the name in lua, provided that
        // this object has not been moved and the parent context is
        // still alive (hence the pointer to the pointer)
        if (_l != nullptr && *_l != nullptr) {
            lua_pushnil(*_l);
            lua_setglobal(*_l, _name.c_str());
        }
    }

    // TODO implement Apply
};

This is a big chunk of code to take in all at once but hopefully the inlined comments alleviate some of the potential confusion. The important bit is in the constructor where we pass our generic dispatcher along with the pointer as a closure to lua. The remaining bits are implementation details that take care of what happens if the parent state is destroyed or if this function object is itself moved or destroyed. All that’s left is the Apply method.

Implementing sel::Fun::Apply

The order of operations we’ll need to handle are:

  1. Read all arguments from the stack into a tuple of type std::tuple<Args...>
  2. Call the member function _fun with those arguments by converting the tuple back into a parameter pack.
  3. Push the returned value(s) onto the stack.
  4. Return the number of values returned so Lua knows how many values to pop on its end.

First, let’s implement a helper function called _get_args. We could use recursion like I did in the last post, but I’ll use a different technique here since it’s arguably faster (both at compile and runtime). My implementation of Pop<std::tuple<T...>> has also changed but I will leave the other post unchanged since the type recursion technique is a good one to know.

The technique we will use is a common pattern called tag dispatching, where we pass an empty struct whose type encodes important information. In this case, we want to bind the nth element on the stack (counting from the bottom) to the nth element of the tuple. Thus, we’d like to expose to our function something whose type carries a parameter pack of the form <std::size_t... N>. There are many ways to do this, so what follows is just what I find easiest.

namespace detail {
// Our "tag"
template <std::size_t... Is>
struct _indices {};

// Recursively inherits from itself until...
template <std::size_t N, std::size_t... Is>
struct _indices_builder : _indices_builder<N-1, N-1, Is...> {};

// The base case where we define the type tag
template <std::size_t... Is>
struct _indices_builder<0, Is...> {
    using type = _indices<Is...>;
};
}

We can create a dummy struct of type _indices<0, 1, 2, 3>, for example, by writing typename _indices_builder<4>::type(). The typename keyword is required because dependent names are assumed to refer to values rather than types by default. Note also that our index tag is 0-indexed. Now, we can implement our _get_args function.

namespace detail {
template <typename... T, std::size_t... N>
std::tuple<T...> _get_args(lua_State *l, _indices<N...>) {
    // Assume existence of function _check_get<T>(l, i) which will
    // read the ith element in the stack, ensure it is of type T, and
    // return it
    return std::make_tuple(_check_get<T>(l, N+1)...);
}

template <typename... T>
std::tuple<T...> _get_args(lua_State *l) {
    constexpr std::size_t num_args = sizeof...(T);
    return _get_args<T...>(l, typename _indices_builder<num_args>::type());
}
}

Notice the N+1 in the make_tuple expansion, which accounts for the fact that Lua stacks are 1-indexed from the bottom. Now that we have a tuple with all the correct arguments read from the stack, we can call the function. Unfortunately, we have a tuple rather than the parameter pack we need to actually call the function. Let’s write a function that can do that conversion for us.

namespace detail {
template <typename Ret, typename... Args, std::size_t... N>
Ret _lift(std::function<Ret(Args...)> fun,
          std::tuple<Args...> args,
          _indices<N...>) {
    return fun(std::get<N>(args)...);
}

template <typename Ret, typename... Args>
Ret _lift(std::function<Ret(Args...)> fun,
          std::tuple<Args...> args) {
    return _lift(fun, args, typename _indices_builder<sizeof...(Args)>::type());
}
}

The function std::get<I> extracts the Ith element of a tuple, so fun(std::get<N>(args)...) will correctly call the function with the contents of the tuple as a comma-separated argument list. Also, as before, we provide an overload which automatically dispatches the necessary index tag. Neat!

The last step of our apply function is to push the returned arguments back to the stack. We can use the techniques at our disposal to write a function of the form detail::_push(lua_State *l, std::tuple<T...> values) and overload it for singular values but I will leave this as an exercise to the reader. Finally, we can piece everything together and write sel::Fun::Apply:

class Fun {
// ... as before
int Apply(lua_State *l) {
    std::tuple<Args...> args = detail::_get_args<Args...>(l);
    _return_type value = detail::_lift(_fun, args);
    detail::_push(l, std::forward<_return_type>(value));
    return N;
}
};

Registering the function

We still haven’t exposed our sel::Fun class to the wild yet. For now, we’ll allow the user to call Register, a new public method in sel::State, with a name and a callable entity. The callable entity can be a function pointer or an std::function. I provide the implementation for the std::function case below:

template <typename Ret, typename... Args>
void Register(const std::string &name,
              std::function<Ret(Args...)> fun) {
    auto tmp = std::unique_ptr<BaseFun>{
        new Fun<1, Ret, Args...>{_l, name, fun}};
    // insert Fun object in a member std::map<std::string,
    // std::unique_ptr<BaseFun>>
    _funs.insert(std::make_pair(name, std::move(tmp)));
}

We’d also need to handle the case when Ret is a tuple type so we can properly pass the first template argument to Fun (the number of return values).

Wrapping up

With this, we have almost everything we need to pass arbitrary functions to Lua, and we can do things like call a function from Lua which calls a C++ function which calls a Lua function, and so on. See the unit tests for some concrete examples. From here, we need to add support for C modules, coroutines, and continuations. We are also still missing an abstraction for dealing with table types. However, a lot is already possible with what is provided, because std::function is such a powerful abstraction. Feel free to follow the project and let me know if you have any comments, suggestions, or if you found any of this helpful.

Interfacing Lua With Templates in C++11

The code for this project is hosted here

This is the first part of a 3-part series: part 1 – part 2 – part 3

“Why do I need to know this?”

– random C++ programmer

Template programming in C++11 is often met with furrowed brows. Techniques associated with template programming are often written off as something for the experts to worry about. While integrating Lua with the game engine I’ve been writing, I came across a nifty and useful way to use some of the new template features in C++11. Other bindings to Lua exist but I opted to write my own because the others didn’t quite do what I wanted them to (mine is still a WIP, hosted on Github here). The goal of this post is to provide a use case for template metaprogramming in C++ and showcase some of the new C++11 syntax for variadic templates (templates with a variable number of arguments). Hopefully, it will motivate you to delve even deeper into metaprogramming if you haven’t already. To read this, I assume you already know what a template is and you are somewhat comfortable with C++ syntax.

The Lua C API

To understand the problem, we need to first understand the Lua C API. C (or C++) and Lua communicate through a stack. Invoking a Lua-function from C involves the following steps:

  1. Push the function name to the stack
  2. Push each function argument to the stack
  3. Invoke lua_pcall, supplying the number of arguments and the number of return values
  4. Pop the returned values from the stack

Because the Lua API is C based, it provides a different function for each type that can be pushed or popped from the stack (for example, lua_tointeger and lua_pushboolean). Performing the above steps can be really cumbersome and result in a lot of code. Suppose we had a simple Lua file that looked like:
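
A script matching that description might look like this (a reconstruction; the function name subtract_and_hello is taken from the call shown later in this post):

```lua
function subtract_and_hello(a, b)
    return a - b, "hello"
end
```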

This is a Lua function that takes two arguments, subtracts them, and returns the difference along with a string. Calling this function with the basic syntax might look like this:
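
Following the four steps above with the raw API, the call site might look something like this sketch (error handling elided, state setup assumed):

```cpp
lua_getglobal(l, "subtract_and_hello");    // 1. push the function
lua_pushinteger(l, 1);                     // 2. push each argument
lua_pushinteger(l, 3);
lua_pcall(l, 2, 2, 0);                     // 3. call: 2 arguments, 2 results
int difference = lua_tointeger(l, -2);     // 4. read the returned values...
std::string greeting = lua_tostring(l, -1);
lua_pop(l, 2);                             //    ...and pop them
```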

This is a lot of code to call one function! Not only that, but there are plenty of ways it can go wrong. We could have mixed up the types, or popped or pushed values in the wrong order. We may have also forgotten to pop the correct number of values at the end.

Imagining the perfect interface

Here is how I’d want to call a function in Lua. First, I’d want a C++ object to encapsulate the lua_State pointer and automatically free it when it goes out of scope. Call it State for simplicity. The improved code with our hypothetical interface might look like this:

Cleaner, huh? All the magic happens in the line

std::tie(difference, greeting) = l.Call<int, std::string>("subtract_and_hello", 1, 3);

This binds a returned tuple of type std::tuple<int, std::string> to the int and std::string on the left. Note that nowhere did we need to specify the number or types of arguments or returned values. All of that was inferred from the function call itself.

Implementing sel::State

So now that we’ve envisioned a cleaner, more usable, and less error-prone interface, we need to set about implementing it. Let’s start with a shell of the class definition.

This gives us an object that, when instantiated, will create a Lua context and automatically free it when its destructor is invoked. It also prohibits copying while permitting moving. If you aren’t familiar with move semantics, you don’t need to understand them to follow along, but I highly recommend learning the topic.

Push and Read

Now we need to be able to interface with the stack. To push to the stack, we opt for the simple approach of functions overloaded by argument.

void Push(const int value);
void Push(const bool value);
...

A sample implementation for one of these might look like:

void State::Push(int value) {
    lua_pushinteger(_l, value);
}

For reading, we can’t use the same approach because C++ does not allow function overloading based on return type (only a few languages do). Thus, we rely on a template method with signature:

template <typename T>
T Read(int index) const;

To use this function, we’ll need to provide instantiations for all the basic types supported by the API. I provide two examples below.

template <>
float State::Read<float>(int index) const {
    return lua_tonumber(_l, index);
}
template <>
bool State::Read<bool>(int index) const {
    return lua_toboolean(_l, index);
}

With these functions, we can now invoke Push(4) and Read<int>(1) to push 4 onto the stack and read it back.

Handling multiple values at a time

Our next step is to make it convenient to push and pop multiple values at a time. Let’s handle pushing first. To allow something like Push(1, "hi", true), we add the following template function alongside the typed Push overloads above:

template <typename T, typename... Ts>
void Push(const T value, const Ts... values) {
    Push(value);
    Push(values...);
}

The ellipses (...), if you haven’t seen them before, signify a parameter pack, making this a variadic template function. This particular function can take a variable number of arguments but must take at least one. The compiler will pattern-match the first argument to const T value, and the function recursively calls itself until it reaches one of the specific overloads from the previous section.

Now, we can do something like

int main() {
    sel::State state;
    state.Push(1, "hi", true);
}

and the stack will contain an integer, a string, and a boolean. Now, we need to implement multi-value reads. Since we usually want to pop these values afterward, we’ll just implement it as a multi-value pop. This is a bit trickier because, as before, we can’t rely on argument overloading since the varying types are embedded in the return value. In addition, we’d like the function to return a boxed tuple if multiple values are requested, and an unboxed normal value if a single value is requested.

To do this specialization, we’ll use a sort of intermediate struct to differentiate between the two cases. This is called a type-traits class.

// General type trait struct
template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;
    static type apply(State &s) {
        // todo
    }
};

// Specialization for singular type
template <typename T>
struct _pop<T> {
    typedef T type;
    static type apply(State &s) {
        // todo
    }
};

template <typename... T>
typename _pop<T...>::type Pop() {
    // Store the elements requested
    return _pop<T...>::apply(*this);
}

Notice that the return type of Pop is embedded in the _pop helper struct, which the compiler chooses based on the number of template arguments passed. The T... in the invocation of apply will expand out to all the template arguments that were passed in originally.

Implementing the singular case of _pop<T> is relatively straightforward:

template <typename T>
struct _pop<T> {
    typedef T type;
    static type apply(State &s) {
        // Read the top element (negative indices count from the top)
        T ret = s.Read<T>(-1);

        // Remove it from the stack
        lua_pop(s._l, 1);
        return ret;
    }
};

For the multi-valued _pop, we’ll need to use recursion to build a tuple. A first try at an implementation might look like:

template <typename T>
T PopBottom(State &s) {
    // Read the bottom element
    T ret = s.Read<T>(1);

    // Remove the bottom element
    lua_remove(s._l, 1);

    return ret;
}

template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;
    static type apply(State &s) {
        // PopBottom removes each element as it reads it
        return std::make_tuple(PopBottom<Ts>(s)...);
    }
};

However, this doesn’t work. To see why it might seem plausible, note that std::make_tuple(PopBottom<Ts>(s)...) will expand to

std::make_tuple(PopBottom<T1>(s), PopBottom<T2>(s), ..., PopBottom<Tn>(s));

if there are n template arguments passed. While this would work if the arguments were guaranteed to evaluate from left to right, the C++ standard makes no guarantees about the evaluation order of function arguments. In fact, g++ v4.8.2 will evaluate each argument to make_tuple from right to left, resulting in a tuple that is the reverse of what we want! Thus, we need to use recursion to enforce the order of evaluation.

template <typename... Ts>
struct _pop {
    typedef std::tuple<Ts...> type;

    // base case: creates a tuple containing one element
    template <typename T>
    static std::tuple<T> worker(const State &s, const int index) {
        return std::make_tuple(s.Read<T>(index));
    }

    // inductive case
    template <typename T1, typename T2, typename... Rest>
    static std::tuple<T1, T2, Rest...> worker(const State &s,
                                              const int index) {
        std::tuple<T1> head = std::make_tuple(s.Read<T1>(index));
        return std::tuple_cat(head, worker<T2, Rest...>(s, index + 1));
    }

    static type apply(State &s) {
        auto ret = worker<Ts...>(s, 1);
        lua_pop(s._l, sizeof...(Ts));
        return ret;
    }
};

Recursion is a must here since there is no way to iterate through a parameter pack. If you’ve programmed in Lisp, Haskell, Erlang, or something similar, the code above will look pretty familiar. Note that the template type signature <typename T1, typename T2, typename... Rest> is important to signify a function that has at least 2 template arguments since a parameter pack can have zero or more elements. If we used <typename T1, typename... Rest> instead, the code would fail to compile since it would have an ambiguous overload with the previous function with template type signature <typename T>.

With this, we can now pop multiple values from the stack!

int main() {
    sel::State state;
    state.Push(1, "hi", true);

    int a;
    std::string b;
    bool c;
    std::tie(a, b, c) = state.Pop<int, std::string, bool>();

    assert(a == 1 && b == "hi" && c);
}

Implementing Call

Finally, we can implement the Call member function to do what we initially intended.

template <typename... Ret, typename... Args>
typename _pop<Ret...>::type Call(const std::string &fun,
                                 const Args&... args) {
    // Push function to stack
    lua_getglobal(_l, fun.c_str());

    const int num_args = sizeof...(Args);
    const int num_ret = sizeof...(Ret);

    // Push all arguments to the stack using our variadic Push
    Push(args...);

    // Call the function
    lua_call(_l, num_args, num_ret);

    // Return all the results and remove them from the stack
    return Pop<Ret...>();
}

This time, we have a mixture of variadic template arguments and function arguments. All the types and values passed after the function name will be bound to the type parameter pack Args... and the value parameter pack args.... Again, we use the built-in sizeof... operator to determine the number of arguments and return values (sizeof... is easy to implement yourself as an exercise, by the way).

This gets us most of the way there, but Call will fail in a critical way if we try to call a Lua function with no arguments. To see why, we recognize that args... will be an empty parameter pack so

Push(args...);

will be invoked as

Push();

which we never implemented (we implemented Push taking at least one argument). To circumvent this, we can simply implement Push():

void Push() {}

Returning zero arguments

Astute readers will note that our implementation will fail if we try to call a function returning no results. This is because our Pop function expects to do some work and will return either an unboxed value or a tuple. Thus, we’ll need to modify the Pop function and the _pop type-trait struct to accommodate this.

There are several ways to do this, but this way felt the most natural to me (feel free to comment if you think otherwise). First, we add a non-type template parameter to _pop that denotes the number of values we expect to pop. The base struct signature becomes

template <size_t, typename... Ts>
struct _pop {
    ...
};

Our existing partial specialization occurs when this size_t is 1.

template <typename T>
struct _pop<1, T> {
    ...
};

Now, we need to add a new partial specialization for when the size argument is 0:

template <typename... Ts>
struct _pop<0, Ts...> {
    typedef void type;
    static type apply(State &s) {}
};

Now, our Pop function can return nothing if no types are required with a slight implementation tweak:

template <typename... T>
typename _pop<sizeof...(T), T...>::type Pop() {
    return _pop<sizeof...(T), T...>::apply(*this);
}

Perfect forwarding

There is one slight defect in the implementation that I will discuss briefly but not fully outline. In this implementation, when parameter packs are forwarded from one function to another, the constness is preserved but not the lvalue/rvalue property. This isn’t too critical here since most of the types we are passing are basic types, with the exception of std::string. However, later on we may wish to pass C++ objects that represent tables, and preserving lvalue/rvalue properties will be important for performance, lest we create unnecessary copies everywhere.

The syntax to perfectly forward a parameter pack looks like

template <typename... T>
void foo(T&&... args) {
    bar(std::forward<T>(args)...);
}

This way, if an lvalue is passed in, it will remain an lvalue when threaded through to bar. Similarly, if an rvalue is passed in, it will be passed as an rvalue as well.

Wrapping up

After all the type-fu above, we finally have a class that provides us the clean, type-aware Call method we were looking for. As mentioned in the beginning, the code for this set of Lua bindings is still in progress but can be found on my github here. I don’t expect this overview to be at all comprehensive, but I hope at the very least that it has demonstrated to some degree what the C++11 type system is capable of. As a final exercise, I highly recommend looking at the disassembly the compiler produces whenever you metaprogram, to understand the runtime characteristics of the code.

The Difficulty of Writing Fast Programs

Item 1: Writing fast programs is hard

Writing fast programs is hard because your semantics are generally useless now. Gone are niceties like dynamically managed memory, reference counted pointers, garbage collection, and any other constructs to hold your hand. Your program is going to look ugly. Polymorphism isn’t your friend. Every feature hates you. N.B. it’s not actually this bad usually, but it can certainly feel this way sometimes.

Item 2: Designing fast programs is hard

Even harder than writing the fast program is conceiving it in the first place. You’ve just come up with this brilliant encoding scheme for your data? What do you do when your ideal word size is actually 65 bits? Came up with the perfect GPU kernel code? Sorry, it actually requires one more SIMD lane than you have, or one fewer.

Designing fast programs is hard because you need to think at the hardware level, and you know what? Those hardware designers definitely didn’t have your problem in mind when they made their chip.

Item 3: Testing fast programs is hard

Ugh. So you’ve spent all that time tuning this stallion of a war horse program? It only works sometimes but SEGFAULTS randomly? Well why don’t you unit test better?

Oh, right. If your code is structured to be unit-testable, it’s unlikely that it’ll be blazingly fast too. Unless you’re some sort of demigod that can write testable fast code. Your code will refuse to orient itself in any unit testable fashion, and sure enough, when you refactor the world because of some cache miss in some inner loop, all your tests will commit seppuku.

Item 4: Making programs fast is hard

Your profiling can be as useful as a boy scout compass on a steel ship. Your code runs great on your machine, but flows like molasses on your friend’s machine (with newer hardware). You manage to make your program fast, but the inclusion of one extra element in your data set causes some anomalous branching that slows everything to a halt again. Worst of all, your program is slow because you failed at the design, and the thing you optimized to death turned out to be the least significant factor.

The future of writing fast programs

My hope lies in something similar to Haskell. Give me a smart compiler that can optimize the code for me, decide how to execute it, and do the right thing. That lets me manage my own memory in a type safe way and will complain loudly if there is a possibility of leakage. That issues a warning if I may incur a cache miss because my data isn’t aligned properly. Better yet, it aligns the data for me. Give me a compiler that is largely functional but that inlines aggressively so I can have unit testable code without needing to restructure the entire program around my tests. Best of all, give me a compiler that automagically vectorizes my code and compiles the relevant pieces to GPU kernel code, all while keeping memory bandwidth in mind and suggesting changes to allow me to push more code to the GPU if I so choose.

Until I get this magical compiler, I suppose I will get cozy with C++ in the meantime.

Goodbye Quark, Hello World

I gave my notice to Quark Games Inc almost two Fridays ago with mixed feelings. While there, I had many opportunities to achieve, learn, contribute, and teach. I am unquestionably a better engineer now than I was a year ago, and I have Quark to thank for that; however, I am also optimistic about what the future holds. This post seeks to document what I’ve done in the recent past and discuss where I want to go in the future.


Cassandra Hadoop Programming

My first task as an intern was to work on a BI system built on Cassandra (using DataStax as the provider). I wrote a number of Pig/Hive map reduce jobs and became familiar with the map reduce framework (thanks Alex Tang for spearheading this and showing me the MR ropes). I also leveraged preexisting knowledge to write a number of R scripts to generate cool visuals. Unfortunately, the system proved a bit expensive to use due to the overwhelming volume of data and the inability to sample the data effectively.

Valor guild request/invite system

This was my first server feature on the Rails web server. Many thanks to Albert Wang (a product manager at the time) for helping me with my questions and familiarizing me with the codebase. Also thanks to Eric Liaw, Tyler Margison, and Juan Delgado for answering my questions when I was pestering Albert too much.

Implement a Git Rebase Workflow

Code reviews were happening at a snail’s pace due to over the shoulder reviewing. I introduced the Pull Request system (actually leveraging the distributed portion of Git as a DVCS) and the git rebase workflow. The team was very receptive and quick on the uptake.

Write Deploy Scripts for Valor

I was a fledgling at automation at the time but I knew that things should be more automated than they were. Because Valor existed across many deployed environments, we needed a way to manage and know, a priori, what code was deployed where. I began to learn more about the deployment philosophies of Engine Yard and Chef. Again, the team was quick to adopt the new methodology and there was a bit more sanity in the QA process as a result.

Valor Competitions

I implemented Valor competitions shortly afterward. This was more or less my last task as an intern before I was converted to full time as a Senior Engineer. As a feature, competitions were not very difficult but taught me a lot about the product feature cycle. Many things were pushed to last minute and compromised the timeline as a result. I also learned (again) how hard automation was. Scoring a competition in particular, was a batch job requiring tens of thousands of database queries. Writing this in a way that it could finish without f*cking everything up was quite challenging.

Map Data Encoding

Everything was JSON, and considering that the bulk of our data was small single digit numbers, you can imagine the size of our request and response bodies. Ginormous! Our company hosted an awesome hackathon (thanks Carol Liaw!) at which time I had an idea to encode our map data as binary to allow players to potentially zoom out and in to maps. The most elegant way I could think of to encode the data was in a lossless PNG, mainly because the (x, y) coordinates of the world mapped well to each pixel, and I calculated that I could store all the data I needed in four 8-bit channels. Plus, you could easily drop in an existing PNG library to query pixel colors (a fast operation). The prototype was completed thanks to Calvin Hsu, my partner in crime and it eventually found its way into production several months later (I had moved on to a different team by that time).

Prototype Valor in Erlang

Valor was plagued by scalability issues and one of my thoughts was to rewrite some of the core game logic. At the time (and still), the core logic used lazy evaluation in Ruby and required a number of lookups to succeed. This was awful for user response time, and also server costs. My main thought was to write the heavy logic in a more performant language that made it easy to evaluate things eagerly. This was my first exposure to Erlang, although I had dealt with functional languages before (Haskell, R, Lisp, ML, etc). I actually evaluated a number of other solutions for the task. The top contenders were Clojure, Scala, Java, and C++. Of these, Erlang seemed like the one that could produce results, even for a fast prototype. I managed to build a system that would easily handle the load, with no exceptions or errors for a prolonged period of time (8 hours for my first test) on a single machine!

Alas, I never got to implement this system in the wild, but the experience was valuable nonetheless. Instead of optimizing the game logic, a decision was made to optimize the database layer instead (replacing memcached instances with Couchbase). Couchbase would prove to be a horrible choice; more on this later.

Port Valor from Rails 2.3.5 (Ruby 1.8.7) to Rails 3.2.12 (Ruby 1.9.3)

This was done from a Friday morning to Monday morning. I saw the sunrise on Sunday. The change was over 10000 lines of pure hell, filled with gem incompatibilities, library changes, API changes, the works. I couldn’t even manage to get the application to boot until Saturday evening (this is after 22 hours of work). I had also worked fast on Thursday to finish things early so I could work on this “side project” on Friday without anybody missing a beat. Fortunately, things worked and the code itself was deployed a week and a half later. I was fairly pleased with this accomplishment because it is one of the few things I started that I had no idea if it would work out or not. I was fully prepared however, to utterly fail by Monday in exchange for what would hopefully be an interesting learning experience.

What did I learn? Well, first and foremost, keep up to date with the latest and greatest if you can. Having your stuff break early and often is definitely the way to go. Also, tests are great. Why? Because the codebase at the time had ZERO tests, which made making sweeping changes a nightmare. Incidentally, tests were something I began pushing for around this time. Finally, I learned to read the Rails source code. I’ve more or less sworn off using Rails again, but understanding how a large framework was built would prove useful when it came time for me to build one myself.

Many thanks to Eric Liaw for helping me see through this upgrade. Also, thanks to Tyler Margison for helping me catch basically the only bug that happened on production (old Memcached keys were not evicted, and since the namespacing rules on the client had changed, the Memcached instance was perpetually OOM).

Prototype New Games in Unity

The Rails 2 –> 3 conversion was the last thing I did for Valor. Afterwards, I was part of a 3 person team (Calvin Hsu and Timmie Wong) to prototype our next game. Coming from server land, it was fun to program on the client for once. I learned CSharp, and also why people love it so much. More importantly, I learned Unity and experienced using a game engine in a professional setting. Sure I had dabbled with Unreal, Flixel, Flashpunk, and more before, but it was different when I had other people to work with and an actual concept I was going for. Most importantly, it really solidified in my head things that I liked in a game engine, things I disliked, and things I really wanted (that didn’t exist). When I build a game engine in the future (not a question of if, when), this will definitely come in handy.

Architect Champs

Finally, I architected Champs. Taking everything that I had learned from Valor, I changed almost everything. Champs was built to have:

  1. Stateful requests/responses
  2. In-application caching (easy invalidation)
  3. Socket based communication (made presence/chat and more easy drop ins)
  4. Compact protocols (initially JSON for prototyping purpose but changed to Protocol Buffers)
  5. Full test coverage (integration and unit tests, albeit not comprehensive)
  6. Low response times (application response time means are measured in tens of microseconds)
  7. Security (SSL based encryption, bcrypt encoded passwords and tokens, session invalidation, obscured API, single session logins, email validation, etc.)
  8. Smart modularity and well organized code
  9. A CAP theorem-first approach to mitigating race conditions and ensuring consistency where consistency was important, and availability when availability was important.
  10. Live configuration updates
  11. Code hot swapping
  12. Efficient and inexpensive to run (part of this is just avoiding EC2 like the plague)

On the deployment side, I learned Chef and created a working deployment in about a week’s time with:

  1. Aggressive Linux tuning for security and performance
  2. Comprehensive firewall rules
  3. A sound failover and load balancing strategy

In the course of writing and designing Champs, I got to author a few open source libraries as well (a websocket client, and an Erlang protocol buffer library). All of this was done over a period of 9.5 months. Many thanks to John O’ Connor who joined me for the last 4.5 of those months and picked things up quickly. Also, shoutouts to Kevin Burg who made pivotal contributions in the summer alone. Many thanks also to the rest of the Champs Engineering team (Harrison Chow, Kevin Xiao, Calvin Hsu, and Timmie Wong) for being awesome to work with and putting up with my mild tantrums when some open source library I found was broken.

Conferences

While at Quark, I was allowed to attend Ricon West 2012, where I was exposed to many of the fundamentals of the CAP theorem and logical monotonicity. This conference has shaped my approach to distributed systems a great deal, and I am grateful for the opportunity. I am fortunate to be speaking at this year’s Ricon West, although I will not be doing it on behalf of Quark. I expect I will learn a lot as before, and hope to contribute some back myself.

I also gave my first conference talk at Erlang Factory SF in March. This was a rather rattling experience, but I found it valuable to bounce ideas off the other speakers. I learned a lot at this conference as well and look forward to seeing the inevitable spike in Erlang (or Elixir) usage in the future.


Lessons learned

This is a sort of glop of stuff I think I learned over the last year, although it’s impossible to really distill it down to one (or even a hundred) blog posts.

  • Identify CAP theorem tradeoffs early
  • Cache invalidation really is one of the hardest problems you’ll face
  • If a database company tells you they solve all of your problems for every use case, they are lying
  • If the database company still tries to convince you, move on
  • HTTP request bodies are humongous and expensive
  • Text protocols are useful ONLY if you intend on viewing them a lot and typing them by hand (e.g. a configuration file).
  • You aren’t done when the code is checked in. You’re done when live users are using your code and it works
  • Logging and monitoring are always mandatory
  • If you automate everything you think you need to automate, you haven’t automated enough
  • Putting off automation is almost always the wrong decision and inefficient in the long term (and usually even the short term)
  • Having a layer of separation between Engineering and Product is good when time is critical
  • Crunching long hours makes boring work unbearable. If the work is fun, you’re never crunching.
  • Interviewing is a very very hard problem. Getting the wrong gauge on somebody is the norm.

Bye Quark

To all the individuals I’ve worked with, it’s been a great ride and I’m sorry to go but there are things I want to do. The system is hopefully built to last and I hope you keep in touch.

Where to?

Starting next week (although I’ve gotten a small headstart), I will be working on a rendering engine with Calvin Hsu, an awesome friend + engineer (and ex-Quark employee). There is still a lot I don’t know, and I expect the experience will be extremely humbling. This is what I ultimately want, however. Perpetual challenges that make me question my abilities and strive to learn and grow more as a mathematician, scientist, and programmer. More importantly, the forefront of technology is still just beyond me, and I want to do everything I can to get there and pierce the veil and venture into the void beyond.

Stop Comparing Programming Languages With Benchmarks

If I had one gripe about the programming community, it would be the oft misguided usage of benchmarks to determine what language is THE RIGHT ONE™. Benchmarks are used to justify decisions when they should not be. Benchmarks can lead a person to the wrong decision. So, programming community, this is my humble benchmark-vention to you.

Preliminaries, C pointers

C and C++ are as fast as you’re going to get (along with maybe Fortran) for raw computing. To understand philosophically why this is the case, consider the following:

int foo(int *bar) {
    return *bar;
}

All we are doing in this function is accepting a pointer to an integer, then dereferencing it (reading the value that the pointer points to), and returning it. Safe programming would dictate that before dereferencing this pointer, we should (unless we know it to be safe) compare bar with nullptr. The question is, what happens if the pointer bar was not actually pointing to something? The answer is, well, anything! The foo function may return a random number, or crash entirely (and take your program down with it). It will not generate an exception!

Well, why not? Because the mantra of C and C++ is performance; it will not do any more than the user asked it to. Here, when I implemented the foo function, I didn’t ask for a check that the pointer passed in was valid. So, it didn’t do it! Having implicit error and exception handling goes against the language philosophy, which is one of granting complete control to the user with minimal hand-holding (at runtime).

Why will higher level languages never get faster than C and C++? Well, they can’t by definition. Higher level languages are “higher level” because they provide a safety net around primitives like pointers to protect the programmer or make the code semantically easier to understand. An example of this is garbage collection. There is an intrinsic cost to providing anything beyond performing the operation itself, so C and C++ will always be the “speed of light” so to speak.

Understanding Tradeoffs

There are multiple axes at play when formulating a programming language. What types of expressions and constructs is the user restricted to (what is deemed illegal and what is deemed simply undefined)? How is memory handled? How are errors handled? Do I want my runtime to run natively or in a virtual machine? Let’s tackle each of these separately.

Permissiveness

Languages that are more permissive are either less performant or more dangerous. Consider Haskell. Haskell is very non-permissive in that all functions written must be pure (side-effect free). Side effects are encapsulated very carefully in monads (e.g. the I/O monad). As a result, Haskell is considered fairly performant and safe. It has sacrificed permissiveness, and so is not forced to trade performance against safety.

On the flip side, consider C++ and Ruby. Both languages place virtually no restriction on what the programmer can do, and both are forced to make tradeoffs as a result. C++ does nothing to protect the user, and so exchanges safety for performance. Ruby, on the other hand, makes everything an object, and thus makes the opposite tradeoff.

Another way of thinking about this is that there is a tension between enforcing constraints at compile time versus runtime. By imposing more constraints, the compiler can opt to generate more optimized code. If there are very few constraints, these constraints must either be checked at runtime, or not checked at all (or something in between).

A permissive language is not necessarily better than a restrictive one and vice versa.

Memory management

Reference counting isn’t free. End of story. Any implementation of a reference-counted pointer must necessarily maintain a count of the number of references made to the object. This count must be faithfully incremented each time the pointer enters a new scope, and decremented when it leaves one. Each time the pointer leaves a scope, the runtime must check whether the count is zero, at which point it either deletes the object immediately or queues the deletion for the garbage collector later on.

A memory managed language is not necessarily better than a language without managed memory and vice versa.

Error handling

Some languages like Erlang choose to be completely fault tolerant. Any line of code may generate an exception, and it will be captured somewhere in the process hierarchy (even without the developer making this explicit inline). This is also done in a way that isolates the error from other lightweight processes in the VM. Of course, there is a cost associated with this; there is no free lunch. The correct question to ask is: do the benefits outweigh the negatives? As we saw with the foo function above, not having this sort of error handling can lead to insidious bugs in code.

A fault-tolerant language is not necessarily superior to a fault-intolerant one and vice versa.

Native vs virtual execution

There are additional tradeoffs in considering whether the code must run natively or can run in a virtual machine. The main advantages a VM gives are concurrency primitives (that don’t map directly to OS threads) and the ability to load code very quickly (without compilation) and potentially even at runtime (the JVM and the Erlang VM are capable of this, for example). The main disadvantage is that code running in a VM will generally be slower.

Choosing a language the right way

First, outline your requirements and answer the following questions:


What is the expected load of my system?
What can I afford to pay to support this load?
What are the uptime requirements?
Can I tolerate my application restarting upon crash?
Do I need high-level concurrency?
Is latency more important to me, or throughput?
Do I need hot reloading?
What platforms do I need to support?
How large is my engineering team?
Do I prefer one paradigm over another (functional vs OO vs procedural)?
Am I comfortable taking on certain compile-time restrictions (like purity or immutability)?
What languages do I know already?
What languages does my team know already?
Does the language have an ecosystem of libraries that supports the domain I work in?
What is the timeline of my project?
What kind of hardware will I be using (memory, multicores, GPUs, etc.)?
Do I need to run on embedded devices?
What is my perceived priority and certainty for all the above points?

If at any point in answering these questions you are tempted to pull up a benchmark of “X language vs Y language,” please stop. That benchmark is unlikely to help you because it obscures the tradeoffs made by the slower language, tradeoffs that may actually be important to you.

Instead, sort your responses in priority order. If load is in fact the most important thing, rule out all the languages that are not performant enough and continue (don’t just settle for the fastest immediately). Most of the time, you will end up choosing a language that sacrifices some amount of control for something nice, like lightweight threads or some other critical feature. Heck, you might just like working with somebody that happens to love Python and choose that as a result.

The important thing is that there is a language that is the “optimum fit” for the tradeoffs you are willing to make and you will be 100x happier choosing that language in the long run.

Stop the language wars

Just stop. And unless you actually intend on computing Fibonacci sequences in your product, stop showing me those stupid benchmarks. Also, note that no matter what your answers to the questions above are, the right answer will almost never be PHP (forgive me).

OpenCL in Action: Chapter 2

If you are reading OpenCL in Action and get stuck on section 2.3.3 (titled “Code example: testing device extensions”), it’s because there are several bugs in the code that will manifest themselves at runtime.

Line 30 contains the following line:

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, NULL, &num_devices);

The issue is that the third parameter indicates that this function call should read one device into the fourth parameter, which is NULL. As this call is only trying to retrieve the number of devices available, the simple fix is to replace it with:

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

If you run the code now however, you’ll run into a segmentation fault. Using gdb (or your debugger of choice) and stepping through the code, you’ll find that the segfault happens on line 50:

clGetDeviceInfo(devices[i], CL_DEVICE_ADDRESS_BITS,
                sizeof(ext_data), &addr_data, NULL);

The bug here is that the third argument is the size of ext_data, which is a 4096-character array; it should be the size of a cl_uint (the type of addr_data). Replacing this code with:

clGetDeviceInfo(devices[i], CL_DEVICE_ADDRESS_BITS,
                sizeof(addr_data), &addr_data, NULL);

will fix the problem. I will be posting fixes like this to this blog and committing the changes to this repository.

I may also end up porting all the sample code to use the C++ bindings.

Thoughts on Using Chef

After masquerading as a DevOps engineer for the last few months (largely out of necessity), I thought it would be good to write up my thoughts on the experience as a whole. Much of this will largely be about Chef – how I chose to use Chef and my experience working with it.

Framework Complexity

If you’ve been following the ebb and flow of trends in DevOps, you’ll find that things seem to go in and out of fashion very quickly. Initially, system deploys were largely the realm of system administrators maintaining a mixture of shell and Perl scripts. Occasionally, you’d also find instances of CFEngine in the wild. As the startup movement occurred and developers at leaner and smaller companies found themselves doing operations, many other frameworks such as Chef, Puppet, and Ansible started rising in popularity.

Initially, when I was considering the various options for deploying our stack, shell scripts or small frameworks like Babushka or Sprinkle were very appealing. Chef and Puppet in particular seemed to have a lot of dependencies and the ramp time felt long (they also both suffered from what I consider subpar documentation at the time). Fast forward a few months and I’m a happy Chef user. Why the change?

Why the framework?

When I interview candidates for system administrator or developer operations roles, I am often curious if they understand why the framework is superior in the first place. The answer lies in the fact that the dominant configuration management frameworks are declarative. For the uninitiated, declarative programming is a strange thing.

Declarative programming

In procedural programming, the programmer informs the computer of a series of actions that the computer should perform. Most scripts take this form, as do programs written in a systems language such as C (e.g. Do this, then do that).

Conversely, the object oriented programmer informs the computer about a series of objects that act and interact with each other to accomplish the desired effect (e.g. This is a coffee maker which can brew coffee and can be cleaned).

In functional programming, the programmer defines a collection of functions that are composed and operate on inputs to produce a result (e.g. Here are instructions for taking the mean, and also computing the standard deviation).

Declarative programming is distinct from the others in that the programmer is defining not what the computer should do, but what the computer should be (e.g. There should be a file located at this path which reflects the current time).

Declarative programming is a natural fit for configuration management. Suppose, for example, that I want an haproxy.cfg file to exist at /etc/haproxy/haproxy.cfg with some contents $CONTENTS. If I were to do this with shell scripts, I could just do:

echo $CONTENTS > /etc/haproxy/haproxy.cfg

Easy enough. But what if I also needed the permissions of that file to be 0700? I could modify the script:

echo $CONTENTS > /etc/haproxy/haproxy.cfg
chmod 0700 /etc/haproxy/haproxy.cfg

But what if that file already exists and is owned by a different user? What if the file is there already and its state is correct? I wouldn’t want to repeat the operation each time, so I’d have to write conditional logic to diff the current state against the state I expect. Eventually, this simple script may evolve well beyond simple bash statements into a full-blown program.
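That conditional logic adds up quickly. As a sketch of what the script eventually grows into, here is a hand-rolled converge step for a single file in plain Ruby (a hypothetical helper, not Chef code):

```ruby
# Converge a single file toward a desired state: write the contents
# only if they differ, fix the mode only if it is wrong, and report
# whether anything actually had to be done.
def converge_file(path, contents, mode)
  changed = false
  unless File.exist?(path) && File.read(path) == contents
    File.write(path, contents)
    changed = true
  end
  unless (File.stat(path).mode & 07777) == mode
    File.chmod(mode, path)
    changed = true
  end
  changed
end
```

Running it twice in a row does nothing the second time; this idempotence is exactly what a declarative framework provides for every resource type, not just files.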

Imagine instead, if one could just write

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
end

Ignoring the specifics of the syntax, this is a declaration of what some aspect of the system should be. The entire declaration is a declared resource, and it is provisioned by the underlying resource provider, in this case, file. The file resource provider knows how to compute where the current state is inconsistent with the desired state and converge the former to the latter.

Chef

Programming in Chef, then, is just a matter of building up a long list of these resources. Most of the resources will be files or packages that are provisioned by the built-in resource providers (file and package are two examples). Chef will then faithfully try to converge the system to the desired declared state, doing nothing if there is nothing to do.

Not understanding how the framework operates can create some gotcha moments. For example, suppose I wanted a file at the path /tmp/haproxy_last_update that contains the last time haproxy was updated (contrived, I know). Since Chef lets the developer inline Ruby code in a recipe, the neophyte Chef programmer may be tempted to write something like:

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
end
haproxy_update_time = File.mtime("/etc/haproxy/haproxy.cfg")
file "/tmp/haproxy_last_update" do
  content haproxy_update_time.to_s
end

What will actually happen is that an exception will be thrown saying that "/etc/haproxy/haproxy.cfg" does not exist when File.mtime is called. If the file already existed, the contents of /tmp/haproxy_last_update would be the old update time rather than the new one, with no exception thrown. What gives? The answer lies in how the Chef run is actually performed.

Chef Runs

Chef runs take a machine from one state to another in two phases. First, all the recipes are compiled and logic is run to produce a queue of resources that are to be applied. The issue with the usage of File.mtime above is that it is executed before the resource just above it is performed. The Chef run will instead do the following:

  1. Push the file "/etc/haproxy/haproxy.cfg" resource onto the queue (does not actually modify the file!)
  2. Call File.mtime and store the result in a local variable
  3. Push the file "/tmp/haproxy_last_update" resource onto the queue (also does not actually modify any files)
  4. Assuming no other resources were declared, run the resource from step 1.
  5. Run the resource from step 3.

Now we see why the recipe behaved so strangely. Ruby logic embedded in a recipe in this manner should assume that the machine has not left its original state, since all resources are being pushed onto a queue at the time the logic is invoked. So what’s the right way to write it? In this case, the right thing to do is to move the update-time logic into a resource as well so that it is invoked after the first resource.
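The two-phase model is easy to reproduce outside of Chef. In this plain-Ruby sketch (an illustration, not Chef's actual internals), declaring a resource merely queues a lambda, so inline logic runs before any resource does:

```ruby
queue = []   # the resource queue built during the compile phase
log   = []   # stands in for the state of the machine

# Compile phase: declarations only enqueue work...
queue << -> { log << "write haproxy.cfg" }
mtime_seen = log.length   # ...but inline Ruby runs immediately, while
                          # the "file" above has not been written yet
queue << -> { log << "write last_update" }

# Converge phase: the queued resources finally run, in order.
queue.each(&:call)
```

`mtime_seen` comes out as 0: the inline code observed the machine's state before any resource had converged, which is exactly the trap in the recipe above.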

Notifications and Subscriptions

This is running a bit long, but the last pattern I wanted to discuss is recipe events. While it is possible to put tons of Ruby logic in Chef recipes to determine which resources should run, or to embed all this logic in ruby_block resources, the correct tool is the notification or the subscription.

file "/etc/haproxy/haproxy.cfg" do
  content @content
  mode 00700
  user "haproxy"
  group "haproxy"
  notifies :run, "ruby_block[/tmp/haproxy_last_update]"
end
ruby_block "/tmp/haproxy_last_update" do
  block do
    haproxy_update_time = File.mtime("/etc/haproxy/haproxy.cfg")
    File.write("/tmp/haproxy_last_update", haproxy_update_time.to_s)
  end
  action :nothing
end

This snippet will run the second resource only if the first one actually updated the file (the ruby_block’s action is :nothing, so it does nothing unless notified). By chaining notifications, Chef runs stay efficient and the recipes themselves stay lean. The best Chef recipes, in fact, are those with minimal code logic, because such recipes are by necessity more idiomatically declarative.

Writing your own resource providers

The last bit to learn, once you get comfortable with using the existing resource providers, is to write your own! This will happen organically over time as you start reusing components across multiple recipes. What will probably happen is that your recipes will start as a simple list of resources, expand as more complicated logic sets in, then contract afterwards as logic is refactored out into custom resource providers.

Conclusion

Having dabbled with Puppet, Ansible, and some other lighter-weight frameworks, I think you can’t go wrong with Chef. There is a hefty upfront learning curve, but I believe it pays dividends in the long run, and a motivated individual can be up to speed in a weekend. I would recommend leveraging Opscode’s free hosted tier for learning purposes. In addition, if you cannot spare a machine to learn Chef, you can easily set up a virtual machine (I use Vagrant and test-kitchen for this).

On Choosing Dynamo

Introduction

The last few years have been a tumultuous journey for databases, and it has been interesting to follow the trends and attempt to make a decision for our new product being rolled out at Quark Games. For a long time, MySQL and PostgreSQL were the de facto choices for a new web application or CMS. This changed in the valley, however, when demands for even faster iteration pushed developers into schemaless development, and ultimately into seriously evaluating NoSQL, not just for columnar tables intended for map-reduce, but as their primary data store.

To meet the demand, more and more distributed and non-distributed NoSQL storage solutions have cropped up. Nearly all of them make the same promises of redundancy, effortless scaling, and flexible schema. Of course, these promises do not come for free, and many developers found themselves in the uncomfortable position of wishing that they were using MySQL again.

So where are we now? What database should we choose, and under what circumstances?

CP systems

RDBMSs and NoSQL stores like HBase, Redis, RethinkDB, and Couchbase are CP databases: they emphasize immediate consistency and will compromise availability in the case of node failure or a netsplit. This includes sharded configurations of MySQL or PostgreSQL. In these database systems, writes and reads for a given key are typically serialized through a single authoritative source, chosen by some sort of hashing algorithm. Losing a node will generally make a subset of keys unavailable for some time, until failover replaces the node or a leader election is performed to redistribute the lost keys.

CP systems generally make more sense if your data is accessed from many places simultaneously (for a single key). Under these circumstances, immediate consistency allows for atomic operations, either with check-and-set functionality or a locking mechanism. Redis, for example, provides SETEX, WATCH, MULTI, and a number of other commands to coordinate data changes. This is important for counters, shared lists, and other data that is expected to be read and written from a variety of sources.
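The value of check-and-set is that a write lands only if the data has not changed underneath the writer. This in-memory sketch mimics those semantics (it is not the Redis API; the class and method names are made up):

```ruby
require "monitor"

# A toy store with compare-and-set: the swap succeeds only if the
# caller's snapshot of the value still matches what is stored.
class CasStore
  def initialize
    @data = {}
    @lock = Monitor.new
  end

  def get(key)
    @lock.synchronize { @data[key] }
  end

  # Returns true if the swap happened, false if another writer raced us.
  def compare_and_set(key, expected, new_value)
    @lock.synchronize do
      return false unless @data[key] == expected
      @data[key] = new_value
      true
    end
  end
end
```

Two clients that both read a counter at 1 will race: the first compare_and_set of 1 to 2 wins, and the second fails and must re-read and retry, which is how atomic counters stay correct without holding a lock across the whole read-modify-write.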

AP systems

AP systems typically fall under the category of databases based on the Dynamo paper. These databases replicate data from node to node and allow reads and writes to any of the nodes that house the particular key in question. As such, they provide eventual consistency instead of immediate consistency and will continue to allow reads and writes even in the presence of node failure. Examples of AP databases are Riak, Cassandra, Voldemort, and DynamoDB.

AP systems generally make more sense if each chunk of your data is usually accessed from one place at a time. In that case, eventual consistency is unlikely to cause many conflicts, and the application can afford to do read repair or serialize writes from a single source. AP systems are obviously preferable if availability is important! For some applications, you’d rather display something than nothing at all.
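A simple repair on read can be sketched with last-write-wins timestamps. This is only an illustration of the idea (real systems like Riak use vector clocks rather than raw timestamps):

```ruby
# Each replica maps key => [value, timestamp]. On read, take the
# newest version seen and write it back to any replica that is behind.
def read_with_repair(replicas, key)
  newest = replicas.map { |r| r[key] }.compact.max_by { |_, ts| ts }
  replicas.each do |r|
    r[key] = newest if r[key].nil? || r[key][1] < newest[1]
  end
  newest[0]
end
```

After one such read, a replica that held a stale version has been brought up to date as a side effect, which is why occasional inconsistency is tolerable when reads funnel through logic like this.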

Comparing CP and AP

CP systems will generally require less hardware to do the same number of operations because replication happens asynchronously. However, latency tails for CP systems can be expected to be higher because operations happen against a single node. In AP systems, since multiple nodes have a chance to respond to a request (like a race), latency tails are reduced significantly.

Handoff procedures in the face of node failure are also cheaper with AP systems whereas with CP systems, latency and throughput may suffer more. This is due to the additional coordination required to promote replica copies. With AP systems, the system behaves more or less as it did before, although latency may increase for a subset of keys. The cluster as a whole though, will be less affected.

Choosing

Our data is structured as follows: Player X can read and write Player X’s data; Player Y can read Player X’s data but can’t write to it. As a result, data is almost entirely written from a single source (the application is stateful). There are only a handful of instances in the codebase where we need to account for eventual consistency and perform a simple repair operation. Generally, I believe that if you can go with an AP system, you should (but benchmark responsibly first). Availability and uptime don’t feel important until downtime occurs, and consistency can be managed. Lately, I have really enjoyed working with Riak, which has performed well in my benchmarks and seems to make the right design tradeoffs.