Secondary Indices in Riak With Erlang

For the uninitiated, Riak is a production-ready distributed key-value store. It is based on Amazon’s Dynamo paper and is faithful in its implementation.

Recently, I figured out how to use secondary indices in Riak using the Erlang client and am documenting their usage here for the benefit of future users. Secondary indices provide a way to query for sets of bucket/key pairs stored in Riak. The sets are defined by either a single value or a range of values on a specified index. An index can be attached as metadata to any riakc_obj prior to a put operation. Secondary indices are fast, and they are the preferred way to group associated data when you don’t need to perform complex operations on the contents of the value (for that purpose, there is map-reduce).

First, you’ll want to change the storage backend of Riak to LevelDB. The default Bitcask backend will complain that secondary indices are not supported. Incidentally, swapping backends takes only a configuration change and a restart, since Riak itself primarily handles the distribution of data, node failover, and the other Dynamo principles, and leaves on-disk storage to a pluggable backend.

To do this, find the app.config file in the etc folder of your Riak build. If you compiled Riak from source using make rel, this file will be in rel/riak/etc. Find the lines:

{riak_kv, [
    {storage_backend, riak_kv_bitcask_backend},

and change them to:

{riak_kv, [
    {storage_backend, riak_kv_eleveldb_backend},

Restarting Riak at this point will switch the storage backend. Note that any data you had in Bitcask is not gone; it stays on disk in its own directory, separate from the LevelDB data, so the new backend will not see it (the paths to these locations are elsewhere in the same config file).
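For orientation, the backend data paths mentioned above usually appear in app.config as sections like the following. This is only a sketch: the directory values shown are assumed defaults for a source build, and your file may list different paths or additional settings.

```erlang
%% Excerpt of app.config (illustrative; the paths below are assumptions,
%% check your own file for the actual values).
{bitcask, [
    {data_root, "./data/bitcask"}   %% where existing Bitcask data lives
]},

{eleveldb, [
    {data_root, "./data/leveldb"}   %% where LevelDB will write new data
]},
```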

At this point, let’s create a few sample indexed objects:

Obj = riakc_obj:new(<<"employee">>, <<"jeremy">>, <<"engineer">>).
MetaData = dict:store(<<"index">>, [{"age_int", 23}, {"state_bin", "CA"}], dict:new()).
Obj1 = riakc_obj:update_metadata(Obj, MetaData).

Obj2 = riakc_obj:new(<<"employee">>, <<"helena">>, <<"scientist">>).
MetaData1 = dict:store(<<"index">>, [{"age_int", 32}, {"state_bin", "CA"}], dict:new()).
Obj3 = riakc_obj:update_metadata(Obj2, MetaData1).

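%% Assumes a node on the local machine; substitute your node's host if it differs.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
riakc_pb_socket:put(Pid, Obj1).
riakc_pb_socket:put(Pid, Obj3).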

Here, I have inserted two objects into Riak, each representing a different employee. I have also inserted them with metadata about their age and state of residence. Note that object metadata is represented as a dictionary in Erlang. To query the data, I can do something like:

riakc_pb_socket:get_index(Pid, <<"employee">>, "age_int", 23).
riakc_pb_socket:get_index(Pid, <<"employee">>, "state_bin", "CA").

The first query will return a list of a single item: [<<"employee">>, <<"jeremy">>]. The second will return a list of two items: [<<"employee">>, <<"jeremy">>] and [<<"employee">>, <<"helena">>]. Note that the field name of an index must end in either _bin or _int to denote the type of the indexed value. At this time, only binary and integer data can be used to index objects.
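Since the index entries live in an ordinary dict under the <<"index">> key, the suffix rule is easy to check before anything is sent to Riak. Here is a minimal sketch of that idea. The module name index_md and its functions are hypothetical helpers of my own, not part of the Riak client, and it assumes field names are given as strings, as in the examples above.

```erlang
-module(index_md).
-export([index_type/1, add_index/3]).

%% Classify an index field name by its suffix.
%% Returns integer, binary, or invalid.
index_type(Field) when is_list(Field) ->
    case {lists:suffix("_int", Field), lists:suffix("_bin", Field)} of
        {true, _} -> integer;
        {_, true} -> binary;
        _         -> invalid
    end.

%% Append one index entry to a metadata dict, keeping any entries that
%% are already stored under the <<"index">> key.
%% Raises badarg if the field name or the value type is not acceptable.
add_index(Field, Value, MetaData) ->
    ok = check(index_type(Field), Value),
    Existing = case dict:find(<<"index">>, MetaData) of
                   {ok, Entries} -> Entries;
                   error         -> []
               end,
    dict:store(<<"index">>, Existing ++ [{Field, Value}], MetaData).

%% _int fields take integers; _bin fields take strings or binaries.
check(integer, V) when is_integer(V) -> ok;
check(binary, V) when is_list(V); is_binary(V) -> ok;
check(_, _) -> erlang:error(badarg).
```

With this in place, something like index_md:add_index("age_int", 23, riakc_obj:get_update_metadata(Obj)) should produce a dict that can be handed to riakc_obj:update_metadata/2 without discarding whatever metadata the object already carried.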

And that’s a wrap! In your own applications, you might want to consider grabbing an object’s metadata first before updating it so you don’t wipe out existing metadata. If you are still reading at this point, I hope you found this helpful. I would encourage experienced Riak users (or anybody really) to point out any errors I might have made so I can fix them for the good of the community. Thanks for reading.