RocksDB Embeddable Persistent Key-Value Store

Abdullah Ozturk - Blog
Sep 14, 2015

RocksDB is an embeddable, persistent key-value store for fast storage, open-sourced by Facebook.

RocksDB builds on LevelDB (created by Google) to scale on servers with many CPU cores, to use fast storage efficiently, and to support IO-bound, in-memory and write-once workloads. RocksDB excels when the data stored is larger than the RAM of the target machine.

According to benchmarks conducted by Facebook, RocksDB solidly outperformed LevelDB on these IO-bound workloads:

  • Bulk Load of keys in Random Order
  • Bulk Load of keys in Sequential Order
  • Write Performance
  • Read Performance

Thus, RocksDB can be used by applications that need low-latency database access. You can install RocksDB and try it on your local machine by following the steps here.

The sections below show example usages of RocksDB in C++.

Database Reads and Writes

The RocksDB library provides a persistent key-value store. The basic operations are opening a database and then reading, writing and deleting keys.

Keys and values are arbitrary byte arrays. The Slice type is a simple structure that contains a length and a pointer to an external byte array. Using a Slice is cheaper than using a std::string because potentially large keys and values do not need to be copied.

Most RocksDB functions that may encounter an error return a value of type Status. You can check whether such a result is ok, and also print the associated error message.

Batch Updates

A WriteBatch provides atomic updates to the database by holding a sequence of edits and applying them in order. It can also be used to speed up bulk updates by placing many individual mutations into the same batch.

Synchronous Writes

By default, each write operation is asynchronous: it returns once the write has been handed from the process to the operating system. The sync flag can be turned on for a particular write so that the operation does not return until the data has been pushed all the way to persistent storage. Asynchronous writes are often much faster than synchronous writes. The downside is that a crash of the machine may cause the last few updates to be lost. Note that a crash of just the writing process will not cause any loss, even when sync is false.

Database Iteration

It is possible to print all key-value pairs in a database by using iterators. It is also possible to process just the keys in the range [start, limit), or to process entries in reverse order, which is slower than forward iteration.

Expected output:

1: one 2: two 3: three 
2: two
3: three 2: two 1: one

Getting Snapshot

Snapshots provide consistent read-only views over the entire state of the key-value store.

Expected output:

1: one 2: two

Custom Comparator

All examples above used the default ordering function for keys, which orders bytes lexicographically. You can, however, supply a custom comparator when opening a database. The keys are then ordered within the key-value store according to the user-specified comparator function. For example, suppose each database key consists of two numbers separated by a colon, and we want to sort by the first number, breaking ties by the second number.

Expected output:

1:3: one 2:1: three 2:3: two

Write-ahead logging (WAL) is a family of techniques for providing atomicity and durability in database systems. RocksDB provides a way to completely disable the write-ahead log for a particular write.

When bulk loading or running large idempotent operations, you can disable syncing of data files by setting Options::disableDataSync to true before opening the database. Once the operation is finished, you can manually call sync() to flush all dirty buffers to stable storage.

A RocksDB database may only be opened by one process at a time. The RocksDB implementation acquires a lock from the operating system to prevent misuse. Within a single process, the same rocksdb::DB object may be safely shared by multiple concurrent threads. However, other objects (like Iterator and WriteBatch) may require external synchronisation.

It is possible to tune the performance of your RocksDB database by changing the default values of the types defined in include/rocksdb/options.h for block size, write buffer, compression, cache, key layout and filters.

In this post, I have tried to cover the basic usage of RocksDB for low-latency persistent key-value storage in diverse application domains.

Please comment below if you have any questions.

Originally published at tech.aozturk.me on November 23, 2013.

Abdullah Ozturk - Blog

Software engineer. Distributed systems enthusiast. #data, #iot, #mobile, #scalability, #cplusplus, #java https://github.com/aozturk