Faster commits with 1PC+C instead of 2PC

Introduction

redb is an embedded key-value database, and committing data and ensuring its durability is paramount. Up until version 0.4.0, it used the classic 2-phase commit (2PC) strategy in all cases. Version 0.4.0 adds a 1-phase commit with checksum (1PC+C) approach that reduces latency by 2x, making redb faster than, the well-known, lmdb for point insertions and other use cases with small transactions.

2-phase commit

2-phase commit involves writing all the data, performing an fsync to ensure it is durably stored, then toggling a metadata flag to promote that data to be visible to future transactions, and finally performing a second fsync. The first fsync is required to ensure that a partially written value does not become visible, in the event that the computer loses power during the second fsync.

The pseudocode for this looks something like the following:

file.write(&data);
file.fsync();
// Promote the written data to be the primary copy
file.promote_to_primary();
file.fsync();

1-phase commit with checksum

The latency of fsync is very high – on the order of milliseconds – so this approach seeks to speed up commits by alleviating the need for the first fsync. Recall that the first one is required to avoid a partially written value, in the event of a crash or loss of power. We can instead detect that case by adding a checksum to the data & metadata that is written, proceed optimistically, and cleanup a partially written value during crash recovery.

The pseudocode looks something like this:

file.write(&data);
file.write(checksum(&data));
// Promote the written data to be the primary copy
file.promote_to_primary();
file.fsync();

In redb’s implementation a Merkle tree with XXH3 for the checksum is used. XXH3 was chosen because it is extremely fast and statistically has very few collisions. All committed transactions in redb have a monotonically increasing identifier, and detecting a partially written value is done by comparing the last two (possibly partially) committed transactions. Whichever has the higher identifier and a valid checksum is used, and the other is discarded.

Benchmarks

Even though computing the checksum adds some computational cost, this approach is ~2x faster when committing small transactions, and only has a small slowdown when loading large amounts of data:

+--------------------+--------------+------------+--------+---------+
|                    | redb (1PC+C) | redb (2PC) | lmdb   | sled    |
+===================================================================+
| individual writes  | 227ms        | 381ms      | 388ms  | 642ms   |
|--------------------+--------------+------------+--------+---------|
| bulk load          | 1770ms       | 1370ms     | 976ms  | 4534ms  |
|--------------------+--------------+------------+--------+---------|