Faster commits with 1PC+C instead of 2PC
Introduction
redb is an embedded key-value database, and committing data and ensuring its durability is paramount. Up until version 0.4.0, it used the classic 2-phase commit (2PC) strategy in all cases. Version 0.4.0 adds a 1-phase commit with checksum (1PC+C) approach that reduces latency by 2x, making redb faster than, the well-known, lmdb for point insertions and other use cases with small transactions.
2-phase commit
2-phase commit involves writing all the data, performing an fsync to ensure it is durably stored, then toggling a metadata flag to promote that data to be visible to future transactions, and finally performing a second fsync. The first fsync is required to ensure that a partially written value does not become visible, in the event that the computer loses power during the second fsync.
The pseudocode for this looks something like the following:
file.write(&data);
file.fsync();
// Promote the written data to be the primary copy
file.promote_to_primary();
file.fsync();
1-phase commit with checksum
The latency of fsync is very high – on the order of milliseconds – so this approach seeks to speed up commits by alleviating the need for the first fsync. Recall that the first one is required to avoid a partially written value, in the event of a crash or loss of power. We can instead detect that case by adding a checksum to the data & metadata that is written, proceed optimistically, and cleanup a partially written value during crash recovery.
The pseudocode looks something like this:
file.write(&data);
file.write(checksum(&data));
// Promote the written data to be the primary copy
file.promote_to_primary();
file.fsync();
In redb’s implementation a Merkle tree with XXH3 for the checksum is used. XXH3 was chosen because it is extremely fast and statistically has very few collisions. All committed transactions in redb have a monotonically increasing identifier, and detecting a partially written value is done by comparing the last two (possibly partially) committed transactions. Whichever has the higher identifier and a valid checksum is used, and the other is discarded.
Benchmarks
Even though computing the checksum adds some computational cost, this approach is ~2x faster when committing small transactions, and only has a small slowdown when loading large amounts of data:
+--------------------+--------------+------------+--------+---------+
| | redb (1PC+C) | redb (2PC) | lmdb | sled |
+===================================================================+
| individual writes | 227ms | 381ms | 388ms | 642ms |
|--------------------+--------------+------------+--------+---------|
| bulk load | 1770ms | 1370ms | 976ms | 4534ms |
|--------------------+--------------+------------+--------+---------|