Practical Rateless Set Reconciliation

williamsentosa · July 30, 2024, 11:36am

Presenter: Lei Yang (MIT)
Authors: Lei Yang (MIT), Yossi Gilad (Hebrew University of Jerusalem), Muhammad Alizadeh (MIT)

Synchronization is an essential problem in distributed systems, e.g., in peer-to-peer systems like blockchain. There is a known abstraction of the synchronization problem: set reconciliation. Imagine that Alice and Bob each hold a set of fixed-length items (e.g., file system blocks, database keys, etc.). The goal is to identify the items that are exclusive to one of them, so that we know the items they are missing from each other.

There are straightforward solutions that production systems have been using. One of them is for Alice to send the entire set to Bob, who can then figure out which items are missing from him. But this is very inefficient. A slightly more optimized solution is to send hashes or Bloom filters instead of the actual items, but the cost is still very high when the sets are large. An even more optimized solution is to build a Merkle tree of the items, but it still involves many rounds of interactions (communication overhead) to check the nodes.

A more promising solution (that the author explores) is to use a linear sketch. A sketch is a summary of a set. We can also decode a sketch to get the items in the set. The linear sketch has a useful property: linearity, such that we can add together the sketches of Alice and Bob, and the result will be the sketch of the symmetric difference between Alice and Bob. With this, Alice can just send the sketch of her set to Bob, and Bob will add that to his sketch, resulting in the sketch of their symmetric difference. Bob then decodes the sketch (called “peeling”) to recover the symmetric difference (if a sketch has at least twice as many symbols as the items which has been proven mathematically).

However, this sketch (which the author refers to as “traditional sketches”) is inflexible because of the sketch capacity constraint; Its capacity is roughly half of the symbols for perfect recovery. This capacity is also fixed when the sketch is constructed. The main problem is that we don’t know how many items there are in a symmetric difference. Setting it too big will be inefficient, and setting it too low will result in failure to recover data from the sketch.

Then, the author introduces the concept of a “universal sketch” design. This sketch has infinite symbols, which translates to infinite capacity. Also, each prefix of this sketch is also a sketch. Thus, we can incrementally generate a longer prefix as needed. The core contribution is that they designed such a universal sketch, which they call “Rateless Invertible Bloom Lookup Tables” (IBLTs). They claim that it is the first realization of a universal sketch with good computation and communication efficiency as past reconciliation protocols have suffered from high decoding complexity.

Using Rateless IBLTs, reconciling d items requires only 1.35d - 1.58d symbols on average (with proof in the paper). Computation cost is also low, as generating d symbols requires scanning the set log(d) times. Decoding d symbols takes d log d computation. The author also claims that it has a low communication cost. They derived these values based on mathematical proof and simulations.

There is also another advantage of having these universal symbols. For example, Alice can send its symbols to Bob and other parties. These symbols can also be incrementally updated. If Alice inserts or removes an item, she only needs to update O(log m), with m being the number of symbols.

The author mentions that Rateless IBLTs are easy to implement (<400 lines of code in Go). They also conducted an end-to-end evaluation to synchronize clients in the Ethereum blockchain and compared Rateless IBLTs and Merkle Tries in terms of completion time across the staleness of the data, managing to outperform Merkle Tries by 4x.

In conclusion, the author claims that it is the first solution to achieve low computation cost and near-optimal communication cost. It has two main features: rateless and universal, making it easy to deploy as it does not require estimating the set difference size to generate a sketch. It also achieves low communication and computation costs, backed by mathematical analysis. The implementation is also open source.

Q&A:
Q1. What is the most immediate apps that can benefit from Rateless IBLTs for networking community?
A1. RPKI, synchronizing public keys or record. When you want to synchronize small set, it will be ok, but when you have large sets across two copies, Rateless IBLTs will be much better.