Cuckoo and Bloom filters: probabilistic efficient caching.
In this article, I want to tell you about a highly efficient cache design built on probabilistic data structures. This technique is used in many modern applications, including web browsers, databases, and distributed systems.
Cuckoo and Bloom filters are two popular probabilistic data structures that provide very space-efficient ways to store and look up data, answering whether an element is a member of a set or not. The idea is to store optimized hashes of the elements instead of the elements themselves.
These data structures are probabilistic by nature, meaning that they can return false positives but never false negatives. Both algorithms can be configured to keep the chance of false positives below 1%, but the fact that false positives are possible at all limits their usefulness in some use cases. This is the price to pay for significant space efficiency and high lookup speeds.
The Cuckoo filter has one big advantage over the Bloom filter: it allows deletion of elements from the set, which can be very useful in numerous instances. It also provides better lookup performance and better space efficiency for applications requiring low false positive rates (< 3%). But before we dive into the details of the algorithms, let's take a look at some use cases where they can be used.
Use cases:
- Financial fraud detection. To check whether a recipient has already been flagged, or whether their card has been reported stolen. An additional benefit of using a probabilistic filter here is that financial organizations can exchange their lists of stolen or blocked credit card numbers without revealing the numbers themselves. Many other security-related checks can be cached with this approach.
- Check if a username is taken. Checks whether a username/email/domain/etc. is already taken or not.
- Handling a large volume of requests that return "not found" errors. To check whether data is absent before querying a database or a 3rd-party service. Imagine you have an API that returns "not found" errors for 75% of requests. This is where probabilistic filters shine, because they never produce false negatives.
- Ad campaigns. To decide whether to show an ad based on whether the user has signed up for a campaign. This is a good use case for the Cuckoo filter because it allows deleting users from the campaign if they unsubscribe.
- Firewall network security, DDoS protection systems, etc. To efficiently check whether an IP address is already blocked or not.
- Referral codes, discount/promo coupons. To check whether a user has already used a code, or whether a code is valid. This is a good use case for the Cuckoo filter because it allows deleting codes from the set once they are used.
- CDNs and caching. To optimize storage and request efficiency. For example, using Bloom filters to prevent caching of resources that are requested only once.
- Blockchain events / logs searches. Ethereum is known to use Bloom filters to optimize the search for relevant logs in a block.
These algorithms are mainly used with large data sets. The key takeaway is that you can use them to answer whether an element is a member of a set only in cases where you can afford ~1% false positives. But you can use them to answer that an element is not a member of a set with 100% accuracy, as long as you can sync and maintain the filter properly.
So if you need 100% accuracy in answering whether an element is a member of a set, you should use regular hash tables or sets instead. But you will pay the price in space and time complexity.
Choosing the right algorithm or tool is always a trade-off.
The Bloom filter.
A Bloom filter is a probabilistic data structure used to check whether an element might exist in a set or definitely does not. So it can guarantee only the absence of an element from the set, not its presence. Even though the false positive rate can be drastically reduced by increasing the size of the filter and tuning the number of hash functions, it can never be eliminated completely. However, Bloom filters do not permit deletion of items from the set.
At its core, a Bloom filter is an array of many bits. When an item is added, it is passed through fast hash functions, and the corresponding bit positions in the bitfield are flipped from 0 to 1. To check whether an item is present, the same positions are computed and inspected. Because hash functions are deterministic, the same item always maps to the same bits; so even though many items may have overlapping bits, if any one of an item's bits is still 0, we can be certain that the item has never been added. The check is, of course, subject to collisions: if all of an item's bits happen to have been set by other items, the filter returns a false positive.
To reduce the risk of collisions, a filter may use more than one bit per element (i.e., several hash functions). Generally, the more bits per element, the lower the likelihood of false positives. Another value affecting the accuracy of a Bloom filter is its fill ratio: how many bits in the filter are actually set. The more bits that are set, the higher the chance of a false positive.
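To make the bit-setting mechanics concrete, here is a deliberately tiny, illustration-only sketch in Go: a fixed bit array and two positions derived from a single FNV-1a hash. A real filter sizes the array and picks the number of hash functions from the capacity and the target error rate, as we will see below.
package main

import (
    "fmt"
    "hash/fnv"
)

const filterBits = 1 << 16 // toy size, chosen arbitrarily for the example

var bits [filterBits / 8]byte

// positions derives two bit positions from one 64-bit hash (a toy stand-in for k hash functions).
func positions(item string) [2]uint64 {
    h := fnv.New64a()
    h.Write([]byte(item))
    sum := h.Sum64()
    return [2]uint64{sum % filterBits, (sum >> 32) % filterBits}
}

func add(item string) {
    for _, p := range positions(item) {
        bits[p/8] |= 1 << (p % 8) // set the corresponding bit to 1
    }
}

func mightContain(item string) bool {
    for _, p := range positions(item) {
        if bits[p/8]&(1<<(p%8)) == 0 {
            return false // a zero bit means the item was definitely never added
        }
    }
    return true // all bits set: either present or a false positive
}

func main() {
    add("Smoky Mountain Striker")
    fmt.Println(mightContain("Smoky Mountain Striker")) // true
    fmt.Println(mightContain("unknown model"))          // expected false (a false positive is theoretically possible)
}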
Choosing the right number of bits per element and the target fill ratio has to be determined case by case. It depends on the data size, the rate at which data is added, the acceptable false positive rate, constraints on the filter size, etc. The internet is full of formulas for calculating the optimal Bloom filter configuration; I want to talk more about practical implementations!
Bloom filter usage in Go.
You would expect me here to show you a full implementation of a Bloom filter from scratch, but it rarely makes sense for production web applications. Most likely, you would like to use it with Redis or some Redis-like solution that already has it implemented and optimized for you.
Redis has built-in support for Bloom filters. It was originally implemented as a separate module called RedisBloom, available on GitHub for earlier Redis versions. Starting from Redis 8 (May 2025), it is included in the core Redis distribution (and not only the Bloom filter, but a whole collection of probabilistic data structures as well).
This is a code example of how to use Redis Bloom filters in Golang:
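// Note: this snippet assumes that rdb is an initialized *redis.Client from github.com/redis/go-redis/v9
// and that ctx is a context.Context (for example, context.Background()).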
// Create an empty Bloom filter with a single sub-filter for the initially specified capacity (1000)
// and an upper-bound error rate (0.01).
res1, err := rdb.BFReserve(ctx, "bikes:models", 0.01, 1000).Result()
if err != nil {
    panic(err)
}
fmt.Println(res1) // >>> OK

res2, err := rdb.BFAdd(ctx, "bikes:models", "Smoky Mountain Striker").Result()
if err != nil {
    panic(err)
}
fmt.Println(res2) // >>> true

res3, err := rdb.BFExists(ctx, "bikes:models", "Smoky Mountain Striker").Result()
if err != nil {
    panic(err)
}
fmt.Println(res3) // >>> true
This example is taken from the official Redis documentation: https://redis.io/docs/latest/develop/data-types/probabilistic/bloom-filter/. With functions like BFReserveExpansion, BFReserveNonScaling, and BFReserveWithArgs you can control not only the initial capacity and error rate but also the scaling behavior of the filter.
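Continuing the snippet above, a hedged sketch of reserving a filter with explicit scaling options (using a hypothetical bikes:models:v2 key) could look like this; the BFReserveOptions field names are based on go-redis v9, so double-check them against your client version:
res, err := rdb.BFReserveWithArgs(ctx, "bikes:models:v2", &redis.BFReserveOptions{
    Capacity:   10_000, // expected number of items before the filter has to scale out
    Error:      0.001,  // target false positive rate
    Expansion:  2,      // each new sub-filter is twice the size of the previous one
    NonScaling: false,  // set to true to forbid creating additional sub-filters
}).Result()
if err != nil {
    panic(err)
}
fmt.Println(res) // >>> OK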
The required number of bits per item, given the desired error_rate and the optimal number of hash functions, is -ln(error_rate) / ln(2)^2. Hence, the required number of bits in the whole filter is capacity * -ln(error_rate) / ln(2)^2.
- 1% error rate requires 7 hash functions and 9.585 bits per item.
- 0.1% error rate requires 10 hash functions and 14.378 bits per item.
- 0.01% error rate requires 14 hash functions and 19.170 bits per item.
So the total size of a filter with 1000 items and a 0.01% error rate is 1000 * 19.170 = 19170 bits = 2396 bytes ≈ 2.4 kilobytes. And the total size of a filter with 1000 items and a 1% error rate is 1000 * 9.585 = 9585 bits = 1198 bytes ≈ 1.2 kilobytes. In other words, a 0.01% false positive rate costs only about twice the space of a 1% rate.
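If you prefer to let code do the arithmetic, a small throwaway program like this one reproduces the numbers above (purely illustrative):
package main

import (
    "fmt"
    "math"
)

func main() {
    capacity := 1000.0
    for _, errorRate := range []float64{0.01, 0.001, 0.0001} {
        hashFuncs := math.Ceil(-math.Log(errorRate) / math.Ln2)
        bitsPerItem := -math.Log(errorRate) / (math.Ln2 * math.Ln2)
        totalBits := capacity * bitsPerItem
        fmt.Printf("error rate %.2f%%: %.0f hash functions, %.3f bits/item, %.0f bits (%.1f KB)\n",
            errorRate*100, hashFuncs, bitsPerItem, totalBits, totalBits/8/1000)
    }
}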
Once again, the key tradeoff is between the size of the filter and the false positive rate. Depending on your data set size, available memory, and acceptable false positive rate, you can choose the right configuration for your Bloom filter.
The Cuckoo filter.
The Cuckoo filter is a more advanced probabilistic data structure used for high-speed set membership testing (like the Bloom filter). The main advantage over the Bloom filter is that Cuckoo allows deletion of items from the set, which is not possible with the Bloom filter. The Cuckoo filter also provides better lookup performance and space efficiency compared to the Bloom filter.
I must mention that my analysis of the Cuckoo filter is primarily based on the original research paper from 2014: Cuckoo Filter: Practically Better Than Bloom. I recommend you read it as well if you want to understand the algorithm in more detail and check its performance against various Bloom filter variations.
The Cuckoo filter is an array of buckets, storing a short fingerprint of each value in one of two candidate buckets at positions decided by two hash functions. A membership query for item x looks for x's fingerprint in all potential buckets and returns true if an identical fingerprint is found. The size of a Cuckoo filter's fingerprint directly affects the false positive rate.
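The trick that makes lookups and evictions work (partial-key cuckoo hashing in the paper) is that the second bucket index is derived from the first index and the fingerprint alone, so an evicted fingerprint can always be moved to its alternate bucket without re-reading the original item. Here is a minimal, illustration-only sketch of that calculation; the hash choices and sizes are mine, not the paper's exact parameters:
package main

import (
    "fmt"
    "hash/fnv"
)

const numBuckets = 1 << 16 // a power of two, so the XOR below maps back and forth between the two buckets

func hash64(b []byte) uint64 {
    h := fnv.New64a()
    h.Write(b)
    return h.Sum64()
}

// candidates returns an item's 8-bit fingerprint and its two candidate bucket indices.
func candidates(item string) (fp uint8, i1, i2 uint64) {
    h := hash64([]byte(item))
    fp = uint8(h>>56) | 1 // never zero, so zero can mark an empty slot
    i1 = h % numBuckets
    // i2 depends only on i1 and the fingerprint, so it can be recomputed when a
    // fingerprint is evicted, without knowing the original item.
    i2 = (i1 ^ hash64([]byte{fp})) % numBuckets
    return fp, i1, i2
}

func main() {
    fp, i1, i2 := candidates("Smoky Mountain Striker")
    fmt.Printf("fingerprint=%d, candidate buckets=%d and %d\n", fp, i1, i2)
}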
In practice, the fill rate will probably never reach 100%, since the filter is likely to declare itself full before capacity is reached. When that happens, the filter automatically creates more sub-filters, resulting in decreased performance and an increased error rate. Each new sub-filter is the size of the previous one multiplied by the expansion rate. The error rate increases linearly with the number of sub-filters, and it also increases linearly with the bucket size.
Cuckoo filter usage in Go.
Yet again, I would like to provide a real usage example rather than a full implementation of the algorithm. The Cuckoo filter is also available in Redis in the same RedisBloom module, which is part of the Redis 8 core distribution.
The usage is similar to the Bloom filter, but I would recommend using CFReserveWithArgs at all times. Choosing the right bucket size, max iterations, and expansion multiplier is crucial for the performance of the Cuckoo filter. It is a more complex filter to set up than Bloom if you want to achieve performance gains.
res1, err := rdb.CFReserveWithArgs(ctx, "bikes:models", &redis.CFReserveOptions{
    // The filter will not fill up to 100%; reserve extra capacity if you want to avoid expansions.
    Capacity: 1000,
    // The minimal false positive error rate is 2/255 ≈ 0.78% when a bucket size of 1 is used.
    BucketSize: 1,
    // Number of attempts to swap items between buckets before declaring the filter full and creating an additional filter.
    MaxIterations: 20,
    // When a new filter is created, its size is the size of the current filter multiplied by expansion.
    Expansion: 2,
}).Result()
if err != nil {
    panic(err)
}
fmt.Println(res1) // >>> OK

res2, err := rdb.CFAdd(ctx, "bikes:models", "Smoky Mountain Striker").Result()
if err != nil {
    panic(err)
}
fmt.Println(res2) // >>> true

res3, err := rdb.CFExists(ctx, "bikes:models", "Smoky Mountain Striker").Result()
if err != nil {
    panic(err)
}
fmt.Println(res3) // >>> true

res4, err := rdb.CFDel(ctx, "bikes:models", "Smoky Mountain Striker").Result()
if err != nil {
    panic(err)
}
fmt.Println(res4) // >>> true
This example is mostly taken from the official Redis documentation: https://redis.io/docs/latest/develop/data-types/probabilistic/cuckoo-filter/. The CFReserveWithArgs function allows you to specify the bucket size, max iterations, and expansion multiplier, which are crucial for the performance of the Cuckoo filter.
The wholly different approach to filter setup makes it harder to compare directly with the Bloom filter, which is why I once again refer you to the original paper for more details on the algorithm and on how its performance changes with the number of sub-filters and the bucket size.
Cache fill control.
When using probabilistic filters, it is important to understand that they are not a replacement for a regular cache. They are a complementary tool that can be used to improve the performance of a cache or other data sources.
Since these filters will not fill themselves with hashes, we need to talk about how the data gets into them. In this part, I want to list popular cache fill patterns that you might use.
- Write-Through Pattern. In this pattern, every write operation is applied to the cache (and filters) first, and then to the data source. This adds overhead to write, update, and delete operations but ensures that the cache is always in sync (a sketch follows after this list).
- In-Memory Pattern. Load all data hashes into the filter once (e.g., at application startup, with CRON jobs, workflows, etc.) and keep it in memory. This is the simplest pattern, but it requires that all data fits into memory and can be loaded at once.
- Cache-Aside Pattern. This is the most common caching pattern, where the application code is responsible for checking the cache first and then fetching the data from the data source if it is not found in the cache. This pattern is almost useless for probabilistic filters; I just wanted to mention it here for completeness.
There are other patterns as well, but I wanted to focus your attention on the two most popular ones for probabilistic caches: Write-Through and In-Memory. The choice of pattern depends on the use case, data size, and performance requirements.
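To make the Write-Through pattern concrete, here is a hedged sketch built around the Bloom filter from the earlier examples; saveBikeModel and findBikeModel are hypothetical stand-ins for your database layer:
package bikecache // hypothetical package, for illustration only

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// Hypothetical stand-ins for the real database layer.
func saveBikeModel(ctx context.Context, model string) error         { return nil }
func findBikeModel(ctx context.Context, model string) (bool, error) { return true, nil }

// AddBikeModel is the write path: every write updates the filter together with the primary store.
func AddBikeModel(ctx context.Context, rdb *redis.Client, model string) error {
    if err := rdb.BFAdd(ctx, "bikes:models", model).Err(); err != nil {
        return err
    }
    return saveBikeModel(ctx, model)
}

// BikeModelExists is the read path: consult the filter first and only hit the database
// on a (possibly false) positive.
func BikeModelExists(ctx context.Context, rdb *redis.Client, model string) (bool, error) {
    ok, err := rdb.BFExists(ctx, "bikes:models", model).Result()
    if err != nil {
        return false, err
    }
    if !ok {
        return false, nil // "definitely not in the set": skip the database entirely
    }
    return findBikeModel(ctx, model) // confirm, since the filter can return a false positive
}
The same shape works with the Cuckoo filter: swap BFAdd/BFExists for CFAdd/CFExists and call CFDel when an item is removed from the primary store.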
Conclusion.
While Cuckoo filters and Bloom filters are both used in common applications, they have different strengths and weaknesses. Bloom filters are simpler to implement and use, but they do not allow deletion of items from the set. Cuckoo filters are more complex, but they provide better performance and allow deletion.
In my opinion, I would still pick the Bloom filter over the Cuckoo filter in most cases, even now, 11 years after the Cuckoo filter paper was published. My main reasons are these:
- Bloom filter has gained widespread adoption and has been supported by many libraries and tools for years now
- Bloom filter is significantly simpler to implement and use compared to Cuckoo filter (if you need to implement it from scratch)
- Bloom filter has more predictable false positive rates and performance characteristics. It is easier to calculate the required size of the filter and the number of hash functions needed for a given error rate
- Bloom filter is already significantly faster and more space-efficient than regular hash tables or sets, which is the main goal of using it in the first place. Picking up the Cuckoo filter over the Bloom filter should be done with careful consideration and benchmarks based on real data size and access patterns
- Bloom filters typically exhibit better performance and scalability when inserting items (if you need to add data frequently)
- The deletion feature of the Cuckoo filter might be a huge advantage in some cases, but there are workarounds to achieve similar behavior with the Bloom filter. For example, a Counting Bloom filter allows you to delete items from the set at the price of a 3-4x space overhead compared to the classic Bloom filter (a toy sketch of the idea follows after this list). It is not currently supported by Redis out of the box, but you can implement it yourself or use a library that supports it
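To give a flavor of the counting idea, here is a toy sketch that reuses the positions helper and the filterBits constant from the illustration-only Bloom sketch earlier in the article; each slot becomes an 8-bit counter instead of a single bit, which is exactly where the extra space goes:
var counters [filterBits]uint8 // 8-bit counters instead of single bits

func countingAdd(item string) {
    for _, p := range positions(item) {
        counters[p]++ // a production version would also guard against counter overflow
    }
}

func countingRemove(item string) {
    for _, p := range positions(item) {
        if counters[p] > 0 {
            counters[p]--
        }
    }
}

func countingMightContain(item string) bool {
    for _, p := range positions(item) {
        if counters[p] == 0 {
            return false // definitely not present (never added, or already removed)
        }
    }
    return true // possibly present
}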
So, in short, if you need a simple and efficient way to check if an element is a member of a set, use the Bloom filter. If you need to delete items from the set and can afford the complexity overhead, use the Cuckoo filter. If your main goal is to use the most efficient algorithm, still start with the Bloom filter. You will know in practice if you need to switch to the Cuckoo filter later on based on your data size and access patterns.
Probabilistic data structures proved to be very useful and efficient over the years, and they are widely used in applications working with big data sets or high RPS requirements. They are a great tool to have in your toolbox, but I have to emphasize one thing:
Never engage in premature optimizations.
If you implement something new, start small. Make it happen first; you can always optimize it later. Use hash tables or sets to cache some steps, and then if you see that you have a problem with space or performance of your cache, consider using Bloom filters or Cuckoo filters.
Sources:
- Cuckoo filter paper: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
- Redis Bloom official documentation: https://redis.io/docs/latest/develop/data-types/probabilistic/bloom-filter/
- Redis Cuckoo official documentation: https://redis.io/docs/latest/develop/data-types/probabilistic/cuckoo-filter/
- Bloom Filter Python implementation in great detail: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
25 May 2025