Probabilistic Data Structures: The Theory Behind Bloom Filters, HyperLogLog, and Count-Min Sketch
When exact answers are too expensive, approximate answers with bounded error save the day — and the server budget.
Imagine trying to count every unique visitor to a website that receives 10 billion hits per day. Storing every single user ID would cost you gigabytes of RAM per server. Alternatively, you could use a structure that fits in 2.5 KB and gives you the right answer 98% of the time. That trade-off is exactly what probabilistic data structures are built for.
Before diving in, it helps to understand why these structures exist in the first place. In large-scale systems, three constraints regularly collide: data volume (too much to store exactly), memory budget (servers aren’t free), and latency requirements (answers must come back fast). When all three squeeze at once, exact data structures simply don’t scale. Probabilistic ones do — and they do it with mathematical guarantees on how wrong they can be.
In this article, we’ll walk through three of the most widely used probabilistic data structures: Bloom filters, HyperLogLog, and the Count-Min Sketch. We’ll explain not just what they do, but why the underlying math makes them trustworthy enough to deploy at Google, Facebook, Redis, and Cassandra scale.
A probabilistic data structure doesn’t give you wrong answers — it gives you approximate answers with quantifiable, bounded error. That’s a very different thing.
1. The Foundation: Why Hashing Makes This All Possible
All three structures rely on one core primitive: the hash function. A good hash function takes an arbitrary input and maps it to a fixed-size output in a way that appears random. Two critical properties make hashing useful for probabilistic structures:
First, uniformity: outputs are spread evenly across the output space. Second, determinism: the same input always produces the same output. Together, these properties let you encode information about a massive dataset into a compact bit array or counter table, then query it later with predictable accuracy.
The birthday paradox is worth understanding here, too. In a group of just 23 people, there’s a 50% chance that two share a birthday. Why? Because the number of possible pairings (253) grows much faster than the number of people. In hashing terms, collisions — two different inputs mapping to the same output — happen far sooner than intuition suggests. Probabilistic structures don’t eliminate collisions; instead, they account for them mathematically and use that accounting to bound error.
2. Bloom Filters: Fast Membership Testing with a Built-In Caveat
A Bloom filter answers one question: “Have I seen this item before?” It’s used everywhere — from checking whether a URL is malicious (Google Safe Browsing) to preventing duplicate writes in databases like Apache Cassandra.
How it works
A Bloom filter is simply a bit array of m bits, initially all set to zero, paired with k independent hash functions. When you insert an element, you run it through all k hash functions, each producing a position in the array, and set those bits to 1. To query whether an element exists, you check those same k positions — if all are 1, the element is probably in the set; if any is 0, it’s definitely not.
Notice the asymmetry there. That asymmetry is the entire point:
| Query Result | What It Means | Can You Trust It? |
|---|---|---|
| All bits = 1 | Element probably in set | Almost — false positives are possible |
| Any bit = 0 | Element definitely not in set | 100% — no false negatives ever |
Furthermore, the false positive rate is not arbitrary — it’s precisely calculable. After inserting n elements into a filter with m bits and k hash functions, the probability of a false positive is:
P(false positive) ≈ (1 − e−kn/m)k
This formula tells you something powerful: you can tune the filter to whatever error rate you need simply by adjusting m and k. For a 1% false positive rate with 1 million elements, you need roughly 9.6 bits per element — about 1.2 MB total. That’s a remarkable trade-off for a membership test that would otherwise require storing all 1 million values.
Below, you can see how the false positive rate changes as you vary the number of hash functions for a fixed filter size. Notice how there’s an optimal number of hash functions — too few or too many, and error creeps up.
Bloom Filter — False Positive Rate vs. Number of Hash Functions
In practice, Redis’s Bloom filter module and many open-source libraries handle the math for you — you simply specify your desired false positive rate and expected element count, and the library sizes the structure accordingly. The key engineering insight is that Bloom filters never lie in the negative direction. If the filter says “no,” you can trust that absolutely.
3. HyperLogLog: Counting the Uncountable
Now consider a different problem: you want to count how many distinct values appear in a stream of data — unique users, unique search queries, unique product views. This is the cardinality estimation problem, and it’s surprisingly hard to solve exactly at scale.
The naïve solution is to maintain a hash set. But for billions of distinct elements, a hash set can require gigabytes of memory. HyperLogLog, introduced by Flajolet, Fusy, Gandouet, and Meunier in their landmark 2007 paper, achieves approximately 2% error using just 1.5 KB of memory — regardless of whether you’re counting a million or a trillion distinct elements.
The intuition: leading zeros as a signal
The insight comes from the probabilistic behaviour of hash functions. When you hash elements uniformly, the probability of seeing a hash value with k leading zeros is 2−k. Therefore, if you observe a hash with 10 leading zeros, you can estimate you’ve seen roughly 210 = 1,024 distinct elements.
That’s a rough estimator, but HyperLogLog improves it dramatically by using m parallel registers. Each incoming hash value is split: the first few bits select a register bucket, and the remaining bits are used to track the maximum leading-zero run seen in that bucket. At query time, a harmonic mean of all register estimates is computed. This technique — originally from the Flajolet-Martin algorithm — gives consistent, low-variance results across all ranges of cardinality.
Real-world deployment: Redis has shipped HyperLogLog as a native data type since version 2.8. A full billion-element cardinality estimate is stored in just 12 KB, with ±0.81% error. See the Redis HyperLogLog docs for implementation details.
The chart below illustrates HyperLogLog’s memory efficiency compared to exact counting across different cardinality ranges. As the cardinality grows, the gap between exact and approximate memory requirements widens dramatically.
Memory Usage: Exact HashSet vs. HyperLogLog
It’s also worth noting that HyperLogLog is mergeable. You can run it independently on multiple servers and combine the results with a bitwise OR — a property that makes it ideal for distributed systems. Google used an extended variant, HyperLogLog++, in large-scale analytics pipelines for exactly this reason.
4. Count-Min Sketch: Frequency Estimation over Streaming Data
The third structure solves yet another problem: given a stream of events, how often has each item appeared? This is the heavy-hitter problem — finding the most frequent elements in a stream too large to store. It comes up in network intrusion detection (find the most frequent IP addresses), recommendation systems (find the most viewed items), and database query optimisation.
The Count-Min Sketch, introduced by Cormode and Muthukrishnan in 2005, uses a 2D array of counters with dimensions d × w, where d is the number of hash functions and w is the width of each row. Each incoming element is hashed by all d functions, and the corresponding counters in each row are incremented. To query an element’s frequency, you take the minimum across all d counter values — hence the name.
Why the minimum? Counters are inflated by collisions — multiple elements can increment the same counter. The minimum across independent rows gives the tightest upper bound on the true count, because at least one row is likely to have fewer collisions than the others.
Error guarantees and sizing
The Count-Min Sketch provides an overestimate guarantee: the estimated count is always ≥ the true count, never below. The error is bounded by ε × N, where ε is a user-specified error factor and N is the total number of insertions. Setting w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ guarantees that with probability 1−δ, the estimate exceeds the true count by at most εN.
| Structure | Problem Solved | Error Type | Memory (typical) | Used In |
|---|---|---|---|---|
| Bloom Filter | Membership testing | False positives only | ~10 bits per element | Cassandra, Chrome, BigTable |
| HyperLogLog | Cardinality estimation | ~±2% relative error | 1.5–12 KB fixed | Redis, Spark, BigQuery |
| Count-Min Sketch | Frequency estimation | Overestimates only | O(1/ε · log 1/δ) counters | Flink, Storm, network analytics |
5. The Math That Ties It Together: Linearity of Expectation
One reason these structures work so reliably is a beautiful theorem in probability called linearity of expectation. It states that the expected value of a sum of random variables equals the sum of their individual expected values — regardless of whether those variables are correlated.
In the context of HyperLogLog’s harmonic mean across registers, this means that even though each register’s estimate is noisy, the average across m registers converges to the true cardinality predictably. You don’t need the registers to be independent; linearity of expectation holds either way. This is what makes the error bound provable rather than merely empirical.
Similarly, in the Count-Min Sketch, linearity of expectation explains why the minimum-of-sums estimator converges: even though each row accumulates collision noise, the expected collision rate is bounded by 1/w per hash function, and taking the minimum across d rows drives the expected error down exponentially.
6. Putting It into Practice: When to Use Each
Choosing the right structure depends on what question you’re asking. As a rule of thumb, think about it this way: if you want to test membership, reach for a Bloom filter. If you want to count distinct things, reach for HyperLogLog. If you want to estimate frequencies, reach for Count-Min Sketch. These aren’t overlapping tools — each is shaped to its problem.
There are, of course, limitations to keep in mind. Bloom filters are not deletable by default (though counting Bloom filters address this). HyperLogLog’s error grows slightly for very small cardinalities and requires bias correction for small sets. Count-Min Sketch only handles positive frequency updates — negative counters (deletions) require a different variant called the CMS with deletions or a conservative update scheme.
Still, for the vast majority of streaming and large-scale analytics use cases, these three structures cover the essential ground. Moreover, they compose well: for example, you can use a Bloom filter as a pre-filter before a Count-Min Sketch to avoid counting elements you’ve never seen before, reducing noise in the sketch.
A minimal Bloom filter in Python
The following Python snippet is self-contained and runnable on Python 3.6+ with no external dependencies. It demonstrates a basic Bloom filter using the hashlib standard library module:
# Requires Python 3.6+ — no external packages needed
import hashlib
import math
class BloomFilter:
def __init__(self, n, p):
# n = expected elements, p = false positive rate
self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
self.k = math.ceil((self.m / n) * math.log(2))
self.bits = bytearray(self.m)
def _positions(self, item):
item = str(item).encode()
for i in range(self.k):
h = hashlib.sha256(item + i.to_bytes(2, 'big')).hexdigest()
yield int(h, 16) % self.m
def add(self, item):
for pos in self._positions(item):
self.bits[pos] = 1
def __contains__(self, item):
return all(self.bits[pos] for pos in self._positions(item))
# Example: 1 000 items, 1% false positive rate
bf = BloomFilter(n=1000, p=0.01)
bf.add("hello")
bf.add("world")
print("hello" in bf) # True
print("missing" in bf) # False (no false negative possible)
print(f"Bit array: {bf.m} bits, {bf.k} hash functions")
To run this, save it as bloom.py and execute python bloom.py in your terminal. No pip installs needed. The class automatically computes optimal m and k from your target parameters.
7. What We’ve Learned
Let’s take a step back and recap what we’ve covered in this article:
- Probabilistic data structures trade exactness for dramatic memory and speed gains, with mathematically bounded, predictable error — not random guesswork.
- Bloom filters answer membership queries with zero false negatives. By tuning the number of bits and hash functions, you can achieve any desired false positive rate at roughly 10 bits per element.
- HyperLogLog estimates cardinality — the count of distinct elements in a stream — with ~2% error using only kilobytes of memory, regardless of whether you’re counting millions or trillions of values.
- Count-Min Sketch estimates item frequencies over streaming data, always overestimating (never underestimating) by a factor bounded by ε × N.
- The underlying theory — hash uniformity, the birthday paradox, the Flajolet-Martin algorithm, and linearity of expectation — is what makes these error bounds provable and trustworthy in production systems.
- All three structures are in active production use at major tech companies through tools like Redis, Apache Cassandra, Apache Spark, Google BigQuery, and Apache Flink.





