> [!NOTE]
> A Bloom filter can tell you "this item is definitely NOT in the set" or "this item is probably in the set." It can never give a false negative, but it can give a false positive. This tiny compromise lets you check membership in a set of millions of items using only a few megabytes of memory.
## The Membership Problem
In many systems, you need to quickly answer: "Have I seen this item before?"
- Has this URL already been crawled? (web crawler)
- Is this website malicious? (browser safe browsing)
- Has the user already been recommended this article? (content feed)
- Does this row exist in the database before doing an expensive disk read? (database optimization)
- Has this email been sent before? (deduplication)
The naive approach (store every seen item in a HashSet) works until the set grows to billions of items. At that point, a HashSet of 1 billion strings consumes ~20 GB of RAM. A Bloom filter can answer the same question using ~1.2 GB with a 1% false positive rate: roughly a 16x memory reduction.
## How Bloom Filters Work
- Start with a bit array of M bits, all initialized to 0.
- Choose K independent hash functions, each mapping an item to a position in the bit array.
- To add an item: hash it with all K functions, set those K bit positions to 1.
- To check an item: hash it with all K functions, check if ALL K positions are 1.
  - If any position is 0 → the item is definitely NOT in the set.
  - If all positions are 1 → the item is probably in the set (could be a false positive from other items setting those bits).
```
Bit array (M = 16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Add "apple" (K=3):  hash1→3, hash2→7, hash3→11
                    [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0]

Add "banana" (K=3): hash1→1, hash2→7, hash3→14
                    [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0]

Check "cherry": hash1→3, hash2→5, hash3→11
                Position 5 is 0 → "cherry" is DEFINITELY NOT in the set ✓

Check "date":   hash1→1, hash2→7, hash3→11
                All positions are 1 → "date" is PROBABLY in the set (false positive!)
```
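The trace above can be reproduced with a tiny sketch. The bit positions here are hard-coded to match the worked example rather than computed by real hash functions:

```python
# Bit positions are hard-coded to match the worked example above;
# a real filter would derive them from K hash functions.
M = 16
bits = [0] * M

positions = {
    "apple":  [3, 7, 11],
    "banana": [1, 7, 14],
    "cherry": [3, 5, 11],
    "date":   [1, 7, 11],
}

def add(item):
    for p in positions[item]:
        bits[p] = 1

def might_contain(item):
    # All of the item's positions must be set for a "probably yes"
    return all(bits[p] == 1 for p in positions[item])

add("apple")
add("banana")

print(might_contain("cherry"))  # False: position 5 was never set
print(might_contain("date"))    # True, even though "date" was never added
```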
## Implementation Pseudocode
```python
import math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        # Optimal size and hash count for the target false positive rate
        self.size = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.num_hashes = round((self.size / expected_items) * math.log(2))
        self.bit_array = [0] * self.size

    def _hash(self, item, seed):
        # Seeded hash via tuple hashing; real implementations use
        # MurmurHash or similar. Python's hash() is randomized per
        # process, which is fine within a single run.
        return hash((seed, item)) % self.size

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bit_array[self._hash(item, seed)] = 1

    def might_contain(self, item):
        for seed in range(self.num_hashes):
            if self.bit_array[self._hash(item, seed)] == 0:
                return False  # Definitely NOT in set
        return True  # Probably in set
```

Usage:

```python
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True (correct)
bf.might_contain("user:789")  # False (correct) or True (false positive)
```
## Tuning: False Positive Rate vs Memory
| Items (N) | False Positive Rate | Bits Needed (M) | Memory | Hash Functions (K) |
|---|---|---|---|---|
| 1 million | 1% | ~9.6M bits | ~1.2 MB | 7 |
| 1 million | 0.1% | ~14.4M bits | ~1.8 MB | 10 |
| 1 billion | 1% | ~9.6B bits | ~1.2 GB | 7 |
| 1 billion | 0.01% | ~19.2B bits | ~2.4 GB | 13 |
The formulas:

- Optimal bit array size: M = -(N × ln(p)) / (ln(2))²
- Optimal hash functions: K = (M / N) × ln(2)

Where:

- N = expected number of items
- p = desired false positive rate
- M = number of bits in the array
- K = number of hash functions

Rule of thumb: ~10 bits per item gives ~1% FP rate.
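The formulas above can be checked directly in a few lines (`bloom_params` is a hypothetical helper name, not a library function):

```python
import math

def bloom_params(n_items, fp_rate):
    """Optimal bit-array size M and hash count K for the formulas above."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = round((m / n_items) * math.log(2))
    return m, k

m, k = bloom_params(1_000_000, 0.01)
print(m, k)  # ~9.6 million bits and 7 hash functions, matching the table
```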
> [!IMPORTANT]
> Standard Bloom filters cannot delete items. Once a bit is set, you cannot know whether other items also depend on that bit. If you need deletion, use a Counting Bloom Filter (each position is a counter instead of a bit) or a Cuckoo Filter.
## Bloom Filter Variants
| Variant | Key Feature | Tradeoff | Use Case |
|---|---|---|---|
| Counting Bloom Filter | Supports deletion (counters instead of bits) | 4× more memory | Dynamic sets with frequent additions/removals |
| Scalable Bloom Filter | Grows automatically as items are added | Slightly higher FP rate | When you don't know the expected item count upfront |
| Cuckoo Filter | Supports deletion + lower FP at same memory | More complex implementation | Modern replacement for Bloom filters |
| Quotient Filter | Cache-friendly, supports deletion and merging | More complex | SSD-based storage engines |
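The Counting Bloom Filter row above can be sketched by swapping bits for counters. The sizes and the SHA-256-based seeded hash here are illustrative, not a production choice:

```python
import hashlib

class CountingBloomFilter:
    """Sketch of a counting Bloom filter: counters instead of bits,
    which makes deletion possible (at ~4x the memory)."""
    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Illustrative seeded hash; real filters use faster hashes
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # Only safe for items that were actually added
        for i in self._indexes(item):
            self.counters[i] -= 1

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

Removing an item decrements its counters, so other items that share some of those positions remain findable.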
## Other Probabilistic Data Structures
### HyperLogLog (Cardinality Estimation)
Answers: "Approximately how many unique items are in this data set?" Using only 12 KB of memory, HyperLogLog can estimate the number of unique elements in a set of billions with ~0.81% error. Redis has built-in HyperLogLog commands (PFADD, PFCOUNT).
Redis HyperLogLog example:

```
PFADD visitors "user:123" "user:456" "user:123"   # "user:123" counted once
PFCOUNT visitors                                  # → 2
```

Even with 100 million unique visitors, memory usage is still just 12 KB.
- Use case: Counting unique visitors to a website, unique search queries per day, unique IPs accessing a service.
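The core idea of HyperLogLog, tracking per register the longest run of zero bits among item hashes, can be sketched in plain Python. This toy version (all names and parameters are illustrative) uses the standard bias constant and small-range correction but omits the refinements real implementations like Redis's apply:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: 2^p registers, each remembering the longest run
    of trailing zero bits seen among the hashes routed to it."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # standard bias constant

    def add(self, item):
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        idx = x & (self.m - 1)    # low p bits choose a register
        w = x >> self.p           # remaining bits feed the zero-run count
        rank = 1
        while w & 1 == 0 and rank < 64:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:   # small-range correction (linear counting)
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=10 (1024 registers, roughly 1 KB of counters here) the expected relative error is about 1.04/√1024 ≈ 3%.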
### Count-Min Sketch (Frequency Estimation)
Answers: "Approximately how many times has this item appeared?" It uses multiple hash functions mapping to a 2D array of counters. It may over-count (never under-counts).
```
Count-Min Sketch (d=3 hash functions, w=8 columns):

        Col 0  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7
Hash1:    0      3      0      0      7      0      0      1
Hash2:    0      0      5      0      0      3      0      0
Hash3:    2      0      0      3      0      0      0      0

Query "trending_topic": hash1→4, hash2→2, hash3→3
Counts: 7, 5, 3
Estimate: min(7, 5, 3) = 3 (close to actual count)
```
- Use case: Identifying trending hashtags on Twitter, detecting network traffic anomalies, tracking most-played songs on Spotify, finding heavy hitters in a data stream.
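The structure described above fits in a few lines of Python. Dimensions and the hashing scheme are illustrative:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters. The estimate is
    the minimum across rows, so it can over-count but never under-count."""
    def __init__(self, d=4, w=256):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _index(self, item, row):
        # One seeded hash per row (illustrative; real sketches use faster hashes)
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.d))
```

Collisions in a row inflate that row's counter, which is why taking the minimum across several independent rows tightens the estimate.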
## When to Use Which Data Structure
| Question You Need Answered | Data Structure | Memory | Error Type |
|---|---|---|---|
| "Is this item in the set?" | Bloom Filter | ~1.2 GB / 1B items | False positives only |
| "How many unique items?" | HyperLogLog | 12 KB (fixed!) | ±0.81% error |
| "How often did this item appear?" | Count-Min Sketch | Configurable | Over-counts only |
| "What are the top-K items?" | Count-Min Sketch + Heap | Configurable | Approximate ranking |
| "Is this item in the set? (with deletion)" | Cuckoo Filter | Similar to Bloom | False positives only |
## Real-World Usage
### Google Chrome Safe Browsing
Chrome checks every URL you visit against a database of known malicious URLs. Instead of sending every URL to Google (privacy nightmare) or downloading the entire blacklist (too large), Chrome uses a Bloom filter stored locally. If the Bloom filter says "no" → the URL is safe (no network request needed). If the Bloom filter says "maybe" → Chrome sends a hash prefix to Google's server for a definitive check. This reduces network requests by over 99%.
URL Visit Flow:

```
URL → Local Bloom Filter
        → "Definitely safe" (99% of URLs)  → Allow immediately
        → "Maybe malicious" (~1%)          → Query Google server → Confirm/Deny
```
### Cassandra (SSTable Lookup)
Cassandra stores data in immutable files called SSTables. When a read comes in, Cassandra might need to check multiple SSTables to find the data. Each SSTable has an associated Bloom filter. Before doing an expensive disk read, Cassandra checks the Bloom filter: "Could this key be in this SSTable?" If the Bloom filter says no, it skips that file entirely, saving significant I/O.
Impact: Without Bloom filters, a read might check 10 SSTables (10 disk reads). With Bloom filters, it typically checks only 1–2 (the ones that actually contain the key).
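The skip logic can be illustrated with a toy model. The table contents and key names are invented, and a small inline Bloom filter guards each simulated SSTable:

```python
import hashlib

def bloom_indexes(key, size=1024, k=5):
    # Illustrative seeded hash; real systems use faster non-cryptographic hashes
    for seed in range(k):
        h = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
        yield int(h, 16) % size

class SSTable:
    """Toy SSTable: a dict of rows guarded by a Bloom filter."""
    def __init__(self, rows):
        self.rows = rows
        self.bits = [0] * 1024
        for key in rows:
            for i in bloom_indexes(key):
                self.bits[i] = 1
        self.disk_reads = 0                # count of (simulated) disk hits

    def get(self, key):
        if not all(self.bits[i] for i in bloom_indexes(key)):
            return None                    # filter says no: skip this file
        self.disk_reads += 1               # filter says maybe: pay the I/O
        return self.rows.get(key)

tables = [SSTable({"k1": "v1"}), SSTable({"k2": "v2"}), SSTable({"k3": "v3"})]

def read(key):
    for t in tables:
        value = t.get(key)
        if value is not None:
            return value
    return None
```

A read for a key in the last table touches only that file; the Bloom filters on the first two answer "definitely not" without any simulated disk access.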
### Medium (Article Recommendation Dedup)
Medium uses Bloom filters to avoid recommending articles a user has already read. Instead of querying the database for every candidate article ("has user X read article Y?"), they check a per-user Bloom filter. The occasional false positive (skipping an unread article) is acceptable—showing a duplicate is worse.
### Bitcoin SPV Nodes
Lightweight Bitcoin nodes (SPV, Simplified Payment Verification) use Bloom filters to request only relevant transactions from full nodes without revealing exactly which addresses they own. This provides a balance between privacy and bandwidth efficiency.
## Common Mistakes
- ❌ Confusing false positives with false negatives — Bloom filters never produce false negatives. If it says "no," trust it absolutely.
- ❌ Trying to delete from a standard Bloom filter — use a Counting Bloom Filter or Cuckoo Filter instead.
- ❌ Using a Bloom filter when exact membership is required — if you cannot tolerate any false positives (e.g., financial dedup), use a proper HashSet or database lookup.
- ❌ Not sizing the filter correctly — if the actual number of items exceeds your initial estimate, the false positive rate increases dramatically. Use a Scalable Bloom Filter if uncertain.
- ❌ Using too few hash functions — the optimal K is (M/N) × ln(2). Too few or too many hash functions increase the FP rate.
> [!TIP]
> Key Takeaways:
> - Bloom filters trade a small false positive rate for massive memory savings: 1 billion items in ~1.2 GB vs ~20 GB for a HashSet.
> - No false negatives: "definitely not in the set" is always correct.
> - ~10 bits per item gives ~1% FP rate. Optimal K ≈ (M/N) × ln(2).
> - HyperLogLog counts unique items in 12 KB. Count-Min Sketch estimates frequencies.
> - Cuckoo Filters are the modern replacement when you need deletion support.
> - In interviews, mention Bloom filters for web crawlers (URL dedup), databases (SSTable lookup), and caches (avoiding unnecessary lookups).