> [!NOTE]
> A Bloom filter can tell you "this item is definitely NOT in the set" or "this item is probably in the set." It can never give a false negative, but it can give a false positive. This tiny compromise lets you check membership in a set of millions of items using only a few megabytes of memory.
## The Membership Problem
In many systems, you need to quickly answer: "Have I seen this item before?"
- Has this URL already been crawled? (web crawler)
- Is this website malicious? (browser safe browsing)
- Has the user already been recommended this article? (content feed)
- Does this row exist in the database before doing an expensive disk read? (database optimization)
- Has this email been sent before? (deduplication)
The naive approach (store every seen item in a HashSet) works until the set grows to billions of items. At that point, a HashSet of 1 billion strings consumes ~20 GB of RAM. A Bloom filter can answer the same question using ~1.2 GB with a 1% false positive rate: roughly a 16x memory reduction.
## How Bloom Filters Work
- Start with a bit array of M bits, all initialized to 0.
- Choose K independent hash functions, each mapping an item to a position in the bit array.
- To add an item: hash it with all K functions, set those K bit positions to 1.
- To check an item: hash it with all K functions, check if ALL K positions are 1.
  - If any position is 0 → the item is definitely NOT in the set.
  - If all positions are 1 → the item is probably in the set (could be a false positive from other items setting those bits).
```
Bit array (M = 16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Add "apple" (K=3):  hash1→3, hash2→7, hash3→11
                    [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0]

Add "banana" (K=3): hash1→1, hash2→7, hash3→14
                    [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0]

Check "cherry": hash1→3, hash2→5, hash3→11
                Position 5 is 0 → "cherry" is DEFINITELY NOT in the set ✓

Check "date":   hash1→1, hash2→7, hash3→11
                All positions are 1 → "date" is PROBABLY in the set (false positive!)
```
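The trace above can be reproduced with a tiny sketch. The bit positions here are hard-coded to match the worked example rather than computed by real hash functions:

```python
# Bit positions are hard-coded to match the worked example above;
# a real filter would derive them from K hash functions.
M = 16
bits = [0] * M

positions = {
    "apple":  [3, 7, 11],
    "banana": [1, 7, 14],
    "cherry": [3, 5, 11],
    "date":   [1, 7, 11],
}

def add(item):
    for p in positions[item]:
        bits[p] = 1

def might_contain(item):
    # All of the item's positions must be set for a "probably yes"
    return all(bits[p] == 1 for p in positions[item])

add("apple")
add("banana")

print(might_contain("cherry"))  # False: position 5 was never set
print(might_contain("date"))    # True, even though "date" was never added
```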
## Implementation Pseudocode
```python
import math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        # Optimal size and hash count for the target false positive rate
        self.size = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.num_hashes = round((self.size / expected_items) * math.log(2))
        self.bit_array = [0] * self.size

    def _hash(self, item, seed):
        # Seeded hash via tuple hashing; real implementations use
        # MurmurHash or similar. Python's hash() is randomized per
        # process, which is fine within a single run.
        return hash((seed, item)) % self.size

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bit_array[self._hash(item, seed)] = 1

    def might_contain(self, item):
        for seed in range(self.num_hashes):
            if self.bit_array[self._hash(item, seed)] == 0:
                return False  # Definitely NOT in set
        return True  # Probably in set
```

Usage:

```python
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True (correct)
bf.might_contain("user:789")  # False (correct) or True (false positive)
```
## Tuning: False Positive Rate vs Memory
| Items (N) | False Positive Rate | Bits Needed (M) | Memory | Hash Functions (K) |
|---|---|---|---|---|
| 1 million | 1% | ~9.6M bits | ~1.2 MB | 7 |
| 1 million | 0.1% | ~14.4M bits | ~1.8 MB | 10 |
| 1 billion | 1% | ~9.6B bits | ~1.2 GB | 7 |
| 1 billion | 0.01% | ~19.2B bits | ~2.4 GB | 13 |
The formulas:

- Optimal bit array size: M = -(N × ln(p)) / (ln(2))²
- Optimal hash functions: K = (M / N) × ln(2)

Where:

- N = expected number of items
- p = desired false positive rate
- M = number of bits in the array
- K = number of hash functions

Rule of thumb: ~10 bits per item gives ~1% FP rate.
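The formulas above can be checked directly in a few lines (`bloom_params` is a hypothetical helper name, not a library function):

```python
import math

def bloom_params(n_items, fp_rate):
    """Optimal bit-array size M and hash count K for the formulas above."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = round((m / n_items) * math.log(2))
    return m, k

m, k = bloom_params(1_000_000, 0.01)
print(m, k)  # ~9.6 million bits and 7 hash functions, matching the table
```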
> [!IMPORTANT]
> Standard Bloom filters cannot delete items. Once a bit is set, you cannot know whether other items also depend on that bit. If you need deletion, use a Counting Bloom Filter (each position is a counter instead of a bit) or a Cuckoo Filter.
## Bloom Filter Variants
| Variant | Key Feature | Tradeoff | Use Case |
|---|---|---|---|
| Counting Bloom Filter | Supports deletion (counters instead of bits) | 4× more memory | Dynamic sets with frequent additions/removals |
| Scalable Bloom Filter | Grows automatically as items are added | Slightly higher FP rate | When you don't know the expected item count upfront |
| Cuckoo Filter | Supports deletion + lower FP at same memory | More complex implementation | Modern replacement for Bloom filters |
| Quotient Filter | Cache-friendly, supports deletion and merging | More complex | SSD-based storage engines |
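The Counting Bloom Filter row above can be sketched by swapping bits for counters. The sizes and the SHA-256-based seeded hash here are illustrative, not a production choice:

```python
import hashlib

class CountingBloomFilter:
    """Sketch of a counting Bloom filter: counters instead of bits,
    which makes deletion possible (at ~4x the memory)."""
    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Illustrative seeded hash; real filters use faster hashes
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # Only safe for items that were actually added
        for i in self._indexes(item):
            self.counters[i] -= 1

    def might_contain(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

Removing an item decrements its counters, so other items that share some of those positions remain findable.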
## Other Probabilistic Data Structures
### HyperLogLog (Cardinality Estimation)
Answers: "Approximately how many unique items are in this data set?" Using only 12 KB of memory, HyperLogLog can estimate the number of unique elements in a set of billions with ~0.81% error. Redis has built-in HyperLogLog commands (PFADD, PFCOUNT).
Redis HyperLogLog example:

```
PFADD visitors "user:123" "user:456" "user:123"   # "user:123" counted once
PFCOUNT visitors                                  # → 2
```

Even with 100 million unique visitors, memory usage is still just 12 KB.
- Use case: Counting unique visitors to a website, unique search queries per day, unique IPs accessing a service.
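The core idea of HyperLogLog, tracking per register the longest run of zero bits among item hashes, can be sketched in plain Python. This toy version (all names and parameters are illustrative) uses the standard bias constant and small-range correction but omits the refinements real implementations like Redis's apply:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: 2^p registers, each remembering the longest run
    of trailing zero bits seen among the hashes routed to it."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # standard bias constant

    def add(self, item):
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        idx = x & (self.m - 1)    # low p bits choose a register
        w = x >> self.p           # remaining bits feed the zero-run count
        rank = 1
        while w & 1 == 0 and rank < 64:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:   # small-range correction (linear counting)
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=10 (1024 registers, roughly 1 KB of counters here) the expected relative error is about 1.04/√1024 ≈ 3%.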
### Count-Min Sketch (Frequency Estimation)
Answers: "Approximately how many times has this item appeared?" It uses multiple hash functions mapping to a 2D array of counters. It may over-count (never under-counts).
```
Count-Min Sketch (d=3 hash functions, w=8 columns):

        Col 0  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7
Hash1:    0      3      0      0      7      0      0      1
Hash2:    0      0      5      0      0      3      0      0
Hash3:    2      0      0      3      0      0      0      0

Query "trending_topic": hash1→4, hash2→2, hash3→3
Counts: 7, 5, 3
Estimate: min(7, 5, 3) = 3 (close to actual count)
```
- Use case: Identifying trending hashtags on Twitter, detecting network traffic anomalies, tracking most-played songs on Spotify, finding heavy hitters in a data stream.
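The structure described above fits in a few lines of Python. Dimensions and the hashing scheme are illustrative:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters. The estimate is
    the minimum across rows, so it can over-count but never under-count."""
    def __init__(self, d=4, w=256):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _index(self, item, row):
        # One seeded hash per row (illustrative; real sketches use faster hashes)
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.d))
```

Collisions in a row inflate that row's counter, which is why taking the minimum across several independent rows tightens the estimate.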
## When to Use Which Data Structure
| Question You Need Answered | Data Structure | Memory | Error Type |
|---|---|---|---|
| "Is this item in the set?" | Bloom Filter | ~1.2 GB / 1B items | False positives only |
| "How many unique items?" | HyperLogLog | 12 KB (fixed!) | ±0.81% error |
| "How often did this item appear?" | Count-Min Sketch | Configurable | Over-counts only |
| "What are the top-K items?" | Count-Min Sketch + Heap | Configurable | Approximate ranking |
| "Is this item in the set? (with deletion)" | Cuckoo Filter | Similar to Bloom | False positives only |
## Real-World Usage
### Google Chrome Safe Browsing
Chrome checks every URL you visit against a database of known malicious URLs. Instead of sending every URL to Google (privacy nightmare) or downloading the entire blacklist (too large), Chrome uses a Bloom filter stored locally. If the Bloom filter says "no" → the URL is safe (no network request needed). If the Bloom filter says "maybe" → Chrome sends a hash prefix to Google's server for a definitive check. This reduces network requests by over 99%.
URL Visit Flow:

```
URL → Local Bloom Filter
        → "Definitely safe" (99% of URLs)  → Allow immediately
        → "Maybe malicious" (~1%)          → Query Google server → Confirm/Deny
```
### Cassandra (SSTable Lookup)
Cassandra stores data in immutable files called SSTables. When a read comes in, Cassandra might need to check multiple SSTables to find the data. Each SSTable has an associated Bloom filter. Before doing an expensive disk read, Cassandra checks the Bloom filter: "Could this key be in this SSTable?" If the Bloom filter says no, it skips that file entirely, saving significant I/O.
Impact: Without Bloom filters, a read might check 10 SSTables (10 disk reads). With Bloom filters, it typically checks only 1–2 (the ones that actually contain the key).
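The skip logic can be illustrated with a toy model. The table contents and key names are invented, and a small inline Bloom filter guards each simulated SSTable:

```python
import hashlib

def bloom_indexes(key, size=1024, k=5):
    # Illustrative seeded hash; real systems use faster non-cryptographic hashes
    for seed in range(k):
        h = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
        yield int(h, 16) % size

class SSTable:
    """Toy SSTable: a dict of rows guarded by a Bloom filter."""
    def __init__(self, rows):
        self.rows = rows
        self.bits = [0] * 1024
        for key in rows:
            for i in bloom_indexes(key):
                self.bits[i] = 1
        self.disk_reads = 0                # count of (simulated) disk hits

    def get(self, key):
        if not all(self.bits[i] for i in bloom_indexes(key)):
            return None                    # filter says no: skip this file
        self.disk_reads += 1               # filter says maybe: pay the I/O
        return self.rows.get(key)

tables = [SSTable({"k1": "v1"}), SSTable({"k2": "v2"}), SSTable({"k3": "v3"})]

def read(key):
    for t in tables:
        value = t.get(key)
        if value is not None:
            return value
    return None
```

A read for a key in the last table touches only that file; the Bloom filters on the first two answer "definitely not" without any simulated disk access.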
### Medium (Article Recommendation Dedup)
Medium uses Bloom filters to avoid recommending articles a user has already read. Instead of querying the database for every candidate article ("has user X read article Y?"), they check a per-user Bloom filter. The occasional false positive (skipping an unread article) is acceptable—showing a duplicate is worse.
### Bitcoin SPV Nodes
Lightweight Bitcoin nodes (SPV, Simplified Payment Verification) use Bloom filters to request only relevant transactions from full nodes without revealing exactly which addresses they own. This provides a balance between privacy and bandwidth efficiency.
## Common Mistakes
- ❌ Confusing false positives with false negatives — Bloom filters never produce false negatives. If it says "no," trust it absolutely.
- ❌ Trying to delete from a standard Bloom filter — use a Counting Bloom Filter or Cuckoo Filter instead.
- ❌ Using a Bloom filter when exact membership is required — if you cannot tolerate any false positives (e.g., financial dedup), use a proper HashSet or database lookup.
- ❌ Not sizing the filter correctly — if the actual number of items exceeds your initial estimate, the false positive rate increases dramatically. Use a Scalable Bloom Filter if uncertain.
- ❌ Using too few hash functions — the optimal K is (M/N) × ln(2). Too few or too many hash functions increase the FP rate.
> [!TIP]
> Key Takeaways:
> - Bloom filters trade a small false positive rate for massive memory savings: 1 billion items in ~1.2 GB vs ~20 GB for a HashSet.
> - No false negatives: "definitely not in the set" is always correct.
> - ~10 bits per item gives ~1% FP rate. Optimal K ≈ (M/N) × ln(2).
> - HyperLogLog counts unique items in 12 KB. Count-Min Sketch estimates frequencies.
> - Cuckoo Filters are the modern replacement when you need deletion support.
> - In interviews, mention Bloom filters for web crawlers (URL dedup), databases (SSTable lookup), and caches (avoiding unnecessary lookups).