System Design: The Complete Guide

18. Bloom Filters & Probabilistic Data Structures

Space-efficient data structures that trade a tiny error margin for massive memory savings.

Mar 5, 2026

[!NOTE] A Bloom filter can tell you "this item is definitely NOT in the set" or "this item is probably in the set." It can never give a false negative, but it can give a false positive. This tiny compromise lets you check membership in a set of billions of items using only a few megabytes of memory.

The Membership Problem

In many systems, you need to quickly answer: "Have I seen this item before?"

  • Has this URL already been crawled? (web crawler)
  • Is this website malicious? (browser safe browsing)
  • Has the user already been recommended this article? (content feed)
  • Does this row exist in the database before doing an expensive disk read? (database optimization)
  • Has this email been sent before? (deduplication)

The naive approach—store every seen item in a HashSet—works until the set grows to billions of items. At that point, a HashSet of 1 billion strings consumes ~20 GB of RAM. A Bloom filter can answer the same question using ~1.2 GB with a 1% false positive rate. That's roughly a 16x memory reduction.

How Bloom Filters Work

  1. Start with a bit array of M bits, all initialized to 0.
  2. Choose K independent hash functions, each mapping an item to a position in the bit array.
  3. To add an item: hash it with all K functions, set those K bit positions to 1.
  4. To check an item: hash it with all K functions, check if ALL K positions are 1.
    • If any position is 0 → the item is definitely NOT in the set.
    • If all positions are 1 → the item is probably in the set (could be a false positive from other items setting those bits).
```
Bit array (M = 16):  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Add "apple" (K=3):   hash1→3, hash2→7, hash3→11
                     [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0]

Add "banana" (K=3):  hash1→1, hash2→7, hash3→14
                     [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0]

Check "cherry":      hash1→3, hash2→5, hash3→11
  Position 5 is 0   → "cherry" is DEFINITELY NOT in the set ✓

Check "date":        hash1→1, hash2→7, hash3→11
  All positions are 1 → "date" is PROBABLY in the set (false positive!)
```

Implementation Pseudocode

```python
import math
import hashlib

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        # Optimal size: M = -(N × ln(p)) / (ln 2)²
        self.size = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        # Optimal hash count: K = (M / N) × ln 2
        self.num_hashes = round((self.size / expected_items) * math.log(2))
        self.bit_array = [0] * self.size

    def _hash(self, item, seed):
        # Derive K independent hash values by mixing a seed into the input
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bit_array[self._hash(item, seed)] = 1

    def might_contain(self, item):
        for seed in range(self.num_hashes):
            if self.bit_array[self._hash(item, seed)] == 0:
                return False   # Definitely NOT in set
        return True            # Probably in set
```

Usage:

```python
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True (correct)
bf.might_contain("user:789")  # False (correct) or True (false positive)
```

Tuning: False Positive Rate vs Memory

| Items (N) | False Positive Rate | Bits Needed (M) | Memory | Hash Functions (K) |
|-----------|---------------------|-----------------|--------|---------------------|
| 1 million | 1% | ~9.6M bits | ~1.2 MB | 7 |
| 1 million | 0.1% | ~14.4M bits | ~1.8 MB | 10 |
| 1 billion | 1% | ~9.6B bits | ~1.2 GB | 7 |
| 1 billion | 0.01% | ~19.2B bits | ~2.4 GB | 13 |

The formulas:

```
Optimal bit array size:  M = -(N × ln(p)) / (ln(2))²
Optimal hash functions:  K = (M / N) × ln(2)

Where:
  N = expected number of items
  p = desired false positive rate
  M = number of bits in the array
  K = number of hash functions
```

Rule of thumb: ~10 bits per item gives ~1% FP rate
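The formulas above are easy to sanity-check in code. The sketch below (the helper name `bloom_params` is mine, not from any library) reproduces the first row of the table:

```python
import math

def bloom_params(n, p):
    """Return (bits, hash_count) for n expected items and FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # M = -(N × ln p) / (ln 2)²
    k = round((m / n) * math.log(2))                       # K = (M / N) × ln 2
    return m, k

m, k = bloom_params(1_000_000, 0.01)
print(m, k)            # ~9.6M bits (~1.2 MB) and 7 hash functions
print(m / 1_000_000)   # ~9.6 bits per item, matching the rule of thumb
```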

[!IMPORTANT] Standard Bloom filters cannot delete items. Once a bit is set, you cannot know if other items also depend on that bit. If you need deletion, use a Counting Bloom Filter (each position is a counter instead of a bit) or a Cuckoo Filter.

Bloom Filter Variants

| Variant | Key Feature | Tradeoff | Use Case |
|---------|-------------|----------|----------|
| Counting Bloom Filter | Supports deletion (counters instead of bits) | 4× more memory | Dynamic sets with frequent additions/removals |
| Scalable Bloom Filter | Grows automatically as items are added | Slightly higher FP rate | When you don't know the expected item count upfront |
| Cuckoo Filter | Supports deletion + lower FP at same memory | More complex implementation | Modern replacement for Bloom filters |
| Quotient Filter | Cache-friendly, supports deletion and merging | More complex | SSD-based storage engines |
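To make the counting variant's tradeoff concrete, here is a minimal sketch (class and sizes are illustrative, not a production implementation): each position holds a counter instead of a bit, so `remove` can decrement what `add` incremented.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size  # counters instead of single bits → ~4× memory

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only safe for items that were actually added; removing an item
        # that was never inserted corrupts counters shared with other items.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Deletion works because decrementing only undoes this item's contribution; other items that hashed to the same position keep the counter above zero.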

Other Probabilistic Data Structures

HyperLogLog (Cardinality Estimation)

Answers: "Approximately how many unique items are in this data set?" Using only 12 KB of memory, HyperLogLog can estimate the number of unique elements in a set of billions with ~0.81% error. Redis has built-in HyperLogLog commands (PFADD, PFCOUNT).

Redis HyperLogLog example:

```
PFADD visitors "user:123" "user:456" "user:123"   # user:123 counted once
PFCOUNT visitors   # → 2
```

Even with 100 million unique visitors, memory usage is still just 12 KB.
  • Use case: Counting unique visitors to a website, unique search queries per day, unique IPs accessing a service.
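The core trick can be sketched in a few lines: hash each item, use the first few bits to pick a register, and record the longest run of leading zeros seen per register. This is a simplified illustration of the standard estimator, without the small- and large-range corrections a real implementation like Redis applies:

```python
import hashlib

class TinyHyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                  # m = 2^b registers (1024 here)
        self.registers = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.b)                    # first b bits pick a register
        rest = x & ((1 << (64 - self.b)) - 1)     # remaining 54 bits
        rank = (64 - self.b) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias-correction constant
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z

hll = TinyHyperLogLog()
for i in range(100_000):
    hll.add(f"user:{i}")
print(round(hll.count()))   # estimate close to 100,000
```

Each register is a small integer (at most ~6 bits in practice), which is why the whole structure fits in kilobytes regardless of how many items are added.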

Count-Min Sketch (Frequency Estimation)

Answers: "Approximately how many times has this item appeared?" It uses multiple hash functions mapping to a 2D array of counters. It may over-count (never under-counts).

```
Count-Min Sketch (d=3 hash functions, w=8 columns):

       Col 0  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7
Hash1:   0      3      0      0      7      0      0      1
Hash2:   0      0      5      0      0      3      0      0
Hash3:   2      0      0      3      0      0      0      0

Query "trending_topic": hash1→4, hash2→2, hash3→3
  Counts: 7, 5, 3
  Estimate: min(7, 5, 3) = 3  (close to actual count)
```

  • Use case: Identifying trending hashtags on Twitter, detecting network traffic anomalies, tracking most-played songs on Spotify, finding heavy hitters in a data stream.
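The structure described above can be sketched directly (a minimal illustration; names and default sizes are mine):

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=3, width=256):
        self.depth = depth   # d independent hash rows
        self.width = width   # w counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Take the minimum across rows: collisions only inflate counters,
        # so the estimate may over-count but never under-counts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(7):
    cms.add("trending_topic")
print(cms.estimate("trending_topic"))   # 7 (exact here; may over-count under load)
```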

When to Use Which Data Structure

| Question You Need Answered | Data Structure | Memory | Error Type |
|----------------------------|----------------|--------|------------|
| "Is this item in the set?" | Bloom Filter | ~1.2 GB / 1B items | False positives only |
| "How many unique items?" | HyperLogLog | 12 KB (fixed!) | ±0.81% error |
| "How often did this item appear?" | Count-Min Sketch | Configurable | Over-counts only |
| "What are the top-K items?" | Count-Min Sketch + Heap | Configurable | Approximate ranking |
| "Is this item in the set? (with deletion)" | Cuckoo Filter | Similar to Bloom | False positives only |

Real-World Usage

Google Chrome Safe Browsing

Chrome checks every URL you visit against a database of known malicious URLs. Instead of sending every URL to Google (privacy nightmare) or downloading the entire blacklist (too large), Chrome uses a Bloom filter stored locally. If the Bloom filter says "no" → the URL is safe (no network request needed). If the Bloom filter says "maybe" → Chrome sends a hash prefix to Google's server for a definitive check. This reduces network requests by over 99%.

```
URL Visit Flow:
  URL → Local Bloom Filter
    → "Definitely safe" (99% of URLs) → Allow immediately
    → "Maybe malicious" (~1%) → Query Google server → Confirm/Deny
```
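The same two-tier pattern applies to any cheap-filter-then-authoritative-check design. A hypothetical sketch (the stub filter and the check function are stand-ins, not Chrome's actual API):

```python
class StubFilter:
    """Stand-in for a local Bloom filter; a plain set, so no false positives here."""
    def __init__(self, items):
        self.items = set(items)
    def might_contain(self, url):
        return url in self.items

def is_malicious(url, local_filter, authoritative_check):
    if not local_filter.might_contain(url):
        return False                 # definite negative: skip the network entirely
    return authoritative_check(url)  # rare "maybe" path: remote confirmation

blacklist = StubFilter(["http://evil.example"])
print(is_malicious("http://safe.example", blacklist, lambda u: True))   # False, no remote call
```

Because "no" answers are always correct, the expensive check only ever runs on the small "maybe" fraction of lookups.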

Cassandra (SSTable Lookup)

Cassandra stores data in immutable files called SSTables. When a read comes in, Cassandra might need to check multiple SSTables to find the data. Each SSTable has an associated Bloom filter. Before doing an expensive disk read, Cassandra checks the Bloom filter: "Could this key be in this SSTable?" If the Bloom filter says no, it skips that file entirely, saving significant I/O.

Impact: Without Bloom filters, a read might check 10 SSTables (10 disk reads). With Bloom filters, it typically checks only 1–2 (the ones that actually contain the key).

Medium (Article Recommendation Dedup)

Medium uses Bloom filters to avoid recommending articles a user has already read. Instead of querying the database for every candidate article ("has user X read article Y?"), they check a per-user Bloom filter. The occasional false positive (skipping an unread article) is acceptable—showing a duplicate is worse.

Bitcoin SPV Nodes

Lightweight Bitcoin nodes (SPV, Simplified Payment Verification) use Bloom filters to request only relevant transactions from full nodes, without revealing exactly which addresses they own. This provides a balance between privacy and bandwidth efficiency.

Common Mistakes

  • ❌ Confusing false positives with false negatives — Bloom filters never produce false negatives. If it says "no," trust it absolutely.
  • ❌ Trying to delete from a standard Bloom filter — use a Counting Bloom Filter or Cuckoo Filter instead.
  • ❌ Using a Bloom filter when exact membership is required — if you cannot tolerate any false positives (e.g., financial dedup), use a proper HashSet or database lookup.
  • ❌ Not sizing the filter correctly — if the actual number of items exceeds your initial estimate, the false positive rate increases dramatically. Use a Scalable Bloom Filter if uncertain.
  • ❌ Using too few hash functions — the optimal K is (M/N) × ln(2). Too few or too many hash functions increase the FP rate.

[!TIP] Key Takeaways:
• Bloom filters trade a small false positive rate for massive memory savings — 1 billion items in ~1.2 GB vs ~20 GB for a HashSet.
• No false negatives: "definitely not in the set" is always correct.
• ~10 bits per item gives ~1% FP rate. Optimal K ≈ (M/N) × ln(2).
• HyperLogLog counts unique items in 12 KB. Count-Min Sketch estimates frequencies.
• Cuckoo Filters are the modern replacement when you need deletion support.
• In interviews, mention Bloom filters for web crawlers (URL dedup), databases (SSTable lookup), and caches (avoiding unnecessary lookups).
