Skip to content
QuizMaker logoQuizMaker
Activity
System Design: The Complete Guide
5. Advanced Concepts & Patterns
1. Introduction to System Design
2. Vertical vs Horizontal Scaling
3. Load Balancing
4. Caching Strategies
5. CDNs (Content Delivery Networks)
6. SQL vs NoSQL
7. Database Sharding & Partitioning
8. The CAP Theorem
9. Microservices Architecture
10. Message Queues & Event Streaming
12. Design BookMyShow (Ticket Booking)
14. Design Dropbox (Cloud File Storage)
15. How to Approach Any System Design Interview
16. Back-of-the-Envelope Estimation
17. Consistent Hashing
18. Bloom Filters & Probabilistic Data Structures
19. Database Replication
20. Leader Election & Consensus (Raft & Paxos)
21. Distributed Transactions (Saga, 2PC, Outbox)
22. Event Sourcing & CQRS
23. Unique ID Generation at Scale
24. Rate Limiting Algorithms
25. Circuit Breakers & Bulkhead Pattern
26. API Gateway, Proxies & Service Mesh
27. Real-Time Communication
28. Observability (Tracing, Logging, SLOs)
30. Design a Chat System (WhatsApp)
31. Design YouTube (Video Streaming)
32. Design a Web Crawler
CONTENTS

18. Bloom Filters & Probabilistic Data Structures

Space-efficient data structures that trade a tiny error margin for massive memory savings.

System Design: The Complete Guide
5. Advanced Concepts & Patterns
March 5, 2026
84
A

[!NOTE] A Bloom filter can tell you "this item is definitely NOT in the set" or "this item is probably in the set." It can never give a false negative, but it can give a false positive. This tiny compromise lets you check membership in a set of billions of items using only a few megabytes of memory.

The Membership Problem

In many systems, you need to quickly answer: "Have I seen this item before?"

  • Has this URL already been crawled? (web crawler)
  • Is this website malicious? (browser safe browsing)
  • Has the user already been recommended this article? (content feed)
  • Does this row exist in the database before doing an expensive disk read? (database optimization)
  • Has this email been sent before? (deduplication)

The naive approach—store every seen item in a HashSet—works until the set grows to billions of items. At that point, a HashSet of 1 billion strings consumes ~20 GB of RAM. A Bloom filter can answer the same question using ~1.2 GB with a 1% false positive rate. That''s a 15x memory reduction.

How Bloom Filters Work

  1. Start with a bit array of M bits, all initialized to 0.
  2. Choose K independent hash functions, each mapping an item to a position in the bit array.
  3. To add an item: hash it with all K functions, set those K bit positions to 1.
  4. To check an item: hash it with all K functions, check if ALL K positions are 1.
    • If any position is 0 → the item is definitely NOT in the set.
    • If all positions are 1 → the item is probably in the set (could be a false positive from other items setting those bits).
Bit array (M = 16):  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Add "apple" (K=3):   hash1→3, hash2→7, hash3→11
                     [0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0]

Add "banana" (K=3):  hash1→1, hash2→7, hash3→14
                     [0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0]

Check "cherry":      hash1→3, hash2→5, hash3→11
  Position 5 is 0   → "cherry" is DEFINITELY NOT in the set ✓

Check "date":        hash1→1, hash2→7, hash3→11
  All positions are 1 → "date" is PROBABLY in the set (false positive!)

Implementation Pseudocode

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        # Calculate optimal size
        self.size = int(-expected_items * ln(fp_rate) / (ln(2)**2))
        self.num_hashes = int((self.size / expected_items) * ln(2))
        self.bit_array = [0] * self.size

    def add(self, item):
        for seed in range(self.num_hashes):
            index = hash(item, seed) % self.size
            self.bit_array[index] = 1

    def might_contain(self, item):
        for seed in range(self.num_hashes):
            index = hash(item, seed) % self.size
            if self.bit_array[index] == 0:
                return False   # Definitely NOT in set
        return True            # Probably in set

# Usage:
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True (correct)
bf.might_contain("user:789")  # False (correct) or True (false positive)

Tuning: False Positive Rate vs Memory

Items (N)False Positive RateBits Needed (M)MemoryHash Functions (K)
1 million1%~9.6M bits~1.2 MB7
1 million0.1%~14.4M bits~1.8 MB10
1 billion1%~9.6B bits~1.2 GB7
1 billion0.01%~19.2B bits~2.4 GB13

The formulas:

Optimal bit array size:  M = -(N × ln(p)) / (ln(2))²
Optimal hash functions:  K = (M / N) × ln(2)

Where:
  N = expected number of items
  p = desired false positive rate
  M = number of bits in the array
  K = number of hash functions

Rule of thumb: ~10 bits per item gives ~1% FP rate

[!IMPORTANT] Standard Bloom filters cannot delete items. Once a bit is set, you cannot know if other items also depend on that bit. If you need deletion, use a Counting Bloom Filter (each position is a counter instead of a bit) or a Cuckoo Filter.

Bloom Filter Variants

VariantKey FeatureTradeoffUse Case
Counting Bloom FilterSupports deletion (counters instead of bits)4× more memoryDynamic sets with frequent additions/removals
Scalable Bloom FilterGrows automatically as items are addedSlightly higher FP rateWhen you don''t know the expected item count upfront
Cuckoo FilterSupports deletion + lower FP at same memoryMore complex implementationModern replacement for Bloom filters
Quotient FilterCache-friendly, supports deletion and mergingMore complexSSD-based storage engines

Other Probabilistic Data Structures

HyperLogLog (Cardinality Estimation)

Answers: "Approximately how many unique items are in this data set?" Using only 12 KB of memory, HyperLogLog can estimate the number of unique elements in a set of billions with ~0.81% error. Redis has built-in HyperLogLog commands (PFADD, PFCOUNT).

# Redis HyperLogLog example:
PFADD visitors "user:123" "user:456" "user:123"  # 123 counted once
PFCOUNT visitors → 2

# Even with 100 million unique visitors:
# Memory usage: still just 12 KB!
  • Use case: Counting unique visitors to a website, unique search queries per day, unique IPs accessing a service.

Count-Min Sketch (Frequency Estimation)

Answers: "Approximately how many times has this item appeared?" It uses multiple hash functions mapping to a 2D array of counters. It may over-count (never under-counts).

Count-Min Sketch (d=3 hash functions, w=8 columns):

       Col 0  Col 1  Col 2  Col 3  Col 4  Col 5  Col 6  Col 7
Hash1:   0      3      0      0      7      0      0      1
Hash2:   0      0      5      0      0      3      0      0
Hash3:   2      0      0      3      0      0      0      0

Query "trending_topic": hash1→4, hash2→2, hash3→3
  Counts: 7, 5, 3
  Estimate: min(7, 5, 3) = 3  (close to actual count)
  • Use case: Identifying trending hashtags on Twitter, detecting network traffic anomalies, tracking most-played songs on Spotify, finding heavy hitters in a data stream.

When to Use Which Data Structure

Question You Need AnsweredData StructureMemoryError Type
"Is this item in the set?"Bloom Filter~1.2 GB / 1B itemsFalse positives only
"How many unique items?"HyperLogLog12 KB (fixed!)±0.81% error
"How often did this item appear?"Count-Min SketchConfigurableOver-counts only
"What are the top-K items?"Count-Min Sketch + HeapConfigurableApproximate ranking
"Is this item in the set? (with deletion)"Cuckoo FilterSimilar to BloomFalse positives only

Real-World Usage

Google Chrome Safe Browsing

Chrome checks every URL you visit against a database of known malicious URLs. Instead of sending every URL to Google (privacy nightmare) or downloading the entire blacklist (too large), Chrome uses a Bloom filter stored locally. If the Bloom filter says "no" → the URL is safe (no network request needed). If the Bloom filter says "maybe" → Chrome sends a hash prefix to Google''s server for a definitive check. This reduces network requests by over 99%.

URL Visit Flow:
  URL → Local Bloom Filter
    → "Definitely safe" (99% of URLs) → Allow immediately
    → "Maybe malicious" (~1%) → Query Google server → Confirm/Deny

Cassandra (SSTable Lookup)

Cassandra stores data in immutable files called SSTables. When a read comes in, Cassandra might need to check multiple SSTables to find the data. Each SSTable has an associated Bloom filter. Before doing an expensive disk read, Cassandra checks the Bloom filter: "Could this key be in this SSTable?" If the Bloom filter says no, it skips that file entirely, saving significant I/O.

Impact: Without Bloom filters, a read might check 10 SSTables (10 disk reads). With Bloom filters, it typically checks only 1–2 (the ones that actually contain the key).

Medium (Article Recommendation Dedup)

Medium uses Bloom filters to avoid recommending articles a user has already read. Instead of querying the database for every candidate article ("has user X read article Y?"), they check a per-user Bloom filter. The occasional false positive (skipping an unread article) is acceptable—showing a duplicate is worse.

Bitcoin SPV Nodes

Lightweight Bitcoin nodes (SPV

  • Simplified Payment Verification) use Bloom filters to request only relevant transactions from full nodes, without revealing exactly which addresses they own. This provides a balance between privacy and bandwidth efficiency.

Common Mistakes

  • ❌ Confusing false positives with false negatives — Bloom filters never produce false negatives. If it says "no," trust it absolutely.
  • ❌ Trying to delete from a standard Bloom filter — use a Counting Bloom Filter or Cuckoo Filter instead.
  • ❌ Using a Bloom filter when exact membership is required — if you cannot tolerate any false positives (e.g., financial dedup), use a proper HashSet or database lookup.
  • ❌ Not sizing the filter correctly — if the actual number of items exceeds your initial estimate, the false positive rate increases dramatically. Use a Scalable Bloom Filter if uncertain.
  • ❌ Using too few hash functions — the optimal K is (M/N) × ln(2). Too few or too many hash functions increase the FP rate.

[!TIP] Key Takeaways:
• Bloom filters trade a small false positive rate for massive memory savings — 1 billion items in ~1.2 GB vs ~20 GB for a HashSet.
• No false negatives: "definitely not in the set" is always correct.
• ~10 bits per item gives ~1% FP rate. Optimal K ≈ (M/N) × ln(2).
• HyperLogLog counts unique items in 12 KB. Count-Min Sketch estimates frequencies.
• Cuckoo Filters are the modern replacement when you need deletion support.
• In interviews, mention Bloom filters for web crawlers (URL dedup), databases (SSTable lookup), and caches (avoiding unnecessary lookups).

Share this article

Share on TwitterShare on LinkedInShare on FacebookShare on WhatsAppShare on Email

Test your knowledge

Take a quick quiz based on this chapter.

mediumSystem Design
Quiz: Bloom Filters
5 questions5 min

0 comments

Please login to comment.
No comments yet.
Lesson 2 of 12 in 5. Advanced Concepts & Patterns
Previous in 5. Advanced Concepts & Patterns
17. Consistent Hashing
Next in 5. Advanced Concepts & Patterns
19. Database Replication
Back to System Design: The Complete Guide
Back to moduleCategories