System Design: The Complete Guide
6. More Case Studies
32. Design a Web Crawler

How search engines discover and index the web — URL frontier, politeness, deduplication, and distributed crawling.

Mar 5, 2026

[!NOTE] A web crawler is the backbone of every search engine. Google's crawler, Googlebot, discovers and indexes hundreds of billions of web pages. Designing a web crawler tests your understanding of distributed systems, queuing, deduplication, and politeness constraints. It is a classic system design interview question that goes beyond simple CRUD.

Step 1: Requirements

| Aspect | Requirement |
| --- | --- |
| Purpose | Crawl the web to build a search index |
| Scale | 1 billion pages per month (~400 pages/second) |
| Content types | HTML only (for simplicity; real crawlers handle PDFs, images, etc.) |
| Politeness | Respect robots.txt and rate-limit per domain |
| Deduplication | Avoid re-crawling identical content |
| Freshness | Re-crawl popular/changing pages more frequently |

Step 2: High-Level Design (v1)

  [Seed URLs] → [URL Frontier (Queue)]
                       │
                       ▼
                [Fetcher (HTTP GET)]
                       │
                       ▼
                [HTML Parser] → Extract links → [URL Frontier]
                       │
                       ▼
                [Content Store] → [Search Indexer]

This is the basic crawl loop: take a URL from the queue, fetch the page, parse it for links, add new links to the queue, store the content.
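The loop above can be sketched in a few lines of Python. The `fetch` and `parse_links` callables are hypothetical stand-ins for a real HTTP client and HTML parser, injected so the loop itself stays testable:

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, max_pages=100):
    """Basic crawl loop: dequeue a URL, fetch it, extract links,
    enqueue links we have not seen, and store the content."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    store = {}                       # URL -> page content
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)            # HTTP GET (injected for testability)
        store[url] = html            # handed to the indexer in a real system
        for link in parse_links(html):
            if link not in seen:     # URL-level dedup (Bloom filter at scale)
                seen.add(link)
                frontier.append(link)
    return store
```

Everything that follows in this chapter hardens one piece of this loop: the frontier gains priorities and politeness, the `seen` set becomes a Bloom filter, and the single process becomes a fleet of fetchers.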

Step 3: URL Frontier (Priority Queue)

The URL frontier is not a simple FIFO queue. It must handle:

  • Prioritization: Important pages (e.g., news sites, high-PageRank domains) should be crawled first and more frequently.
  • Politeness: Don't hammer a single domain. Ensure a delay between requests to the same host.
URL Frontier Architecture:

  Incoming URLs
       │
   [Priority Queue]  (rank URLs by importance)
       │
   [Politeness Router]  (group URLs by domain)
       │
   ┌───┴────────────────────────┐
   │  Domain Queue: nytimes.com │ → [Fetcher Thread 1]
   │  Domain Queue: github.com  │ → [Fetcher Thread 2]
   │  Domain Queue: reddit.com  │ → [Fetcher Thread 3]
   │  ...                       │
   └────────────────────────────┘
   
   Each domain queue has a rate limiter (e.g., 1 req/sec per domain)

Politeness is critical. Without it, your crawler will be blocked by websites, and you may be violating their terms of service. Always check robots.txt before crawling any page.
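Checking robots.txt does not require a third-party library; Python's standard library includes a parser. A minimal sketch (the robots.txt content and the `MyCrawler` user-agent are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body fetched elsewhere. RobotFileParser can also fetch it
# itself via set_url()/read(), but parsing a cached copy avoids a network
# round-trip on every check.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("MyCrawler", "https://example.com/private/data"))  # disallowed
print(rp.crawl_delay("MyCrawler"))  # seconds to wait between requests
```

In production you would cache the parsed rules per domain with a TTL, as the edge-case table later in this chapter suggests.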

Step 4: Deduplication

The web has massive duplication. The same content appears at different URLs (www vs non-www, HTTP vs HTTPS, URL parameters). Two levels of dedup:

URL Deduplication

Before adding a URL to the frontier, check if it has already been seen. Use a Bloom filter (from Chapter 18) for space-efficient "seen" checks:

URL → hash → check Bloom filter
  → Probably seen? Skip.
  → Definitely not seen? Add to frontier.

With 1 billion URLs, a Bloom filter with 1% false positive rate needs only ~1.2 GB of memory.
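That 1.2 GB figure comes from the standard sizing formula m = -n·ln(p)/(ln 2)², which for n = 10⁹ and p = 0.01 gives about 9.6 billion bits, with k = (m/n)·ln 2 ≈ 7 hash functions. A toy Bloom filter, using double hashing to derive k indices from one SHA-256 digest (sizes here are illustrative, not the production 1.2 GB):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions derived from one SHA-256 digest."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # Double hashing: combine two 8-byte halves into k indices.
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely unseen; True means probably seen."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note the asymmetry: a false positive means the crawler skips a genuinely new URL, which is acceptable; there are no false negatives, so no URL is ever crawled twice because of the filter.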

Content Deduplication

Even with different URLs, pages may have identical content (mirrors, syndication). Use a content fingerprint:

  1. Compute a hash (e.g., SHA-256) of the page content after stripping boilerplate (headers, footers, ads).
  2. Compare against a set of seen content hashes.
  3. If duplicate → skip indexing. If unique → store and index.

For near-duplicate detection (pages that are 90% similar), use SimHash — a locality-sensitive hash where similar documents produce similar hashes. Google uses SimHash to detect near-duplicate web pages.
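A sketch of the SimHash idea over a token list (the 64-bit width and the MD5-based token hash are arbitrary choices for illustration): each token votes +1 or -1 on every bit position, and the sign of each total becomes that bit of the fingerprint, so documents sharing most tokens end up a small Hamming distance apart.

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash: similar token multisets produce fingerprints
    with a small Hamming distance."""
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1   # each token votes per bit
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In practice you would index fingerprints so that near matches (distance below some threshold, e.g. 3 bits) can be found without comparing against every stored hash.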

Step 5: Distributed Architecture (v2)

                    ┌──────────────┐
                    │  Seed URLs    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │  URL Frontier │
                    │  (Kafka/Redis)│
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  [Fetcher 1]        [Fetcher 2]        [Fetcher N]
  (datacenter A)     (datacenter B)     (datacenter C)
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
                    ┌──────▼───────┐
                    │  HTML Parser  │
                    │  (workers)    │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        [Content Store] [URL Dedup]  [Link Extractor]
        (S3 / HDFS)   (Bloom Filter) (→ URL Frontier)
              │
        [Search Indexer]

Partitioning Strategy

Distribute URLs across fetcher workers by domain hash: all URLs for nytimes.com go to the same fetcher. This ensures politeness (one fetcher controls the rate per domain) and enables local DNS caching.
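A minimal routing function for this scheme (the worker-count parameter is illustrative):

```python
import hashlib
from urllib.parse import urlsplit

def fetcher_for(url, num_fetchers):
    """Route a URL to a fetcher worker by hashing its host, so every URL
    for a given domain lands on the same worker."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_fetchers
```

One caveat of plain modulo hashing: changing the number of fetchers remaps most domains to new workers. Consistent hashing (Chapter 17) limits that churn to roughly 1/N of domains.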

Step 6: Handling Edge Cases

| Problem | Solution |
| --- | --- |
| Spider traps (infinite loops of generated URLs) | Set a maximum URL depth limit (e.g., 15 levels). Detect and blacklist trap domains. |
| Dynamic content (JavaScript-rendered pages) | Use headless browsers (Puppeteer/Playwright) for critical pages, or rely on server-side rendering detection. |
| robots.txt changes | Cache robots.txt per domain with a TTL (e.g., 24 hours). Re-fetch periodically. |
| Large files (videos, binaries) | Set a maximum download size. Check the Content-Type header before downloading the body. |

URL Frontier: Priority + Politeness

URL Frontier Architecture:

  [Incoming URLs]
       ↓
  [Prioritizer]  ← Assigns priority based on:
       |            • PageRank score
       |            • Domain authority
       |            • How recently the page changed
       |            • Content freshness (RSS, sitemap lastmod)
       ↓
  ┌─────────────────────────┐
  │ Priority Queues         │
  │ P0: Breaking news sites │  ← Crawl every 5 minutes
  │ P1: Popular blogs       │  ← Crawl every hour
  │ P2: Static docs         │  ← Crawl daily
  │ P3: Low-traffic sites   │  ← Crawl weekly
  └─────────┬───────────────┘
            ↓
  [Politeness Router]
       ↓
  ┌─────────────────────────┐
  │ Per-Domain Queues       │  ← One queue per domain
  │ example.com: [url1, ..] │     ensures we never send
  │ blog.dev:    [url3, ..] │     concurrent requests
  │ news.org:    [url5, ..] │     to the same domain
  └─────────────────────────┘
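The prioritizer and politeness router can be combined in a small scheduler: per-domain FIFO queues plus a min-heap of (next-allowed-fetch-time, domain). This is a deliberately simplified sketch — it forgets a domain's last-fetch time once its queue drains, which a real frontier would track:

```python
import heapq
from collections import defaultdict, deque

class PolitenessRouter:
    """Per-domain FIFO queues plus a heap of (next_fetch_time, domain),
    enforcing a fixed delay between requests to the same domain."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.queues = defaultdict(deque)   # domain -> pending URLs
        self.ready = []                    # heap of (next_time, domain)

    def add(self, domain, url, now):
        if not self.queues[domain]:        # first pending URL for this domain
            heapq.heappush(self.ready, (now, domain))
        self.queues[domain].append(url)

    def next_url(self, now):
        """Return a URL whose domain may be fetched now, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:            # reschedule after the delay
            heapq.heappush(self.ready, (now + self.delay, domain))
        return url
```

Because each domain appears in the heap at most once, two fetcher threads can never hit the same domain concurrently, which is exactly the guarantee the per-domain queues in the diagram provide.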

Spider Trap Detection

Spider traps are URLs that generate infinite pages (e.g., calendar with infinite next/prev links). Strategies to handle them:

| Strategy | How It Works | Example |
| --- | --- | --- |
| Depth limit | Stop crawling after N hops from seed | Max depth = 15 from root domain |
| URL length limit | Reject URLs longer than N characters | Max 200 chars (traps often have very long URLs) |
| Pattern detection | Detect repeating URL segments | /a/b/a/b/a/b/ → trap |
| Domain page cap | Limit pages crawled per domain | Max 10K pages per domain per crawl cycle |
| Manual blacklist | Operators flag known trap domains | Infinite calendar, generated query params |
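The length-limit and pattern-detection strategies fit in one small heuristic (thresholds are illustrative):

```python
from collections import Counter

def looks_like_trap(url, max_len=200, max_repeat=2):
    """Heuristic trap check: reject overlong URLs, or URLs where any
    path segment repeats more than max_repeat times (catches /a/b/a/b/a/b/)."""
    if len(url) > max_len:
        return True
    path_segments = url.split("://")[-1].split("/")[1:]  # drop scheme and host
    counts = Counter(seg for seg in path_segments if seg)
    return any(n > max_repeat for n in counts.values())
```

Heuristics like this produce false positives (some legitimate URLs repeat segments), so real crawlers combine several signals, plus the per-domain page cap, before blacklisting.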

Crawl Freshness: When to Re-Crawl

Re-crawl strategy based on page change frequency:

  • Check HTTP Last-Modified / ETag headers via conditional requests (If-Modified-Since / If-None-Match) → if the page is unchanged, the server returns 304 Not Modified and you skip the download (saves bandwidth).
  • Track change history per page:
      Changed 10 times in 30 days → re-crawl daily
      Changed 1 time in 90 days   → re-crawl monthly
      Never changed               → re-crawl quarterly
  • Use sitemaps: sites publish sitemap.xml with <lastmod> timestamps → only crawl changed pages.
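Both ideas can be sketched directly. The header builder reflects standard HTTP conditional-request semantics; the interval thresholds are illustrative values mirroring the schedule above, not a canonical formula:

```python
def conditional_headers(last_etag=None, last_modified=None):
    """Headers for a conditional GET: the server replies 304 Not Modified
    (no body) if the page is unchanged since the last crawl."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def next_crawl_interval_days(changes, window_days):
    """Adaptive re-crawl: frequently changing pages get short intervals."""
    if changes == 0:
        return 90                    # never changed -> quarterly
    rate = changes / window_days     # observed changes per day
    if rate >= 10 / 30:
        return 1                     # ~10 changes in 30 days -> daily
    if rate >= 1 / 90:
        return 30                    # ~1 change in 90 days -> monthly
    return 90
```

A real scheduler would also decay the interval back down when a re-crawl does find a change, so estimates adapt in both directions.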


Common Mistakes

  • ❌ Ignoring robots.txt — ethical and legal requirement. Respect crawl delays and disallow rules.
  • ❌ No URL normalization — http://example.com/page, https://www.example.com/page/, and https://example.com/page?ref=twitter may be the same page. Normalize before dedup.
  • ❌ Crawling without rate limiting — hammering a server with 100 req/sec will get you blocked and potentially cause outages.
  • ❌ Storing all pages equally — prioritize indexing high-quality, frequently-changing pages. Use a freshness score.
  • ❌ No spider trap detection — without depth limits and pattern detection, a single trap site can consume all your crawler resources.

[!TIP] Key Takeaways:
• Core loop: URL frontier → fetch → parse → extract links → frontier.
• URL frontier: priority queue + politeness (per-domain rate limiting + robots.txt).
• Dedup: Bloom filter for URL dedup. SimHash for near-duplicate content detection.
• Spider trap protection: depth limits, URL length limits, pattern detection.
• Re-crawl strategy: use Last-Modified/ETag headers and change frequency tracking.
• Distribute by domain hash for politeness and DNS cache efficiency.
