[!NOTE] A web crawler is the backbone of every search engine. Google's crawler, Googlebot, discovers and indexes hundreds of billions of web pages. Designing a web crawler tests your understanding of distributed systems, queuing, deduplication, and politeness constraints. It is a classic system design interview question that goes beyond simple CRUD.
Step 1: Requirements
| Aspect | Requirement |
|---|---|
| Purpose | Crawl the web to build a search index |
| Scale | 1 billion pages per month (~400 pages/second) |
| Content types | HTML only (for simplicity; real crawlers handle PDFs, images, etc.) |
| Politeness | Respect robots.txt and rate-limit per domain |
| Deduplication | Avoid re-crawling identical content |
| Freshness | Re-crawl popular/changing pages more frequently |
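The ~400 pages/second figure in the table is just the monthly target divided by the seconds in a month; a quick sanity check (assuming a 30-day month):

```python
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600          # 2,592,000 seconds in a 30-day month
print(pages_per_month / seconds_per_month)  # ≈ 385.8, i.e. roughly 400 pages/second
```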
Step 2: High-Level Design (v1)
[Seed URLs] → [URL Frontier (Queue)]
                       │
                       ▼
               [Fetcher (HTTP GET)]
                       │
                       ▼
               [HTML Parser] → Extract links → [URL Frontier]
                       │
                       ▼
               [Content Store] → [Search Indexer]
This is the basic crawl loop: take a URL from the queue, fetch the page, parse it for links, add new links to the queue, store the content.
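To make the loop concrete, here is a minimal single-threaded sketch. It assumes the `requests` and `beautifulsoup4` libraries, dedups only with an in-memory set of URLs, and deliberately skips everything the rest of this chapter adds (politeness, prioritization, persistent storage).

```python
from collections import deque
from urllib.parse import urljoin

import requests                    # assumed HTTP client
from bs4 import BeautifulSoup      # assumed HTML parser

def crawl(seed_urls, max_pages=100):
    """Basic crawl loop: pop a URL, fetch, parse links, push new URLs, store content."""
    frontier = deque(seed_urls)          # the "URL frontier" (plain FIFO here)
    seen = set(seed_urls)                # naive URL dedup
    content_store = {}
    while frontier and len(content_store) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                     # skip unreachable pages
        content_store[url] = resp.text   # stand-in for the content store / indexer
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return content_store
```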
Step 3: URL Frontier (Priority Queue)
The URL frontier is not a simple FIFO queue. It must handle:
- Prioritization: Important pages (e.g., news sites, high-PageRank domains) should be crawled first and more frequently.
- Politeness: Don't hammer a single domain. Ensure a delay between requests to the same host.
URL Frontier Architecture:
Incoming URLs
      │
[Priority Queue]  (rank URLs by importance)
      │
[Politeness Router]  (group URLs by domain)
      │
┌─────┴──────────────────────┐
│ Domain Queue: nytimes.com  │ → [Fetcher Thread 1]
│ Domain Queue: github.com   │ → [Fetcher Thread 2]
│ Domain Queue: reddit.com   │ → [Fetcher Thread 3]
│ ...                        │
└────────────────────────────┘
Each domain queue has a rate limiter (e.g., 1 req/sec per domain)
Politeness is critical. Without it, your crawler will be blocked by websites, and you may be violating their terms of service. Always check robots.txt before crawling any page.
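A minimal sketch of a per-domain politeness gate, using Python's standard-library `urllib.robotparser` for robots.txt and a last-request timestamp per domain; the 1 req/sec default mirrors the rate limit mentioned above and is an illustrative choice.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessGate:
    """Per-domain robots.txt rules plus a minimum delay between requests to one host."""

    def __init__(self, user_agent="example-crawler", min_delay=1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay      # fallback when robots.txt sets no Crawl-delay
        self.robots = {}                # domain -> RobotFileParser
        self.last_request = {}          # domain -> monotonic timestamp of last fetch

    def allowed(self, url):
        domain = urlparse(url).netloc
        if domain not in self.robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            try:
                parser.read()           # download and parse robots.txt once per domain
            except OSError:
                pass                    # network error: fall through with an empty ruleset
            self.robots[domain] = parser
        return self.robots[domain].can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        """Block until it is polite to hit this domain again (call allowed() first)."""
        domain = urlparse(url).netloc
        parser = self.robots.get(domain)
        delay = (parser.crawl_delay(self.user_agent) if parser else None) or self.min_delay
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[domain] = time.monotonic()
```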
Step 4: Deduplication
The web has massive duplication. The same content appears at different URLs (www vs non-www, HTTP vs HTTPS, URL parameters). Two levels of dedup:
URL Deduplication
Before adding a URL to the frontier, check if it has already been seen. Use a Bloom filter (from Chapter 18) for space-efficient "seen" checks:
URL → hash → check Bloom filter
→ Probably seen? Skip.
→ Definitely not seen? Add to frontier.
With 1 billion URLs, a Bloom filter with 1% false positive rate needs only ~1.2 GB of memory.
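A toy Bloom filter to show the mechanics (a real deployment would use an optimized library or an off-heap bit array). The sizing rule of thumb behind the ~1.2 GB figure: about 9.6 bits per key and ~7 hash functions give a 1% false-positive rate.

```python
import hashlib

class BloomFilter:
    """Space-efficient 'probably seen' set: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k pseudo-independent bit positions by salting the hash input.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Frontier admission check (sized here for ~1M URLs at ~1% FPR, just for illustration):
seen_urls = BloomFilter(num_bits=10_000_000)
if not seen_urls.probably_seen("https://example.com/a"):
    seen_urls.add("https://example.com/a")   # definitely not seen: add to frontier
```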
Content Deduplication
Even with different URLs, pages may have identical content (mirrors, syndication). Use a content fingerprint:
- Compute a hash (e.g., SHA-256) of the page content after stripping boilerplate (headers, footers, ads).
- Compare against a set of seen content hashes.
- If duplicate → skip indexing. If unique → store and index.
For near-duplicate detection (pages that are 90% similar), use SimHash — a locality-sensitive hash where similar documents produce similar hashes. Google uses SimHash to detect near-duplicate web pages.
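A compact sketch of both fingerprints. The exact-duplicate check is just a SHA-256 over the boilerplate-stripped text; the SimHash below hashes whitespace-separated tokens with MD5, which is an illustrative simplification (production systems typically use shingles and weighted features).

```python
import hashlib

def exact_fingerprint(text: str) -> str:
    """Exact-duplicate check: identical cleaned text -> identical hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def simhash(text: str, bits: int = 64) -> int:
    """Near-duplicate check: similar texts -> fingerprints with small Hamming distance."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Pages whose SimHashes differ in only a few bits (e.g. <= 3) are treated as near-duplicates.
```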
Step 5: Distributed Architecture (v2)
                   ┌──────────────┐
                   │  Seed URLs   │
                   └──────┬───────┘
                          │
                   ┌──────▼───────┐
                   │ URL Frontier │
                   │ (Kafka/Redis)│
                   └──────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
   [Fetcher 1]       [Fetcher 2]       [Fetcher N]
 (datacenter A)     (datacenter B)     (datacenter C)
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                   ┌──────▼───────┐
                   │ HTML Parser  │
                   │  (workers)   │
                   └──────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
    [Content Store]  [URL Dedup]  [Link Extractor]
     (S3 / HDFS)    (Bloom Filter) (→ URL Frontier)
            │
    [Search Indexer]
Partitioning Strategy
Distribute URLs across fetcher workers by domain hash: all URLs for nytimes.com go to the same fetcher. This ensures politeness (one fetcher controls the rate per domain) and enables local DNS caching.
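A sketch of the routing function; any stable hash works as long as every node applies the same one (the worker count and URLs below are placeholders).

```python
import hashlib
from urllib.parse import urlparse

def fetcher_for(url: str, num_fetchers: int) -> int:
    """Route a URL by domain hash so exactly one fetcher owns each domain's rate limit."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_fetchers

# Both URLs land on the same worker, so one rate limiter and one DNS cache cover the domain.
assert fetcher_for("https://nytimes.com/a", 16) == fetcher_for("https://nytimes.com/b", 16)
```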
Step 6: Handling Edge Cases
| Problem | Solution |
|---|---|
| Spider traps (infinite loops of generated URLs) | Set a maximum URL depth limit (e.g., 15 levels). Detect and blacklist trap domains. |
| Dynamic content (JavaScript-rendered pages) | Use headless browsers (Puppeteer/Playwright) for critical pages, or rely on server-side rendering detection. |
| robots.txt changes | Cache robots.txt per domain with a TTL (e.g., 24 hours). Re-fetch periodically. |
| Large files (videos, binaries) | Set a maximum download size. Check Content-Type header before downloading body. |
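A sketch of the large-file guard from the last row, assuming the `requests` library; the 10 MB cap is illustrative. Streaming lets the crawler inspect headers and bail out before reading the whole body.

```python
import requests

MAX_BYTES = 10 * 1024 * 1024        # illustrative cap; tune per content policy

def fetch_html(url: str):
    """Return the body only for HTML responses under the size cap, else None."""
    with requests.get(url, stream=True, timeout=10) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return None                               # skip videos, binaries, etc.
        length = resp.headers.get("Content-Length")
        if length and int(length) > MAX_BYTES:
            return None                               # declared too large: don't download
        body = b""
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            body += chunk
            if len(body) > MAX_BYTES:                 # header was missing or wrong
                return None
        return body
```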
URL Frontier: Priority + Politeness
URL Frontier Architecture:
[Incoming URLs]
       ↓
 [Prioritizer] ← Assigns priority based on:
       |          • PageRank score
       |          • Domain authority
       |          • How recently the page changed
       |          • Content freshness (RSS, sitemap lastmod)
       ↓
┌─────────────────────────┐
│ Priority Queues         │
│ P0: Breaking news sites │ ← Crawl every 5 minutes
│ P1: Popular blogs       │ ← Crawl every hour
│ P2: Static docs         │ ← Crawl daily
│ P3: Low-traffic sites   │ ← Crawl weekly
└─────────┬───────────────┘
          ↓
 [Politeness Router]
          ↓
┌─────────────────────────┐
│ Per-Domain Queues       │ ← One queue per domain
│ example.com: [url1, ..] │   ensures we never send
│ blog.dev:    [url3, ..] │   concurrent requests
│ news.org:    [url5, ..] │   to the same domain
└─────────────────────────┘
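One way to implement the prioritizer is a heap keyed by tier. The scoring thresholds below are made up for illustration, and the inputs (PageRank, observed change rate) are assumed to come from earlier crawl cycles.

```python
import heapq

class Prioritizer:
    """Assign each URL a tier (P0 highest) and always hand out the highest-priority URL."""

    def __init__(self):
        self.heap = []      # entries: (tier, insertion_order, url)
        self.counter = 0    # tie-breaker so equal-tier URLs stay FIFO

    @staticmethod
    def tier(page_rank: float, changes_per_day: float) -> int:
        if changes_per_day >= 10:   return 0   # e.g. breaking-news pages
        if page_rank >= 0.7:        return 1   # popular, authoritative pages
        if changes_per_day >= 0.1:  return 2   # occasionally updated docs
        return 3                               # low-traffic, rarely changing sites

    def push(self, url: str, page_rank: float, changes_per_day: float) -> None:
        heapq.heappush(self.heap, (self.tier(page_rank, changes_per_day), self.counter, url))
        self.counter += 1

    def pop(self) -> str:
        return heapq.heappop(self.heap)[2]
```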
Spider Trap Detection
Spider traps are URLs that generate infinite pages (e.g., calendar with infinite next/prev links). Strategies to handle them:
| Strategy | How It Works | Example |
|---|---|---|
| Depth limit | Stop crawling after N hops from seed | Max depth = 15 from root domain |
| URL length limit | Reject URLs longer than N characters | Max 200 chars (traps often have very long URLs) |
| Pattern detection | Detect repeating URL segments | /a/b/a/b/a/b/ → trap |
| Domain page cap | Limit pages crawled per domain | Max 10K pages per domain per crawl cycle |
| Manual blacklist | Operators flag known trap domains | Infinite calendar, generated query params |
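These heuristics are cheap enough to apply in the frontier's admission path; a sketch using the same illustrative thresholds as the table:

```python
from urllib.parse import urlparse

MAX_DEPTH = 15            # hops from the seed
MAX_URL_LENGTH = 200      # characters
MAX_PAGES_PER_DOMAIN = 10_000

def looks_like_trap(url: str, depth: int, pages_crawled_per_domain: dict) -> bool:
    """Return True if the URL trips any spider-trap heuristic from the table above."""
    if depth > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Repeating pattern such as /a/b/a/b/a/b/ suggests a generated loop.
    if len(segments) >= 6 and segments[-2:] == segments[-4:-2] == segments[-6:-4]:
        return True
    domain = urlparse(url).netloc
    return pages_crawled_per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN
```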
Crawl Freshness: When to Re-Crawl
Re-crawl strategy based on page change frequency:
- Check the HTTP "Last-Modified" or "ETag" headers → if unchanged, skip the download (saves bandwidth).
- Track change history per page:
  - Changed 10 times in 30 days → re-crawl daily
  - Changed 1 time in 90 days → re-crawl monthly
  - Never changed → re-crawl quarterly
- Use sitemaps: sites publish sitemap.xml with <lastmod> timestamps → only crawl changed pages.
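Conditional GETs implement the first point directly: send the validators saved from the last crawl and let the server answer `304 Not Modified` when nothing changed. A sketch assuming the `requests` library, with `cached` standing in for whatever per-URL metadata store the crawler keeps:

```python
import requests

def fetch_if_changed(url: str, cached: dict):
    """Re-download only if the server says the page changed since the last crawl."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None                                   # unchanged: skip parsing and indexing

    # Page changed (or server ignores validators): refresh the stored validators.
    cached["etag"] = resp.headers.get("ETag")
    cached["last_modified"] = resp.headers.get("Last-Modified")
    return resp.text
```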
Common Mistakes
- ❌ Ignoring robots.txt — an ethical and, in some cases, legal requirement. Respect crawl delays and disallow rules.
- ❌ No URL normalization — http://example.com/page, https://www.example.com/page/, and https://example.com/page?ref=twitter may be the same page. Normalize before dedup (see the sketch after this list).
- ❌ Crawling without rate limiting — hammering a server with 100 req/sec will get you blocked and potentially cause outages.
- ❌ Storing all pages equally — prioritize indexing high-quality, frequently-changing pages. Use a freshness score.
- ❌ No spider trap detection — without depth limits and pattern detection, a single trap site can consume all your crawler resources.
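A sketch of URL normalization for the dedup key, as referenced in the normalization bullet above. The rules here — lowercase host, drop `www.`, strip the trailing slash, remove tracking parameters, sort the query string, and collapse to https — are common choices rather than a universal standard, and the list of "tracking" parameters is an assumption.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}   # illustrative list

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different variants map to one dedup key."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # Collapsing http to https is a policy choice; only safe if the site serves both identically.
    return urlunparse(("https", host, path, "", query, ""))

# All three examples from the bullet above normalize to the same key.
for u in ("http://example.com/page",
          "https://www.example.com/page/",
          "https://example.com/page?ref=twitter"):
    assert normalize(u) == "https://example.com/page"
```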
[!TIP] Key Takeaways:
• Core loop: URL frontier → fetch → parse → extract links → frontier.
• URL frontier: priority queue + politeness (per-domain rate limiting + robots.txt).
• Dedup: Bloom filter for URL dedup. SimHash for near-duplicate content detection.
• Spider trap protection: depth limits, URL length limits, pattern detection.
• Re-crawl strategy: use Last-Modified/ETag headers and change frequency tracking.
• Distribute by domain hash for politeness and DNS cache efficiency.