[!NOTE] A web crawler is the backbone of every search engine. Google's crawler, Googlebot, discovers and indexes hundreds of billions of web pages. Designing a web crawler tests your understanding of distributed systems, queuing, deduplication, and politeness constraints. It is a classic system design interview question that goes beyond simple CRUD.
Step 1: Requirements
| Aspect | Requirement |
|---|---|
| Purpose | Crawl the web to build a search index |
| Scale | 1 billion pages per month (~400 pages/second) |
| Content types | HTML only (for simplicity; real crawlers handle PDFs, images, etc.) |
| Politeness | Respect robots.txt and rate-limit per domain |
| Deduplication | Avoid re-crawling identical content |
| Freshness | Re-crawl popular/changing pages more frequently |
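The ~400 pages/second figure in the table is just the monthly target divided by the seconds in a month; a quick sanity check (assuming a 30-day month):

```python
pages_per_month = 1_000_000_000
seconds_per_month = 30 * 24 * 3600          # 2,592,000 seconds in a 30-day month
print(pages_per_month / seconds_per_month)  # ≈ 385.8, i.e. roughly 400 pages/second
```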
Step 2: High-Level Design (v1)
[Seed URLs] → [URL Frontier (Queue)]
                       │
                       ▼
               [Fetcher (HTTP GET)]
                       │
                       ▼
               [HTML Parser] → Extract links → [URL Frontier]
                       │
                       ▼
               [Content Store] → [Search Indexer]
This is the basic crawl loop: take a URL from the queue, fetch the page, parse it for links, add new links to the queue, store the content.
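To make the loop concrete, here is a minimal single-threaded sketch. It assumes the `requests` and `beautifulsoup4` libraries, dedups only with an in-memory set of URLs, and deliberately skips everything the rest of this chapter adds (politeness, prioritization, persistent storage).

```python
from collections import deque
from urllib.parse import urljoin

import requests                    # assumed HTTP client
from bs4 import BeautifulSoup      # assumed HTML parser

def crawl(seed_urls, max_pages=100):
    """Basic crawl loop: pop a URL, fetch, parse links, push new URLs, store content."""
    frontier = deque(seed_urls)          # the "URL frontier" (plain FIFO here)
    seen = set(seed_urls)                # naive URL dedup
    content_store = {}
    while frontier and len(content_store) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                     # skip unreachable pages
        content_store[url] = resp.text   # stand-in for the content store / indexer
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return content_store
```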
Step 3: URL Frontier (Priority Queue)
The URL frontier is not a simple FIFO queue. It must handle:
- Prioritization: Important pages (e.g., news sites, high-PageRank domains) should be crawled first and more frequently.
- Politeness: Don't hammer a single domain. Ensure a delay between requests to the same host.
URL Frontier Architecture:
Incoming URLs
      │
[Priority Queue]  (rank URLs by importance)
      │
[Politeness Router]  (group URLs by domain)
      │
┌─────┴──────────────────────┐
│ Domain Queue: nytimes.com  │ → [Fetcher Thread 1]
│ Domain Queue: github.com   │ → [Fetcher Thread 2]
│ Domain Queue: reddit.com   │ → [Fetcher Thread 3]
│ ...                        │
└────────────────────────────┘
Each domain queue has a rate limiter (e.g., 1 req/sec per domain)
Politeness is critical. Without it, your crawler will be blocked by websites, and you may be violating their terms of service. Always check robots.txt before crawling any page.
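A minimal sketch of a per-domain politeness gate, using Python's standard-library `urllib.robotparser` for robots.txt and a last-request timestamp per domain; the 1 req/sec default mirrors the rate limit mentioned above and is an illustrative choice.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessGate:
    """Per-domain robots.txt rules plus a minimum delay between requests to one host."""

    def __init__(self, user_agent="example-crawler", min_delay=1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay      # fallback when robots.txt sets no Crawl-delay
        self.robots = {}                # domain -> RobotFileParser
        self.last_request = {}          # domain -> monotonic timestamp of last fetch

    def allowed(self, url):
        domain = urlparse(url).netloc
        if domain not in self.robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{domain}/robots.txt")
            try:
                parser.read()           # download and parse robots.txt once per domain
            except OSError:
                pass                    # network error: fall through with an empty ruleset
            self.robots[domain] = parser
        return self.robots[domain].can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        """Block until it is polite to hit this domain again (call allowed() first)."""
        domain = urlparse(url).netloc
        parser = self.robots.get(domain)
        delay = (parser.crawl_delay(self.user_agent) if parser else None) or self.min_delay
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request[domain] = time.monotonic()
```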
Step 4: Deduplication
The web has massive duplication. The same content appears at different URLs (www vs non-www, HTTP vs HTTPS, URL parameters). Two levels of dedup:
URL Deduplication
Before adding a URL to the frontier, check if it has already been seen. Use a Bloom filter (from Chapter 18) for space-efficient "seen" checks:
URL → hash → check Bloom filter
→ Probably seen? Skip.
→ Definitely not seen? Add to frontier.
With 1 billion URLs, a Bloom filter with 1% false positive rate needs only ~1.2 GB of memory.
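A toy Bloom filter to show the mechanics (a real deployment would use an optimized library or an off-heap bit array). The sizing rule of thumb behind the ~1.2 GB figure: about 9.6 bits per key and ~7 hash functions give a 1% false-positive rate.

```python
import hashlib

class BloomFilter:
    """Space-efficient 'probably seen' set: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k pseudo-independent bit positions by salting the hash input.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Frontier admission check (sized here for ~1M URLs at ~1% FPR, just for illustration):
seen_urls = BloomFilter(num_bits=10_000_000)
if not seen_urls.probably_seen("https://example.com/a"):
    seen_urls.add("https://example.com/a")   # definitely not seen: add to frontier
```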
Content Deduplication
Even with different URLs, pages may have identical content (mirrors, syndication). Use a content fingerprint:
- Compute a hash (e.g., SHA-256) of the page content after stripping boilerplate (headers, footers, ads).
- Compare against a set of seen content hashes.
- If duplicate → skip indexing. If unique → store and index.
For near-duplicate detection (pages that are 90% similar), use SimHash — a locality-sensitive hash where similar documents produce similar hashes. Google uses SimHash to detect near-duplicate web pages.
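A compact sketch of both fingerprints. The exact-duplicate check is just a SHA-256 over the boilerplate-stripped text; the SimHash below hashes whitespace-separated tokens with MD5, which is an illustrative simplification (production systems typically use shingles and weighted features).

```python
import hashlib

def exact_fingerprint(text: str) -> str:
    """Exact-duplicate check: identical cleaned text -> identical hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def simhash(text: str, bits: int = 64) -> int:
    """Near-duplicate check: similar texts -> fingerprints with small Hamming distance."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Pages whose SimHashes differ in only a few bits (e.g. <= 3) are treated as near-duplicates.
```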
Step 5: Distributed Architecture (v2)
                   ┌──────────────┐
                   │  Seed URLs   │
                   └──────┬───────┘
                          │
                   ┌──────▼───────┐
                   │ URL Frontier │
                   │ (Kafka/Redis)│
                   └──────┬───────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
   [Fetcher 1]       [Fetcher 2]       [Fetcher N]
 (datacenter A)     (datacenter B)     (datacenter C)
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                   ┌──────▼───────┐
                   │ HTML Parser  │
                   │  (workers)   │
                   └──────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
    [Content Store]  [URL Dedup]  [Link Extractor]
     (S3 / HDFS)    (Bloom Filter) (→ URL Frontier)
            │
    [Search Indexer]
Partitioning Strategy
Distribute URLs across fetcher workers by domain hash: all URLs for nytimes.com go to the same fetcher. This ensures politeness (one fetcher controls the rate per domain) and enables local DNS caching.
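A sketch of the routing function; any stable hash works as long as every node applies the same one (the worker count and URLs below are placeholders).

```python
import hashlib
from urllib.parse import urlparse

def fetcher_for(url: str, num_fetchers: int) -> int:
    """Route a URL by domain hash so exactly one fetcher owns each domain's rate limit."""
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_fetchers

# Both URLs land on the same worker, so one rate limiter and one DNS cache cover the domain.
assert fetcher_for("https://nytimes.com/a", 16) == fetcher_for("https://nytimes.com/b", 16)
```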
Step 6: Handling Edge Cases
| Problem | Solution |
|---|---|
| Spider traps (infinite loops of generated URLs) | Set a maximum URL depth limit (e.g., 15 levels). Detect and blacklist trap domains. |
| Dynamic content (JavaScript-rendered pages) | Use headless browsers (Puppeteer/Playwright) for critical pages, or rely on server-side rendering detection. |
| robots.txt changes | Cache robots.txt per domain with a TTL (e.g., 24 hours). Re-fetch periodically. |
| Large files (videos, binaries) | Set a maximum download size. Check Content-Type header before downloading body. |
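A sketch of the large-file guard from the last row, assuming the `requests` library; the 10 MB cap is illustrative. Streaming lets the crawler inspect headers and bail out before reading the whole body.

```python
import requests

MAX_BYTES = 10 * 1024 * 1024        # illustrative cap; tune per content policy

def fetch_html(url: str):
    """Return the body only for HTML responses under the size cap, else None."""
    with requests.get(url, stream=True, timeout=10) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return None                               # skip videos, binaries, etc.
        length = resp.headers.get("Content-Length")
        if length and int(length) > MAX_BYTES:
            return None                               # declared too large: don't download
        body = b""
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            body += chunk
            if len(body) > MAX_BYTES:                 # header was missing or wrong
                return None
        return body
```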
URL Frontier: Priority + Politeness
URL Frontier Architecture:
[Incoming URLs]
       ↓
 [Prioritizer] ← Assigns priority based on:
       |          • PageRank score
       |          • Domain authority
       |          • How recently the page changed
       |          • Content freshness (RSS, sitemap lastmod)
       ↓
┌─────────────────────────┐
│ Priority Queues         │
│ P0: Breaking news sites │ ← Crawl every 5 minutes
│ P1: Popular blogs       │ ← Crawl every hour
│ P2: Static docs         │ ← Crawl daily
│ P3: Low-traffic sites   │ ← Crawl weekly
└─────────┬───────────────┘
          ↓
 [Politeness Router]
          ↓
┌─────────────────────────┐
│ Per-Domain Queues       │ ← One queue per domain
│ example.com: [url1, ..] │   ensures we never send
│ blog.dev:    [url3, ..] │   concurrent requests
│ news.org:    [url5, ..] │   to the same domain
└─────────────────────────┘
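One way to implement the prioritizer is a heap keyed by tier. The scoring thresholds below are made up for illustration, and the inputs (PageRank, observed change rate) are assumed to come from earlier crawl cycles.

```python
import heapq

class Prioritizer:
    """Assign each URL a tier (P0 highest) and always hand out the highest-priority URL."""

    def __init__(self):
        self.heap = []      # entries: (tier, insertion_order, url)
        self.counter = 0    # tie-breaker so equal-tier URLs stay FIFO

    @staticmethod
    def tier(page_rank: float, changes_per_day: float) -> int:
        if changes_per_day >= 10:   return 0   # e.g. breaking-news pages
        if page_rank >= 0.7:        return 1   # popular, authoritative pages
        if changes_per_day >= 0.1:  return 2   # occasionally updated docs
        return 3                               # low-traffic, rarely changing sites

    def push(self, url: str, page_rank: float, changes_per_day: float) -> None:
        heapq.heappush(self.heap, (self.tier(page_rank, changes_per_day), self.counter, url))
        self.counter += 1

    def pop(self) -> str:
        return heapq.heappop(self.heap)[2]
```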
Spider Trap Detection
Spider traps are URLs that generate infinite pages (e.g., calendar with infinite next/prev links). Strategies to handle them:
| Strategy | How It Works | Example |
|---|---|---|
| Depth limit | Stop crawling after N hops from seed | Max depth = 15 from root domain |
| URL length limit | Reject URLs longer than N characters | Max 200 chars (traps often have very long URLs) |
| Pattern detection | Detect repeating URL segments | /a/b/a/b/a/b/ → trap |
| Domain page cap | Limit pages crawled per domain | Max 10K pages per domain per crawl cycle |
| Manual blacklist | Operators flag known trap domains | Infinite calendar, generated query params |
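These heuristics are cheap enough to apply in the frontier's admission path; a sketch using the same illustrative thresholds as the table:

```python
from urllib.parse import urlparse

MAX_DEPTH = 15            # hops from the seed
MAX_URL_LENGTH = 200      # characters
MAX_PAGES_PER_DOMAIN = 10_000

def looks_like_trap(url: str, depth: int, pages_crawled_per_domain: dict) -> bool:
    """Return True if the URL trips any spider-trap heuristic from the table above."""
    if depth > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Repeating pattern such as /a/b/a/b/a/b/ suggests a generated loop.
    if len(segments) >= 6 and segments[-2:] == segments[-4:-2] == segments[-6:-4]:
        return True
    domain = urlparse(url).netloc
    return pages_crawled_per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN
```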
Crawl Freshness: When to Re-Crawl
Re-crawl strategy based on page change frequency:
- Check the HTTP "Last-Modified" or "ETag" headers → if unchanged, skip the download (saves bandwidth).
- Track change history per page:
  - Changed 10 times in 30 days → re-crawl daily
  - Changed 1 time in 90 days → re-crawl monthly
  - Never changed → re-crawl quarterly
- Use sitemaps: sites publish sitemap.xml with <lastmod> timestamps → only crawl changed pages.
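Conditional GETs implement the first point directly: send the validators saved from the last crawl and let the server answer `304 Not Modified` when nothing changed. A sketch assuming the `requests` library, with `cached` standing in for whatever per-URL metadata store the crawler keeps:

```python
import requests

def fetch_if_changed(url: str, cached: dict):
    """Re-download only if the server says the page changed since the last crawl."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return None                                   # unchanged: skip parsing and indexing

    # Page changed (or server ignores validators): refresh the stored validators.
    cached["etag"] = resp.headers.get("ETag")
    cached["last_modified"] = resp.headers.get("Last-Modified")
    return resp.text
```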
Common Mistakes
- ❌ Ignoring robots.txt — an ethical and, in some cases, legal requirement. Respect crawl delays and disallow rules.
- ❌ No URL normalization — http://example.com/page, https://www.example.com/page/, and https://example.com/page?ref=twitter may be the same page. Normalize before dedup (see the sketch after this list).
- ❌ Crawling without rate limiting — hammering a server with 100 req/sec will get you blocked and potentially cause outages.
- ❌ Storing all pages equally — prioritize indexing high-quality, frequently-changing pages. Use a freshness score.
- ❌ No spider trap detection — without depth limits and pattern detection, a single trap site can consume all your crawler resources.
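A sketch of URL normalization for the dedup key, as referenced in the normalization bullet above. The rules here — lowercase host, drop `www.`, strip the trailing slash, remove tracking parameters, sort the query string, and collapse to https — are common choices rather than a universal standard, and the list of "tracking" parameters is an assumption.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}   # illustrative list

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different variants map to one dedup key."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    # Collapsing http to https is a policy choice; only safe if the site serves both identically.
    return urlunparse(("https", host, path, "", query, ""))

# All three examples from the bullet above normalize to the same key.
for u in ("http://example.com/page",
          "https://www.example.com/page/",
          "https://example.com/page?ref=twitter"):
    assert normalize(u) == "https://example.com/page"
```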
[!TIP] Key Takeaways:
• Core loop: URL frontier → fetch → parse → extract links → frontier.
• URL frontier: priority queue + politeness (per-domain rate limiting + robots.txt).
• Dedup: Bloom filter for URL dedup. SimHash for near-duplicate content detection.
• Spider trap protection: depth limits, URL length limits, pattern detection.
• Re-crawl strategy: use Last-Modified/ETag headers and change frequency tracking.
• Distribute by domain hash for politeness and DNS cache efficiency.