Web Crawler
URL frontier, politeness, deduplication, content extraction, crawl traps, index updates.
Step 1 — Clarify Requirements#
Functional
- Given a set of seed URLs, discover and download web pages reachable from them.
- Respect
robots.txt, host-level politeness (don’t hammer one host), and crawl-delay directives. - Extract content (text, links, structured data) from HTML.
- Deduplicate URLs and detect near-duplicate content.
- Re-crawl pages periodically; prioritize freshness based on per-page change frequency.
- Out of scope: the search ranking that follows.
Non-functional
- 99.9% availability (crawler downtime is recoverable; bounded backlog).
- Throughput target: download ~1 B pages per day. (Production at Google or Bing is higher; 1 B/day fits the design we want to teach.)
- Storage: petabytes of crawled HTML plus extracted content.
- Politeness is non-negotiable — getting banned by a host kills future coverage.
Step 2 — Capacity Estimation#
- Pages crawled per day: 1 B → ~12 K pages/sec average, 30 K/sec peak.
- Per-page size: HTML averages 200 KB compressed; plus images / assets we ignore. Net storage: 1 B × 200 KB = 200 TB/day raw HTML.
- Indexed content (text-only after stripping): ~30 KB/page → 30 TB/day.
- URL frontier size: distinct URLs known to exist ~1 T (1 trillion). Stored as a sharded queue / KV.
- DNS lookups: ~10 K/sec, mostly absorbed by a local DNS cache hitting upstream resolvers.
- Hosts crawled concurrently: with 1 RPS per host politeness, 30 K pages/sec means 30 K distinct active hosts per second. Total hosts ever crawled: 100 M+; active at any moment: a few million.
The two design pressures: don’t lose URLs from the frontier (durability) and don’t hammer any single host (politeness).
Step 3 — System Interface#
The crawler doesn’t expose a public API — it’s a pipeline. Internal interfaces:
// URL frontierenqueue(url, priority, source)dequeue() → urlregister_robots(host, rules)register_dns(host, ip)
// Fetcherfetch(url) → { status, headers, body, fetched_at }
// Content extractorextract(html) → { text, links: [url], canonical, language }
// Dedup / sketch storefingerprint(content) → simhashseen(simhash) → boolStep 4 — High-Level Design#
seed URLs │ ▼ ┌───────────────────────────────┐ │ URL frontier (sharded by │ │ host hash; per-host queue) │ └───────────────┬───────────────┘ │ ▼ URL dispatcher ── obeys per-host politeness; emits ready URLs │ ▼ ┌────────────────┐ │ Fetcher fleet │── DNS cache, robots cache, conditional GETs └────────┬───────┘ │ ▼ Content store (S3-class, partitioned by host + crawl_date) │ ▼ ┌────────────────┐ │ Extractor pool │── parse HTML, extract links + text + metadata └────────┬───────┘ │ ┌────┴─────┐ │ │ new URLs back text + links → indexer (out of scope) into frontierTwo loops:
- Discovery: pages → links → new URLs → frontier.
- Re-crawl: scheduled by per-URL freshness policy; the same URL re-enters the frontier periodically.
Step 5 — Data Model#
URL frontier (sharded by host hash, per-host FIFO queue):
key: frontier:host:{host_hash}:queue → list of urlskey: frontier:host:{host_hash}:last_fetch_ts → timestampkey: frontier:host:{host_hash}:rps_limit → from robots.txt or defaultThe dispatcher only emits a URL from a host queue if now > last_fetch + 1/rps_limit.
Seen-URL index (Bloom filter + KV for false-positive resolution):
url → simhash(url) // canonical form firstseen bloom (50 GB for 1 T URLs at 0.1% FP)on bloom hit: check KV for definitive yes/noContent store (object store, content-addressed):
key: content/<host>/<url_hash>/<crawl_date>.gz → raw HTMLmetadata: { fetched_at, etag, last_modified, simhash }Per-URL state:
table url_state url_hash bytes PK first_seen timestamp last_crawled timestamp next_due_at timestamp // freshness scheduler change_rate float // estimated; tunes recrawl interval status enum(active, dead, blocked_by_robots)Step 6 — Detailed Design#
URL frontier and politeness#
The frontier is a sharded queue with per-host sub-queues. Why per-host? Because politeness is host-scoped — we cap RPS per host (typically 1 RPS or whatever robots.txt says), and queuing per-host makes it natural to dispatch one URL at a time per host.
dispatcher loop: for each host_shard in my_assignment: for each host in host_shard: if now < last_fetch[host] + cooldown[host]: continue url = pop(frontier:host[host]) if url is None: continue emit(url) → fetcher last_fetch[host] = nowA host with millions of URLs is the slowest to crawl. We can’t speed it up without violating politeness (or having a private agreement with the host operator). Some hosts publish Crawl-Delay longer than 1 s; we obey.
Fetcher#
fetch(url): resolve DNS (cached) fetch robots.txt if stale (per-host cache, 24h TTL) if disallowed: skip, mark url_state.status = blocked_by_robots open HTTPS connection (pooled by host) issue GET with conditional headers: If-None-Match (ETag), If-Modified-Since on 304: bump last_crawled, no content store write on 200: stream body, gzip, write to content storeConditional GETs (If-None-Match) save 30-50% of bandwidth on re-crawls. A host that respects ETags lets us “re-crawl” cheaply.
Deduplication#
Three levels:
- URL normalization: lowercase host, strip default ports, sort query parameters, collapse trailing slashes. Two URLs with the same canonical form are the same URL.
- URL deduplication: a Bloom filter checks “have we seen this URL?”. The frontier doesn’t re-enqueue Bloom-hits.
- Content deduplication: simhash (or minhash) of the page content. Two pages with simhash distance ≤ K (say, 3) are near-duplicates — we keep one canonical copy.
Content dedup matters because mirror sites, syndication, and CMSs produce billions of byte-different but semantically-identical pages.
Crawl traps#
Some URL patterns are infinite (?session=... calendars going year by year forever, ?page=N listings without bound). Detection:
- Bounded depth: don’t follow links beyond depth 30 from any seed.
- Per-host page cap: at most X distinct URLs per host (configurable, e.g., 10 M).
- Heuristic detection: if 95% of URLs from a host are near-duplicates by content, deprioritize the host’s depth.
Freshness scheduling#
How often to re-crawl a URL?
next_due_at = last_crawled + f(change_rate, importance)
change_rate is estimated from recent re-crawls: if last 3 re-crawls returned 304s, double the interval (max 30 days) if last 3 re-crawls returned new content, halve the interval (min 1 hour)importance is from inbound link count, page rank, etc.A news site’s homepage might be re-crawled every 10 minutes; a small Wikipedia page once per month. The freshness scheduler is itself a system (see /system-design/distributed-task-scheduler) emitting URLs back into the frontier.
robots.txt and ethics#
- Cache robots.txt per-host with 24 h TTL.
- Respect
Crawl-Delay,Disallow,Allow. - Respect
noindexmeta tags andX-Robots-Tagheaders when ingesting. - Identify with a clear
User-Agentand a contact URL — so host operators can email if our crawler misbehaves.
Distributed coordination#
The crawler runs on hundreds of machines. URL assignment uses consistent hashing by host: each host’s frontier is owned by exactly one shard. When a fetcher discovers a new link to a host owned by a different shard, it enqueues remotely (RPC) to that shard.
This keeps politeness state local (the one shard that owns a host is the only one fetching from it).
Step 7 — Evaluation & Trade-offs#
Bottleneck #1: per-host throughput. 1 RPS per host × 10 K active hosts = 10 K pages/sec. To hit 30 K pages/sec, we need 30 K active hosts simultaneously — which means breadth of host coverage matters as much as depth. A crawler stuck on a few mega-hosts (Wikipedia, GitHub) can’t hit throughput targets.
Bottleneck #2: the seen-URL index. A Bloom filter for 1 trillion URLs at 0.1% false-positive rate needs ~14 bits per item = ~1.7 TB of RAM, sharded. Workable, but a real cost. Tiered Bloom (a small in-memory recent layer + a larger on-disk historical layer) is the production shape.
Bottleneck #3: content extraction CPU. Parsing 30 K pages/sec through an HTML parser, JavaScript renderer (for SPAs), and link extractor costs significant CPU. For JS-heavy sites, headless browser rendering (Puppeteer, Chrome DevTools Protocol) is 100× more expensive than parsing static HTML. Most crawlers reserve JS rendering for a tagged subset of high-value sites.
Alternative I’d push back on: a single global queue ordered by priority. Every dispatcher would contend for the same queue; per-host politeness can’t be enforced cleanly. Per-host sub-queues with consistent-hash assignment is the right structure.
What breaks first at 10× scale (10 B pages/day): the link graph. Storing every (source, target) pair for 10× pages is 10× the storage; updates to PageRank-class metrics become harder to keep current. Solution: compute aggregates rather than store the raw graph; tier the link table by inbound count (the top 100 M nodes get full storage, the long tail gets sampled).
Companies this resembles#
Googlebot, Bingbot, Common Crawl (open dataset), Internet Archive’s Heritrix. Inside specific companies, smaller crawlers exist for product price comparison, news monitoring, security scanning.
Related systems#
- Distributed Search — what consumes the crawler’s output.
- Distributed Task Scheduler — the freshness scheduler.
- Blob Store — content storage substrate.
- Rate Limiter — same primitive used for politeness.