Web Crawler — System Design · Engineering Playbook

Step 1 — Clarify Requirements#

Functional

Given a set of seed URLs, discover and download web pages reachable from them.
Respect robots.txt, host-level politeness (don’t hammer one host), and crawl-delay directives.
Extract content (text, links, structured data) from HTML.
Deduplicate URLs and detect near-duplicate content.
Re-crawl pages periodically; prioritize freshness based on per-page change frequency.
Out of scope: the search ranking that follows.

Non-functional

99.9% availability (crawler downtime is recoverable; bounded backlog).
Throughput target: download ~1 B pages per day. (Production at Google or Bing is higher; 1 B/day fits the design we want to teach.)
Storage: petabytes of crawled HTML plus extracted content.
Politeness is non-negotiable — getting banned by a host kills future coverage.

Step 2 — Capacity Estimation#

Pages crawled per day: 1 B → ~12 K pages/sec average, 30 K/sec peak.
Per-page size: HTML averages 200 KB compressed; plus images / assets we ignore. Net storage: 1 B × 200 KB = 200 TB/day raw HTML.
Indexed content (text-only after stripping): ~30 KB/page → 30 TB/day.
URL frontier size: distinct URLs known to exist ~1 T (1 trillion). Stored as a sharded queue / KV.
DNS lookups: ~10 K/sec, mostly absorbed by a local DNS cache hitting upstream resolvers.
Hosts crawled concurrently: with 1 RPS per host politeness, 30 K pages/sec means 30 K distinct active hosts per second. Total hosts ever crawled: 100 M+; active at any moment: a few million.

The two design pressures: don’t lose URLs from the frontier (durability) and don’t hammer any single host (politeness).

Step 3 — System Interface#

The crawler doesn’t expose a public API — it’s a pipeline. Internal interfaces:

// URL frontier
enqueue(url, priority, source)
dequeue() → url
register_robots(host, rules)
register_dns(host, ip)

// Fetcher
fetch(url) → { status, headers, body, fetched_at }

// Content extractor
extract(html) → { text, links: [url], canonical, language }

// Dedup / sketch store
fingerprint(content) → simhash
seen(simhash) → bool

Step 4 — High-Level Design#

            seed URLs
                │
                ▼
  ┌───────────────────────────────┐
  │   URL frontier (sharded by    │
  │   host hash; per-host queue)  │
  └───────────────┬───────────────┘
                  │
                  ▼
        URL dispatcher ── obeys per-host politeness; emits ready URLs
                  │
                  ▼
        ┌────────────────┐
        │ Fetcher fleet  │── DNS cache, robots cache, conditional GETs
        └────────┬───────┘
                 │
                 ▼
        Content store (S3-class, partitioned by host + crawl_date)
                 │
                 ▼
        ┌────────────────┐
        │ Extractor pool │── parse HTML, extract links + text + metadata
        └────────┬───────┘
                 │
            ┌────┴─────┐
            │          │
   new URLs back   text + links → indexer (out of scope)
   into frontier

Two loops:

Discovery: pages → links → new URLs → frontier.
Re-crawl: scheduled by per-URL freshness policy; the same URL re-enters the frontier periodically.

Step 5 — Data Model#

URL frontier (sharded by host hash, per-host FIFO queue):

key: frontier:host:{host_hash}:queue → list of urls
key: frontier:host:{host_hash}:last_fetch_ts → timestamp
key: frontier:host:{host_hash}:rps_limit  → from robots.txt or default

The dispatcher only emits a URL from a host queue if now > last_fetch + 1/rps_limit.

Seen-URL index (Bloom filter + KV for false-positive resolution):

url → simhash(url)   // canonical form first
seen bloom (50 GB for 1 T URLs at 0.1% FP)
on bloom hit: check KV for definitive yes/no

Content store (object store, content-addressed):

key: content/<host>/<url_hash>/<crawl_date>.gz → raw HTML
metadata: { fetched_at, etag, last_modified, simhash }

Per-URL state:

table url_state
  url_hash      bytes    PK
  first_seen    timestamp
  last_crawled  timestamp
  next_due_at   timestamp    // freshness scheduler
  change_rate   float        // estimated; tunes recrawl interval
  status        enum(active, dead, blocked_by_robots)

Step 6 — Detailed Design#

URL frontier and politeness#

The frontier is a sharded queue with per-host sub-queues. Why per-host? Because politeness is host-scoped — we cap RPS per host (typically 1 RPS or whatever robots.txt says), and queuing per-host makes it natural to dispatch one URL at a time per host.

dispatcher loop:
   for each host_shard in my_assignment:
      for each host in host_shard:
         if now < last_fetch[host] + cooldown[host]: continue
         url = pop(frontier:host[host])
         if url is None: continue
         emit(url) → fetcher
         last_fetch[host] = now

A host with millions of URLs is the slowest to crawl. We can’t speed it up without violating politeness (or having a private agreement with the host operator). Some hosts publish Crawl-Delay longer than 1 s; we obey.

Fetcher#

fetch(url):
   resolve DNS (cached)
   fetch robots.txt if stale (per-host cache, 24h TTL)
   if disallowed: skip, mark url_state.status = blocked_by_robots
   open HTTPS connection (pooled by host)
   issue GET with conditional headers: If-None-Match (ETag), If-Modified-Since
   on 304: bump last_crawled, no content store write
   on 200: stream body, gzip, write to content store

Conditional GETs (If-None-Match) save 30-50% of bandwidth on re-crawls. A host that respects ETags lets us “re-crawl” cheaply.

Deduplication#

Three levels:

URL normalization: lowercase host, strip default ports, sort query parameters, collapse trailing slashes. Two URLs with the same canonical form are the same URL.
URL deduplication: a Bloom filter checks “have we seen this URL?”. The frontier doesn’t re-enqueue Bloom-hits.
Content deduplication: simhash (or minhash) of the page content. Two pages with simhash distance ≤ K (say, 3) are near-duplicates — we keep one canonical copy.

Content dedup matters because mirror sites, syndication, and CMSs produce billions of byte-different but semantically-identical pages.

Crawl traps#

Some URL patterns are infinite (?session=... calendars going year by year forever, ?page=N listings without bound). Detection:

Bounded depth: don’t follow links beyond depth 30 from any seed.
Per-host page cap: at most X distinct URLs per host (configurable, e.g., 10 M).
Heuristic detection: if 95% of URLs from a host are near-duplicates by content, deprioritize the host’s depth.

Freshness scheduling#

How often to re-crawl a URL?

next_due_at = last_crawled + f(change_rate, importance)

change_rate is estimated from recent re-crawls:
  if last 3 re-crawls returned 304s, double the interval (max 30 days)
  if last 3 re-crawls returned new content, halve the interval (min 1 hour)
importance is from inbound link count, page rank, etc.

A news site’s homepage might be re-crawled every 10 minutes; a small Wikipedia page once per month. The freshness scheduler is itself a system (see /system-design/distributed-task-scheduler) emitting URLs back into the frontier.

robots.txt and ethics#

Cache robots.txt per-host with 24 h TTL.
Respect Crawl-Delay, Disallow, Allow.
Respect noindex meta tags and X-Robots-Tag headers when ingesting.
Identify with a clear User-Agent and a contact URL — so host operators can email if our crawler misbehaves.

Distributed coordination#

The crawler runs on hundreds of machines. URL assignment uses consistent hashing by host: each host’s frontier is owned by exactly one shard. When a fetcher discovers a new link to a host owned by a different shard, it enqueues remotely (RPC) to that shard.

This keeps politeness state local (the one shard that owns a host is the only one fetching from it).

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: per-host throughput. 1 RPS per host × 10 K active hosts = 10 K pages/sec. To hit 30 K pages/sec, we need 30 K active hosts simultaneously — which means breadth of host coverage matters as much as depth. A crawler stuck on a few mega-hosts (Wikipedia, GitHub) can’t hit throughput targets.

Bottleneck #2: the seen-URL index. A Bloom filter for 1 trillion URLs at 0.1% false-positive rate needs ~14 bits per item = ~1.7 TB of RAM, sharded. Workable, but a real cost. Tiered Bloom (a small in-memory recent layer + a larger on-disk historical layer) is the production shape.

Bottleneck #3: content extraction CPU. Parsing 30 K pages/sec through an HTML parser, JavaScript renderer (for SPAs), and link extractor costs significant CPU. For JS-heavy sites, headless browser rendering (Puppeteer, Chrome DevTools Protocol) is 100× more expensive than parsing static HTML. Most crawlers reserve JS rendering for a tagged subset of high-value sites.

Alternative I’d push back on: a single global queue ordered by priority. Every dispatcher would contend for the same queue; per-host politeness can’t be enforced cleanly. Per-host sub-queues with consistent-hash assignment is the right structure.

What breaks first at 10× scale (10 B pages/day): the link graph. Storing every (source, target) pair for 10× pages is 10× the storage; updates to PageRank-class metrics become harder to keep current. Solution: compute aggregates rather than store the raw graph; tier the link table by inbound count (the top 100 M nodes get full storage, the long tail gets sampled).

Companies this resembles#

Googlebot, Bingbot, Common Crawl (open dataset), Internet Archive’s Heritrix. Inside specific companies, smaller crawlers exist for product price comparison, news monitoring, security scanning.

Distributed Search — what consumes the crawler’s output.
Distributed Task Scheduler — the freshness scheduler.
Blob Store — content storage substrate.
Rate Limiter — same primitive used for politeness.