Web Crawler

URL frontier, politeness, deduplication, content extraction, crawl traps, index updates.

System Advanced
8 min read
crawling frontier dedup politeness
Companies this resembles: Googlebot · Bingbot · Common Crawl · Internet Archive

Step 1 — Clarify Requirements#

Functional

  • Given a set of seed URLs, discover and download web pages reachable from them.
  • Respect robots.txt, host-level politeness (don’t hammer one host), and crawl-delay directives.
  • Extract content (text, links, structured data) from HTML.
  • Deduplicate URLs and detect near-duplicate content.
  • Re-crawl pages periodically; prioritize freshness based on per-page change frequency.
  • Out of scope: the search ranking that follows.

Non-functional

  • 99.9% availability (crawler downtime is recoverable; bounded backlog).
  • Throughput target: download ~1 B pages per day. (Production at Google or Bing is higher; 1 B/day fits the design we want to teach.)
  • Storage: petabytes of crawled HTML plus extracted content.
  • Politeness is non-negotiable — getting banned by a host kills future coverage.

Step 2 — Capacity Estimation#

  • Pages crawled per day: 1 B → ~12 K pages/sec average, 30 K/sec peak.
  • Per-page size: HTML averages 200 KB compressed; plus images / assets we ignore. Net storage: 1 B × 200 KB = 200 TB/day raw HTML.
  • Indexed content (text-only after stripping): ~30 KB/page → 30 TB/day.
  • URL frontier size: distinct URLs known to exist ~1 T (1 trillion). Stored as a sharded queue / KV.
  • DNS lookups: ~10 K/sec, mostly absorbed by a local DNS cache hitting upstream resolvers.
  • Hosts crawled concurrently: with 1 RPS per host politeness, 30 K pages/sec means 30 K distinct active hosts per second. Total hosts ever crawled: 100 M+; active at any moment: a few million.

The two design pressures: don’t lose URLs from the frontier (durability) and don’t hammer any single host (politeness).

Step 3 — System Interface#

The crawler doesn’t expose a public API — it’s a pipeline. Internal interfaces:

// URL frontier
enqueue(url, priority, source)
dequeue() → url
register_robots(host, rules)
register_dns(host, ip)
// Fetcher
fetch(url) → { status, headers, body, fetched_at }
// Content extractor
extract(html) → { text, links: [url], canonical, language }
// Dedup / sketch store
fingerprint(content) → simhash
seen(simhash) → bool

Step 4 — High-Level Design#

seed URLs
┌───────────────────────────────┐
│ URL frontier (sharded by │
│ host hash; per-host queue) │
└───────────────┬───────────────┘
URL dispatcher ── obeys per-host politeness; emits ready URLs
┌────────────────┐
│ Fetcher fleet │── DNS cache, robots cache, conditional GETs
└────────┬───────┘
Content store (S3-class, partitioned by host + crawl_date)
┌────────────────┐
│ Extractor pool │── parse HTML, extract links + text + metadata
└────────┬───────┘
┌────┴─────┐
│ │
new URLs back text + links → indexer (out of scope)
into frontier

Two loops:

  • Discovery: pages → links → new URLs → frontier.
  • Re-crawl: scheduled by per-URL freshness policy; the same URL re-enters the frontier periodically.

Step 5 — Data Model#

URL frontier (sharded by host hash, per-host FIFO queue):

key: frontier:host:{host_hash}:queue → list of urls
key: frontier:host:{host_hash}:last_fetch_ts → timestamp
key: frontier:host:{host_hash}:rps_limit → from robots.txt or default

The dispatcher only emits a URL from a host queue if now > last_fetch + 1/rps_limit.

Seen-URL index (Bloom filter + KV for false-positive resolution):

url → simhash(url) // canonical form first
seen bloom (50 GB for 1 T URLs at 0.1% FP)
on bloom hit: check KV for definitive yes/no

Content store (object store, content-addressed):

key: content/<host>/<url_hash>/<crawl_date>.gz → raw HTML
metadata: { fetched_at, etag, last_modified, simhash }

Per-URL state:

table url_state
url_hash bytes PK
first_seen timestamp
last_crawled timestamp
next_due_at timestamp // freshness scheduler
change_rate float // estimated; tunes recrawl interval
status enum(active, dead, blocked_by_robots)

Step 6 — Detailed Design#

URL frontier and politeness#

The frontier is a sharded queue with per-host sub-queues. Why per-host? Because politeness is host-scoped — we cap RPS per host (typically 1 RPS or whatever robots.txt says), and queuing per-host makes it natural to dispatch one URL at a time per host.

dispatcher loop:
for each host_shard in my_assignment:
for each host in host_shard:
if now < last_fetch[host] + cooldown[host]: continue
url = pop(frontier:host[host])
if url is None: continue
emit(url) → fetcher
last_fetch[host] = now

A host with millions of URLs is the slowest to crawl. We can’t speed it up without violating politeness (or having a private agreement with the host operator). Some hosts publish Crawl-Delay longer than 1 s; we obey.

Fetcher#

fetch(url):
resolve DNS (cached)
fetch robots.txt if stale (per-host cache, 24h TTL)
if disallowed: skip, mark url_state.status = blocked_by_robots
open HTTPS connection (pooled by host)
issue GET with conditional headers: If-None-Match (ETag), If-Modified-Since
on 304: bump last_crawled, no content store write
on 200: stream body, gzip, write to content store

Conditional GETs (If-None-Match) save 30-50% of bandwidth on re-crawls. A host that respects ETags lets us “re-crawl” cheaply.

Deduplication#

Three levels:

  1. URL normalization: lowercase host, strip default ports, sort query parameters, collapse trailing slashes. Two URLs with the same canonical form are the same URL.
  2. URL deduplication: a Bloom filter checks “have we seen this URL?”. The frontier doesn’t re-enqueue Bloom-hits.
  3. Content deduplication: simhash (or minhash) of the page content. Two pages with simhash distance ≤ K (say, 3) are near-duplicates — we keep one canonical copy.

Content dedup matters because mirror sites, syndication, and CMSs produce billions of byte-different but semantically-identical pages.

Crawl traps#

Some URL patterns are infinite (?session=... calendars going year by year forever, ?page=N listings without bound). Detection:

  • Bounded depth: don’t follow links beyond depth 30 from any seed.
  • Per-host page cap: at most X distinct URLs per host (configurable, e.g., 10 M).
  • Heuristic detection: if 95% of URLs from a host are near-duplicates by content, deprioritize the host’s depth.

Freshness scheduling#

How often to re-crawl a URL?

next_due_at = last_crawled + f(change_rate, importance)
change_rate is estimated from recent re-crawls:
if last 3 re-crawls returned 304s, double the interval (max 30 days)
if last 3 re-crawls returned new content, halve the interval (min 1 hour)
importance is from inbound link count, page rank, etc.

A news site’s homepage might be re-crawled every 10 minutes; a small Wikipedia page once per month. The freshness scheduler is itself a system (see /system-design/distributed-task-scheduler) emitting URLs back into the frontier.

robots.txt and ethics#

  • Cache robots.txt per-host with 24 h TTL.
  • Respect Crawl-Delay, Disallow, Allow.
  • Respect noindex meta tags and X-Robots-Tag headers when ingesting.
  • Identify with a clear User-Agent and a contact URL — so host operators can email if our crawler misbehaves.

Distributed coordination#

The crawler runs on hundreds of machines. URL assignment uses consistent hashing by host: each host’s frontier is owned by exactly one shard. When a fetcher discovers a new link to a host owned by a different shard, it enqueues remotely (RPC) to that shard.

This keeps politeness state local (the one shard that owns a host is the only one fetching from it).

Step 7 — Evaluation & Trade-offs#

Bottleneck #1: per-host throughput. 1 RPS per host × 10 K active hosts = 10 K pages/sec. To hit 30 K pages/sec, we need 30 K active hosts simultaneously — which means breadth of host coverage matters as much as depth. A crawler stuck on a few mega-hosts (Wikipedia, GitHub) can’t hit throughput targets.

Bottleneck #2: the seen-URL index. A Bloom filter for 1 trillion URLs at 0.1% false-positive rate needs ~14 bits per item = ~1.7 TB of RAM, sharded. Workable, but a real cost. Tiered Bloom (a small in-memory recent layer + a larger on-disk historical layer) is the production shape.

Bottleneck #3: content extraction CPU. Parsing 30 K pages/sec through an HTML parser, JavaScript renderer (for SPAs), and link extractor costs significant CPU. For JS-heavy sites, headless browser rendering (Puppeteer, Chrome DevTools Protocol) is 100× more expensive than parsing static HTML. Most crawlers reserve JS rendering for a tagged subset of high-value sites.

Alternative I’d push back on: a single global queue ordered by priority. Every dispatcher would contend for the same queue; per-host politeness can’t be enforced cleanly. Per-host sub-queues with consistent-hash assignment is the right structure.

What breaks first at 10× scale (10 B pages/day): the link graph. Storing every (source, target) pair for 10× pages is 10× the storage; updates to PageRank-class metrics become harder to keep current. Solution: compute aggregates rather than store the raw graph; tier the link table by inbound count (the top 100 M nodes get full storage, the long tail gets sampled).

Companies this resembles#

Googlebot, Bingbot, Common Crawl (open dataset), Internet Archive’s Heritrix. Inside specific companies, smaller crawlers exist for product price comparison, news monitoring, security scanning.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.