Distributed Task Scheduler
Priority, idempotency, deduplication, retry policies, resource capacity allocation.
Use cases#
A distributed task scheduler runs work at the right time on the right worker, surviving worker failure and worker overload. It differs from a plain queue (Distributed Messaging Queue) in three ways: tasks have schedules (now, in 5 min, every Tuesday), priorities, and dependencies (run B after A succeeds). The dominant use cases:
- Background jobs from web requests — email send, image resize, billing reconciliation. Sidekiq / Celery / Resque.
- Cron at scale — nightly database backup, weekly report generation. Cron on a single box doesn’t survive box failure.
- ETL pipelines — Airflow DAGs running 1000s of data jobs/day with cross-task dependencies.
- Workflows — checkout flows that span retries, human approvals, compensation logic. Temporal / Cadence.
- Distributed cron + fan-out — at midnight, send “your subscription renewed” emails to 10M users.
Functional requirements#
- Submit tasks with optional (run_at, priority, retries, timeout, idempotency_key, depends_on).
- Schedule recurring tasks via cron expressions or interval definitions.
- Distribute tasks across a worker pool; each task runs on exactly one worker.
- Retry on failure with backoff; route to DLQ after N attempts.
- Track task state: pending, running, completed, failed, retrying, dead.
- Cancel, pause, resume tasks; replay failed ones.
- Inspect a task’s history (which worker, how long, what errored).
Non-functional requirements#
- Scheduling latency: tasks scheduled “now” should run within seconds; tasks scheduled at t should run within seconds of t.
- Throughput: 1k tasks/sec is small; 100k tasks/sec is a serious system. Sidekiq is documented at 200k jobs/sec on a single Redis cluster.
- Reliability: exactly-once execution cannot be guaranteed when a worker can crash mid-task; at-least-once delivery plus idempotent tasks is the practical answer.
- Availability: 99.95% minimum. Scheduler down → backlog grows → eventual recovery storm.
- Fairness: low-priority tenants should not starve high-priority ones; one customer’s 10k-task burst should not block another’s 1 task.
High-level design#
    client            scheduler                                      worker pool
    ────────          ────────────────                               ─────────────────
    submit task ──>   persist to durable store ──>  dispatcher  ──>  worker 1
                      (Postgres, Redis,             selects ready    worker 2
                       Cassandra, Kafka)            tasks, leases    ...
                                                    to worker        worker N
                                                                        │
                                 ▲                                      │
                                 └── status updates ────────────────────┘

    scheduled                                                   ack on success
    recurring tasks ──>  cron scanner  ──>  enqueue when due    nack → retry queue
                         (every ~10s)

Three subsystems: the store (durable record of every task), the scheduler / dispatcher (decides what to run next, leases to workers), and the workers (pull leases, execute, report).
Detailed design#
Durable storage of task state#
A task is a row in a table — Postgres works, Redis works for ephemeral, Cassandra works for very high volume. Schema essentials:

    task {
      id               UUID
      type             string      // "send_email", "resize_image", "run_dag.step.3"
      payload          JSON
      state            enum { pending, running, succeeded, failed, dead }
      scheduled_at     timestamp
      attempts         int
      max_attempts     int
      lease_until      timestamp   // currently-leased worker holds the lease until this time
      worker_id        string
      idempotency_key  string
      priority         int
      depends_on       [task_id]
      result           JSON / blob
      history          [event]     // each transition logged
    }

Leasing for at-least-once execution#
When a worker takes a task, it acquires a lease with a TTL (e.g. 5 minutes). The lease is recorded in the task row. If the worker completes within the lease, it marks state=succeeded. If the worker dies or the lease expires, the scheduler re-leases the task to another worker.

    -- Atomic lease acquisition (SQL):
    UPDATE tasks
    SET state = 'running',
        worker_id = ?,
        lease_until = NOW() + INTERVAL '5 minutes',
        attempts = attempts + 1
    WHERE id = (
        SELECT id FROM tasks
        WHERE state = 'pending' AND scheduled_at <= NOW()
        ORDER BY priority DESC, scheduled_at ASC
        LIMIT 1
        FOR UPDATE SKIP LOCKED  -- avoids cross-worker contention
    )
    RETURNING *;

SKIP LOCKED (Postgres / MySQL 8) is the secret sauce: multiple worker pollers don’t block each other, because each poller skips rows another transaction has already locked instead of waiting on them.
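The lease lifecycle described above (acquire, complete within the TTL, reclaim on expiry) can be sketched in Python. This is a minimal in-memory illustration, not a production design: `LeaseStore`, `Task`, and their method names are hypothetical, a dict stands in for the tasks table, and the caller passes `now` explicitly so the behavior is easy to follow.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    id: str
    priority: int = 0
    scheduled_at: float = 0.0
    state: str = "pending"
    worker_id: Optional[str] = None
    lease_until: float = 0.0
    attempts: int = 0


class LeaseStore:
    """In-memory stand-in for the tasks table (hypothetical, for illustration)."""

    def __init__(self, lease_ttl: float = 300.0):
        self.lease_ttl = lease_ttl
        self.tasks: Dict[str, Task] = {}

    def add(self, task: Task) -> None:
        self.tasks[task.id] = task

    def acquire(self, worker_id: str, now: Optional[float] = None) -> Optional[Task]:
        """Lease the best ready task: highest priority, then oldest scheduled_at."""
        now = time.time() if now is None else now
        ready = [t for t in self.tasks.values()
                 if t.state == "pending" and t.scheduled_at <= now]
        if not ready:
            return None
        task = min(ready, key=lambda t: (-t.priority, t.scheduled_at))
        task.state = "running"
        task.worker_id = worker_id
        task.lease_until = now + self.lease_ttl
        task.attempts += 1
        return task

    def complete(self, task_id: str, worker_id: str, now: Optional[float] = None) -> bool:
        """Mark succeeded only if this worker still holds an unexpired lease."""
        now = time.time() if now is None else now
        task = self.tasks[task_id]
        if task.state != "running" or task.worker_id != worker_id or task.lease_until < now:
            return False
        task.state = "succeeded"
        return True

    def reclaim_expired(self, now: Optional[float] = None) -> int:
        """Return expired leases to 'pending' so another worker can take them."""
        now = time.time() if now is None else now
        reclaimed = 0
        for task in self.tasks.values():
            if task.state == "running" and task.lease_until < now:
                task.state = "pending"
                task.worker_id = None
                reclaimed += 1
        return reclaimed
```

Rejecting a late `complete` call from a worker whose lease has lapsed is one possible policy. Because the task may already be running elsewhere at that point, the task body still has to be idempotent.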
Idempotency#
A retry of “send email” must not actually send the email twice. The pattern:

    task.idempotency_key = sha256(user_id + email_template + as_of_date)

    at worker:
      if seen_keys.contains(task.idempotency_key)
         and seen_keys.get(task.idempotency_key) is recent:
        return cached_result
      else:
        result = execute_side_effect(task)
        seen_keys.set(task.idempotency_key, result, ttl=24h)
        return result

For non-idempotent operations (charge a card), the downstream API ideally accepts an idempotency key itself; Stripe’s Idempotency-Key header is the canonical model.
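The worker-side check can be made concrete with a small Python sketch. It assumes a single process and an in-memory dict in place of a shared store such as Redis; `IdempotentExecutor` and `idempotency_key` are hypothetical names introduced for illustration.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple


def idempotency_key(*parts: str) -> str:
    """Hash the identifying fields of a task into a stable key."""
    # A separator keeps ("ab", "c") and ("a", "bc") from colliding.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class IdempotentExecutor:
    """Caches results by key so a retry does not repeat the side effect."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._seen: Dict[str, Tuple[float, Any]] = {}

    def run(self, key: str, side_effect: Callable[[], Any],
            now: Optional[float] = None) -> Any:
        now = time.time() if now is None else now
        hit = self._seen.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # recent result: skip the side effect
        result = side_effect()
        self._seen[key] = (now, result)
        return result
```

This sketch does not close the window between performing the side effect and recording the key: a crash in that gap still causes a duplicate on retry, which is why a downstream idempotency key is the stronger guarantee.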
Retries and dead-letter#

    retry_after(attempt) = min(MAX, BASE * 2^attempt + jitter)

Exponential backoff with jitter. After max_attempts, the task transitions to dead. A human (or automation) inspects the DLQ and either fixes-and-redrives or drops.
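A minimal Python sketch of the backoff formula and the transition to dead follows. The constants `BASE_SECONDS` and `MAX_SECONDS`, the uniform jitter range, and the `next_state` helper are assumptions chosen for illustration, not values from any particular scheduler.

```python
import random
from typing import Optional

BASE_SECONDS = 1.0    # assumed base delay
MAX_SECONDS = 300.0   # assumed cap on any single delay


def retry_after(attempt: int, base: float = BASE_SECONDS,
                cap: float = MAX_SECONDS,
                rng: Optional[random.Random] = None) -> float:
    """min(MAX, BASE * 2^attempt + jitter), with jitter drawn from [0, base]."""
    rng = rng or random
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponential = base * (2 ** min(attempt, 32))
    jitter = rng.uniform(0, base)
    return min(cap, exponential + jitter)


def next_state(attempts: int, max_attempts: int) -> str:
    """After max_attempts the task goes to the dead-letter queue."""
    return "dead" if attempts >= max_attempts else "retrying"
```

The jitter spreads out retries from tasks that failed at the same moment, so a recovering dependency is not hit by a synchronized wave.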
Errors should be classified at the worker:

    RetryableError(temporary)     → backoff and retry
    NonRetryableError(permanent)  → fail immediately, no retries (e.g. validation error)

Priority and fairness#
A naive priority queue starves low-priority tasks. Two production patterns:
- Weighted fair queueing — each priority level gets a worker share proportional to its weight (e.g. P0=8, P1=4, P2=1). Even under load, P2 still drains.
- Tenant isolation — per-tenant queues with rate limits, so no single tenant can monopolize the fleet. Kafka with tenant-keyed partitioning is a common substrate.
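The weighted sharing in the first pattern can be approximated with stride scheduling, sketched below in Python. This is one simple way to get weight-proportional shares, not the only form of weighted fair queueing; `WeightedFairDispatcher` and its methods are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict, Optional


class WeightedFairDispatcher:
    """Stride scheduling: each level advances a virtual clock by 1/weight per task."""

    def __init__(self, weights: Dict[str, int]):
        self.weights = dict(weights)
        self.queues: Dict[str, Deque[str]] = {lvl: deque() for lvl in weights}
        self.passes: Dict[str, float] = {lvl: 0.0 for lvl in weights}

    def submit(self, level: str, task_id: str) -> None:
        queue = self.queues[level]
        if not queue:
            # A level that was idle must not cash in credit accumulated while
            # empty, so catch its clock up to the slowest active level.
            active = [self.passes[l] for l, q in self.queues.items() if q]
            if active:
                self.passes[level] = max(self.passes[level], min(active))
        queue.append(task_id)

    def next_task(self) -> Optional[str]:
        candidates = [lvl for lvl, q in self.queues.items() if q]
        if not candidates:
            return None
        # Serve the non-empty level with the smallest virtual clock.
        level = min(candidates, key=lambda lvl: self.passes[lvl])
        self.passes[level] += 1.0 / self.weights[level]
        return self.queues[level].popleft()
```

With weights P0=8, P1=4, P2=1 and all three queues backlogged, every run of 13 dispatches contains 8 P0 tasks, 4 P1 tasks, and 1 P2 task, so the lowest level keeps draining under load.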
Cron and recurring schedules#
A separate cron scanner runs every N seconds (10 s is typical), evaluates every active cron rule, and enqueues tasks whose next run time has passed. Idempotency keys (task_type + run_at) prevent double-enqueue if the scanner restarts mid-scan.
For distributed cron specifically, leader election (Raft, ZooKeeper, Postgres advisory locks) ensures only one scanner is active at a time.
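A scanner pass with (task_type, run_at) deduplication might look like the Python sketch below. It uses fixed intervals rather than cron expressions to stay self-contained, and keeps the seen-key set in memory; a real scanner would persist both the keys and each rule's next run time. `RecurringRule` and `CronScanner` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class RecurringRule:
    task_type: str
    interval_seconds: int
    next_run_at: int  # epoch seconds


class CronScanner:
    def __init__(self) -> None:
        self.enqueued_keys: Set[Tuple[str, int]] = set()
        self.queue: List[Tuple[str, int]] = []

    def scan(self, rules: List[RecurringRule], now: int) -> int:
        """Enqueue every run that is due; return how many tasks were added."""
        added = 0
        for rule in rules:
            if rule.interval_seconds <= 0:
                raise ValueError("interval_seconds must be positive")
            while rule.next_run_at <= now:
                key = (rule.task_type, rule.next_run_at)  # idempotency key
                if key not in self.enqueued_keys:
                    self.enqueued_keys.add(key)
                    self.queue.append(key)
                    added += 1
                rule.next_run_at += rule.interval_seconds
        return added
```

This version enqueues every missed run after downtime (catch-up). Skipping straight to the latest due run is an equally valid policy and depends on the task.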
Workflows and dependencies#
Beyond single tasks, real systems need orchestration:

    order_placed (parent)
      ├── reserve_inventory ── on_success ──┐
      ├── charge_card       ── on_success ──┼── ship_order ── on_success ── send_receipt
      └── send_confirmation                 │
      on_failure of any of the above: compensate (refund, release reservation)

Temporal / Cadence model this as deterministic workflow code: write a regular function, and the runtime persists every step’s input/output and replays it on worker crash. Airflow models it as declarative DAGs: describe the graph, and the executor runs it.
The hard part is compensation — if step 3 fails after steps 1-2 already had side effects, you need to undo them. Saga patterns and explicit compensation handlers are the standard answer.
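The compensation idea can be illustrated with a small saga runner in Python. It is a simplified sketch: steps run sequentially rather than as the fan-out shown earlier, nothing is persisted, and failures inside a compensation handler are not handled. `run_saga` and the `Step` tuple are hypothetical names.

```python
from typing import Callable, List, Tuple

# (name, action, compensation) for each step of the workflow.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log
```

Undoing in reverse order mirrors how the side effects were layered: the card is refunded before the inventory reservation is released.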
Resource capacity allocation#
Workers have limits — CPU, memory, GPU, DB connections. The scheduler should respect them:
- Worker tags: task requires gpu:true; only GPU-equipped workers can lease it.
- Concurrency caps: at most 5 “send_email” tasks running at once cluster-wide.
- Rate limits: at most 100 SMS sends/sec (downstream provider limit).
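The three checks above can be sketched as an admission step that runs before a lease is granted. The Python below is an in-memory illustration under stated assumptions: `Admission` and `TokenBucket` are hypothetical names, and a real cluster-wide cap or rate limit would need shared state rather than a local counter.

```python
from collections import Counter
from typing import Dict, Set


class TokenBucket:
    """Rate limit: refills at rate_per_sec, holds at most `capacity` tokens."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = 0.0

    def try_take(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Admission:
    """Worker-tag matching plus per-task-type concurrency caps."""

    def __init__(self, caps: Dict[str, int]):
        self.caps = dict(caps)
        self.running: Counter = Counter()

    def can_lease(self, task_type: str, required_tags: Set[str],
                  worker_tags: Set[str]) -> bool:
        if not required_tags <= worker_tags:  # worker lacks a required tag
            return False
        cap = self.caps.get(task_type)
        return cap is None or self.running[task_type] < cap

    def start(self, task_type: str) -> None:
        self.running[task_type] += 1

    def finish(self, task_type: str) -> None:
        self.running[task_type] -= 1
```

A dispatcher would call `can_lease` before handing out a task, `start` when the lease is granted, and `finish` when the task ends or its lease expires.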
Trade-offs#
Other axes:
- Pull vs push dispatch: pull (worker polls scheduler) scales linearly and gives natural backpressure; push (scheduler sends to worker) has lower latency but is harder to load-balance.
- In-process scheduler vs distributed: a single Postgres + SKIP LOCKED handles tens of thousands of tasks/sec on commodity hardware. A separate distributed scheduler is justified beyond that.
- Sync execution vs async with poll: fast tasks (< 1 s) often complete synchronously inside the web request; slower tasks defer to async. The cut-off depends on your latency budget.
Real-world examples#
- Sidekiq — Ruby’s dominant background job framework. Redis-backed, 200k jobs/sec on a single cluster.
- Celery — Python equivalent; supports RabbitMQ, Redis, Kafka backends. Heavy in scientific computing.
- Apache Airflow — DAG-based ETL scheduler; Airbnb invented it (2014), now Apache top-level. Used at most data-team-having companies.
- Temporal — successor to Uber’s Cadence; durable execution as code. Used by Coinbase, Stripe, Box, Snap.
- AWS Step Functions — managed workflow engine; state machine JSON definitions; pay-per-transition.
- Quartz — JVM enterprise scheduler; powers many in-house cron systems.
- GitLab Sidekiq + cron schedulers — runs ~50M jobs/day across the gitlab.com fleet.
- Google Borg / Kubernetes Jobs — at the infrastructure layer; not a task scheduler in the application sense but the dispatch primitive underneath.
Related building blocks#
- Distributed Messaging Queue — the transport layer underneath most schedulers.
- Databases — durable task state lives here.
- Pub-Sub — used to broadcast task lifecycle events.
- Sequencer — generates idempotency keys and trace IDs.