Distributed Task Scheduler

Priority, idempotency, deduplication, retry policies, resource capacity allocation.

Building Block Advanced
7 min read
scheduler jobs workflow
Companies this resembles: Airflow · Temporal · Celery · Sidekiq · Quartz

Use cases#

A distributed task scheduler runs work at the right time on the right worker, surviving worker failure and worker overload. Different from a plain queue (Distributed Messaging Queue) in three ways: tasks have schedules (now, in 5 min, every Tuesday), priorities, and dependencies (run B after A succeeds). The dominant use cases:

  • Background jobs from web requests — email send, image resize, billing reconciliation. Sidekiq / Celery / Resque.
  • Cron at scale — nightly database backup, weekly report generation. Cron on a single box doesn’t survive box failure.
  • ETL pipelines — Airflow DAGs running 1000s of data jobs/day with cross-task dependencies.
  • Workflows — checkout flows that span retries, human approvals, compensation logic. Temporal / Cadence.
  • Distributed cron + fan-out — at midnight, send “your subscription renewed” emails to 10M users.

Functional requirements#

  • Submit tasks with optional (run_at, priority, retries, timeout, idempotency_key, depends_on).
  • Schedule recurring tasks via cron expressions or interval definitions.
  • Distribute tasks across a worker pool; each task runs on exactly one worker.
  • Retry on failure with backoff; route to DLQ after N attempts.
  • Track task state: pending, running, completed, failed, retrying, dead.
  • Cancel, pause, resume tasks; replay failed ones.
  • Inspect a task’s history (which worker, how long, what errored).

Non-functional requirements#

  • Scheduling latency: tasks scheduled “now” should run within seconds; tasks scheduled at t should run within seconds of t.
  • Throughput: 1k tasks/sec is small; 100k tasks/sec is a serious system. Sidekiq is documented at 200k jobs/sec on a single Redis cluster.
  • Reliability: exactly-once execution is impossible; at-least-once + idempotent task is the practical answer.
  • Availability: 99.95% minimum. Scheduler down → backlog grows → eventual recovery storm.
  • Fairness: low-priority tenants should not starve high-priority ones; one customer’s 10k-task burst should not block another’s 1 task.

High-level design#

client scheduler worker pool
──────── ──────────────── ─────────────────
submit task ──> persist to durable store ──> dispatcher ──> worker 1
(Postgres, Redis, selects ready worker 2
Cassandra, Kafka) tasks, leases ...
to worker worker N
▲ │
└── status updates ────────────────────────────┘
scheduled ack on success
recurring tasks ──> cron scanner ──> enqueue when due nack → retry queue
(every ~10s)

Three subsystems: the store (durable record of every task), the scheduler / dispatcher (decides what to run next, leases to workers), and the workers (pulls leases, executes, reports).

Detailed design#

Durable storage of task state#

A task is a row in a table — Postgres works, Redis works for ephemeral, Cassandra works for very high volume. Schema essentials:

task {
id UUID
type string // "send_email", "resize_image", "run_dag.step.3"
payload JSON
state enum { pending, running, succeeded, failed, dead }
scheduled_at timestamp
attempts int
max_attempts int
lease_until timestamp // currently-leased worker holds the lease until this time
worker_id string
idempotency_key string
priority int
depends_on [task_id]
result JSON / blob
history [event] // each transition logged
}

Leasing for at-least-once execution#

When a worker takes a task, it acquires a lease with a TTL (e.g. 5 minutes). The lease is recorded in the task row. If the worker completes within the lease, it marks state=succeeded. If the worker dies or the lease expires, the scheduler re-leases the task to another worker.

// Atomic lease acquisition (SQL):
UPDATE tasks SET
state = 'running',
worker_id = ?,
lease_until = NOW() + INTERVAL '5 minutes',
attempts = attempts + 1
WHERE id = (
SELECT id FROM tasks
WHERE state = 'pending' AND scheduled_at <= NOW()
ORDER BY priority DESC, scheduled_at ASC
FOR UPDATE SKIP LOCKED -- avoids cross-worker contention
LIMIT 1
)
RETURNING *;

SKIP LOCKED (Postgres / MySQL 8) is the secret sauce — multiple worker pollers don’t block each other.

Idempotency#

A retry of “send email” must not actually send the email twice. The pattern:

task.idempotency_key = sha256(user_id + email_template + as_of_date)
at worker:
if seen_keys.contains(task.idempotency_key) and seen_keys.get(task.idempotency_key) recent:
return cached_result
else:
result = execute_side_effect(task)
seen_keys.set(task.idempotency_key, result, ttl=24h)
return result

For non-idempotent operations (charge a card), the downstream API ideally accepts an idempotency key itself — Stripe’s Idempotency-Key header is the canonical model.

Retries and dead-letter#

retry_after(attempt) = min(MAX, BASE * 2^attempt + jitter)

Exponential backoff with jitter. After max_attempts, the task transitions to dead. A human (or automation) inspects the DLQ and either fixes-and-redrives or drops.

Errors should be classified at the worker:

RetryableError(temporary) → backoff and retry
NonRetryableError(permanent) → fail immediately, no retries (e.g. validation error)

Priority and fairness#

A naive priority queue starves low-priority tasks. Two production patterns:

  • Weighted fair queueing — each priority level gets a worker share proportional to its weight (e.g. P0=8, P1=4, P2=1). Even under load, P2 still drains.
  • Tenant isolation — per-tenant queues with rate limits, so no single tenant can monopolize the fleet. Kafka with tenant-keyed partitioning is a common substrate.

Cron and recurring schedules#

A separate cron scanner runs every N seconds (10 s is typical), evaluates every active cron rule, and enqueues tasks whose next run time has passed. Idempotency keys (task_type + run_at) prevent double-enqueue if the scanner restarts mid-scan.

For distributed cron specifically, leader election (Raft, ZooKeeper, Postgres advisory locks) ensures only one scanner is active at a time.

Workflows and dependencies#

Beyond single tasks, real systems need orchestration:

order_placed (parent)
├── reserve_inventory ── on_success ──┐
├── charge_card ── on_success ──┼── ship_order ── on_success ── send_receipt
└── send_confirmation │
on_failure of any of the above: compensate (refund, release reservation)

Temporal / Cadence model this as deterministic workflow code — write a regular function, the runtime persists every step’s input/output, replays on worker crash. Airflow models it as declarative DAGs — describe the graph, the executor runs it.

The hard part is compensation — if step 3 fails after steps 1-2 already had side effects, you need to undo them. Saga patterns and explicit compensation handlers are the standard answer.

Resource capacity allocation#

Workers have limits — CPU, memory, GPU, DB connections. The scheduler should respect them:

  • Worker tags — task requires gpu:true; only GPU-equipped workers can lease it.
  • Concurrency caps — at most 5 “send_email” tasks running at once cluster-wide.
  • Rate limits — at most 100 SMS sends/sec (downstream provider limit).

Trade-offs#

Queue-based (Sidekiq, Celery, Resque) — simple mental model, fast, good for fire-and-forget background jobs. Hard to express dependencies, retries, long-running workflows.
Workflow engines (Temporal, Cadence, Step Functions) — durable execution, native dependency graph, compensation, long-running workflows up to months. Heavier dependency (workflow service + DB); learning curve.

Other axes:

  • Pull vs push dispatch — pull (worker polls scheduler) scales linearly, gives natural backpressure; push (scheduler sends to worker) lower latency but harder to load-balance.
  • In-process scheduler vs distributed — a single Postgres + SKIP LOCKED handles tens of thousands of tasks/sec on commodity hardware. A separate distributed scheduler is justified beyond that.
  • Sync execution vs async with poll — fast tasks (< 1 s) often complete synchronously inside the web request; slower tasks defer to async. The cut-off depends on your latency budget.

Real-world examples#

  • Sidekiq — Ruby’s dominant background job framework. Redis-backed, 200k jobs/sec on a single cluster.
  • Celery — Python equivalent; supports RabbitMQ, Redis, Kafka backends. Heavy in scientific computing.
  • Apache Airflow — DAG-based ETL scheduler; Airbnb invented it (2014), now Apache top-level. Used at most data-team-having companies.
  • Temporal — successor to Uber’s Cadence; durable execution as code. Used by Coinbase, Stripe, Box, Snap.
  • AWS Step Functions — managed workflow engine; state machine JSON definitions; pay-per-transition.
  • Quartz — JVM enterprise scheduler; powers many in-house cron systems.
  • GitLab Sidekiq + cron schedulers — runs ~50M jobs/day across the gitlab.com fleet.
  • Google Borg / Kubernetes Jobs — at the infrastructure layer; not a task scheduler in the application sense but the dispatch primitive underneath.
Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.