Distributed Task Scheduler — System Design

Use cases#

A distributed task scheduler runs work at the right time on the right worker, surviving worker failure and worker overload. Different from a plain queue (Distributed Messaging Queue) in three ways: tasks have schedules (now, in 5 min, every Tuesday), priorities, and dependencies (run B after A succeeds). The dominant use cases:

Background jobs from web requests — email send, image resize, billing reconciliation. Sidekiq / Celery / Resque.
Cron at scale — nightly database backup, weekly report generation. Cron on a single box doesn’t survive box failure.
ETL pipelines — Airflow DAGs running 1000s of data jobs/day with cross-task dependencies.
Workflows — checkout flows that span retries, human approvals, compensation logic. Temporal / Cadence.
Distributed cron + fan-out — at midnight, send “your subscription renewed” emails to 10M users.

Functional requirements#

Submit tasks with optional (run_at, priority, retries, timeout, idempotency_key, depends_on).
Schedule recurring tasks via cron expressions or interval definitions.
Distribute tasks across a worker pool; each task runs on exactly one worker.
Retry on failure with backoff; route to DLQ after N attempts.
Track task state: pending, running, completed, failed, retrying, dead.
Cancel, pause, resume tasks; replay failed ones.
Inspect a task’s history (which worker, how long, what errored).

Non-functional requirements#

Scheduling latency: tasks scheduled “now” should run within seconds; tasks scheduled at t should run within seconds of t.
Throughput: 1k tasks/sec is small; 100k tasks/sec is a serious system. Sidekiq is documented at 200k jobs/sec on a single Redis cluster.
Reliability: exactly-once execution is impossible; at-least-once + idempotent task is the practical answer.
Availability: 99.95% minimum. Scheduler down → backlog grows → eventual recovery storm.
Fairness: low-priority tenants should not starve high-priority ones; one customer’s 10k-task burst should not block another’s 1 task.

High-level design#

   client                    scheduler                          worker pool
  ────────              ────────────────                    ─────────────────
  submit task ──> persist to durable store ──> dispatcher ──> worker 1
                  (Postgres, Redis,             selects ready  worker 2
                   Cassandra, Kafka)            tasks, leases  ...
                                                 to worker     worker N
                                                                  │
                  ▲                                               │
                  └── status updates ────────────────────────────┘

  scheduled                                                 ack on success
  recurring tasks ──> cron scanner ──> enqueue when due     nack → retry queue
                       (every ~10s)

Three subsystems: the store (durable record of every task), the scheduler / dispatcher (decides what to run next, leases to workers), and the workers (pulls leases, executes, reports).

Detailed design#

Durable storage of task state#

A task is a row in a table — Postgres works, Redis works for ephemeral, Cassandra works for very high volume. Schema essentials:

task {
  id            UUID
  type          string         // "send_email", "resize_image", "run_dag.step.3"
  payload       JSON
  state         enum { pending, running, succeeded, failed, dead }
  scheduled_at  timestamp
  attempts      int
  max_attempts  int
  lease_until   timestamp      // currently-leased worker holds the lease until this time
  worker_id     string
  idempotency_key string
  priority      int
  depends_on    [task_id]
  result        JSON / blob
  history       [event]        // each transition logged
}

Leasing for at-least-once execution#

When a worker takes a task, it acquires a lease with a TTL (e.g. 5 minutes). The lease is recorded in the task row. If the worker completes within the lease, it marks state=succeeded. If the worker dies or the lease expires, the scheduler re-leases the task to another worker.

// Atomic lease acquisition (SQL):
UPDATE tasks SET
    state = 'running',
    worker_id = ?,
    lease_until = NOW() + INTERVAL '5 minutes',
    attempts = attempts + 1
WHERE id = (
    SELECT id FROM tasks
    WHERE state = 'pending' AND scheduled_at <= NOW()
    ORDER BY priority DESC, scheduled_at ASC
    FOR UPDATE SKIP LOCKED      -- avoids cross-worker contention
    LIMIT 1
)
RETURNING *;

SKIP LOCKED (Postgres / MySQL 8) is the secret sauce — multiple worker pollers don’t block each other.

Idempotency#

A retry of “send email” must not actually send the email twice. The pattern:

task.idempotency_key = sha256(user_id + email_template + as_of_date)

at worker:
   if seen_keys.contains(task.idempotency_key) and seen_keys.get(task.idempotency_key) recent:
       return cached_result
   else:
       result = execute_side_effect(task)
       seen_keys.set(task.idempotency_key, result, ttl=24h)
       return result

For non-idempotent operations (charge a card), the downstream API ideally accepts an idempotency key itself — Stripe’s Idempotency-Key header is the canonical model.

Retries and dead-letter#

retry_after(attempt) = min(MAX, BASE * 2^attempt + jitter)

Exponential backoff with jitter. After max_attempts, the task transitions to dead. A human (or automation) inspects the DLQ and either fixes-and-redrives or drops.

Errors should be classified at the worker:

RetryableError(temporary)    → backoff and retry
NonRetryableError(permanent) → fail immediately, no retries (e.g. validation error)

Priority and fairness#

A naive priority queue starves low-priority tasks. Two production patterns:

Weighted fair queueing — each priority level gets a worker share proportional to its weight (e.g. P0=8, P1=4, P2=1). Even under load, P2 still drains.
Tenant isolation — per-tenant queues with rate limits, so no single tenant can monopolize the fleet. Kafka with tenant-keyed partitioning is a common substrate.

Cron and recurring schedules#

A separate cron scanner runs every N seconds (10 s is typical), evaluates every active cron rule, and enqueues tasks whose next run time has passed. Idempotency keys (task_type + run_at) prevent double-enqueue if the scanner restarts mid-scan.

For distributed cron specifically, leader election (Raft, ZooKeeper, Postgres advisory locks) ensures only one scanner is active at a time.

Workflows and dependencies#

Beyond single tasks, real systems need orchestration:

order_placed (parent)
  ├── reserve_inventory  ── on_success ──┐
  ├── charge_card        ── on_success ──┼── ship_order ── on_success ── send_receipt
  └── send_confirmation                  │
        on_failure of any of the above: compensate (refund, release reservation)

Temporal / Cadence model this as deterministic workflow code — write a regular function, the runtime persists every step’s input/output, replays on worker crash. Airflow models it as declarative DAGs — describe the graph, the executor runs it.

The hard part is compensation — if step 3 fails after steps 1-2 already had side effects, you need to undo them. Saga patterns and explicit compensation handlers are the standard answer.

Resource capacity allocation#

Workers have limits — CPU, memory, GPU, DB connections. The scheduler should respect them:

Worker tags — task requires gpu:true; only GPU-equipped workers can lease it.
Concurrency caps — at most 5 “send_email” tasks running at once cluster-wide.
Rate limits — at most 100 SMS sends/sec (downstream provider limit).

Trade-offs#

Queue-based (Sidekiq, Celery, Resque) — simple mental model, fast, good for fire-and-forget background jobs. Hard to express dependencies, retries, long-running workflows.

Workflow engines (Temporal, Cadence, Step Functions) — durable execution, native dependency graph, compensation, long-running workflows up to months. Heavier dependency (workflow service + DB); learning curve.

Other axes:

Pull vs push dispatch — pull (worker polls scheduler) scales linearly, gives natural backpressure; push (scheduler sends to worker) lower latency but harder to load-balance.
In-process scheduler vs distributed — a single Postgres + SKIP LOCKED handles tens of thousands of tasks/sec on commodity hardware. A separate distributed scheduler is justified beyond that.
Sync execution vs async with poll — fast tasks (< 1 s) often complete synchronously inside the web request; slower tasks defer to async. The cut-off depends on your latency budget.

Real-world examples#

Sidekiq — Ruby’s dominant background job framework. Redis-backed, 200k jobs/sec on a single cluster.
Celery — Python equivalent; supports RabbitMQ, Redis, Kafka backends. Heavy in scientific computing.
Apache Airflow — DAG-based ETL scheduler; Airbnb invented it (2014), now Apache top-level. Used at most data-team-having companies.
Temporal — successor to Uber’s Cadence; durable execution as code. Used by Coinbase, Stripe, Box, Snap.
AWS Step Functions — managed workflow engine; state machine JSON definitions; pay-per-transition.
Quartz — JVM enterprise scheduler; powers many in-house cron systems.
GitLab Sidekiq + cron schedulers — runs ~50M jobs/day across the gitlab.com fleet.
Google Borg / Kubernetes Jobs — at the infrastructure layer; not a task scheduler in the application sense but the dispatch primitive underneath.

Distributed Messaging Queue — the transport layer underneath most schedulers.
Databases — durable task state lives here.
Pub-Sub — used to broadcast task lifecycle events.
Sequencer — generates idempotency keys and trace IDs.

Use cases#

Functional requirements#

Non-functional requirements#

High-level design#

Detailed design#

Durable storage of task state#

Leasing for at-least-once execution#

Idempotency#

Retries and dead-letter#

Priority and fairness#

Cron and recurring schedules#

Workflows and dependencies#

Resource capacity allocation#

Trade-offs#

Real-world examples#

Related building blocks#