Code Deployment System — System Design

Step 1 — Clarify Requirements

Functional

A developer pushes code; the system builds an artifact and ships it to production through staged environments.
Multiple deployment strategies: rolling, blue/green, canary, feature-flag.
Automated rollback on health degradation.
Approval gates between environments where required.
Multi-service deployments with topological dependency awareness.
Out of scope here: the actual build/CI runners; secret management (assumed available).

Non-functional

99.9% availability — outage of the deployment system blocks releases but doesn’t break production. Cross-region active-active is a stretch goal.
p99 deploy duration: full production rollout under 30 minutes for a service of ~1000 instances.
Rollback within 5 minutes from “decided to roll back” to “all instances on prior version”.
Auditability: every deploy must be reproducible from git SHA + config snapshot.

Step 2 — Capacity Estimation

Deploys per day: a large org (say, ~1,000 engineers) ships ~5,000 deploys/day across ~500 services → ~3.5 deploys/min on average (5,000 ÷ 1,440 minutes), ~14/min at a 4× peak.
Artifacts stored: 5,000/day × 1 year retention × ~500 MB each = ~900 TB/year of artifacts.
Concurrent active deploys: ~100 at peak; each touches ~50-1000 instances.
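Instance updates/sec at peak: 100 deploys × 10 batches × 100 instances / 5 min interval = ~330 instance starts/sec. This is an upper bound, since a deploy's batches run one after another rather than all in the same interval. (Each is a container pull + start; the artifact CDN matters.)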
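Pipeline state: ~10 KB per pipeline run × 100 active = ~1 MB live; historical state is ~18 GB/year at 10 KB per run (5,000/day × 365), and approaches ~1 TB/year only if each run also retains roughly 500 KB of health snapshots and logs.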

Scale is modest in raw bytes; the hard parts are coordination, correctness, and safety.
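The estimates above can be re-derived from the stated inputs with a few lines of arithmetic. This is only a back-of-envelope sketch; the constant names are illustrative, and the instance-start figure assumes the worst case in which every batch of every active deploy lands in the same 5-minute window.

```python
# Back-of-envelope check of the capacity figures, using the stated inputs.
DEPLOYS_PER_DAY = 5_000
ARTIFACT_MB = 500
RETENTION_DAYS = 365
PIPELINE_STATE_KB = 10
ACTIVE_DEPLOYS = 100

# Average deploy rate over a 24-hour day.
deploys_per_min = DEPLOYS_PER_DAY / (24 * 60)

# Artifact storage for one year of retention, in TB (decimal units).
artifact_tb_per_year = DEPLOYS_PER_DAY * RETENTION_DAYS * ARTIFACT_MB / 1_000_000

# Live pipeline state held by the orchestrator, in MB.
live_state_mb = ACTIVE_DEPLOYS * PIPELINE_STATE_KB / 1_000

# Upper bound: 100 deploys x 10 batches x 100 instances in one 5-minute window.
instance_starts_per_sec = 100 * 10 * 100 / (5 * 60)

print(f"{deploys_per_min:.1f} deploys/min average")
print(f"{artifact_tb_per_year:.1f} TB of artifacts per year")
print(f"{live_state_mb:.1f} MB of live pipeline state")
print(f"{instance_starts_per_sec:.0f} instance starts/sec (upper bound)")
```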

Step 3 — System Interface

POST  /pipelines                      { source_repo, target_envs, strategy }
POST  /deploys                        { pipeline_id, artifact_id, strategy_override? }
POST  /rollback                       { deploy_id }

GET   /deploys/:id                    (status, instances, health)
GET   /pipelines/:id/runs?limit=20

POST  /webhooks/health                (called by health-checker; triggers auto-rollback)
POST  /webhooks/scm                   (git push triggers pipeline)

GET   /audit?service=X&since=...      (full deployment history)

The system is event-driven: SCM pushes trigger pipelines; pipelines emit events; deploys watch events and progress.

Step 4 — High-Level Design

  developer git push
        │
        ▼
   SCM webhook ─→ pipeline orchestrator ─→ build farm ─→ artifact store
                          │                                    │
                          ▼                                    ▼
                  deploy orchestrator             artifact CDN (regional caches)
                          │                                    ▲
                          ▼                                    │
                  target environments (per-region clusters) ────┘
                          │
                          ▼
                  health monitor ─→ auto-rollback decisions
                          │
                          ▼
                  audit + state store

The deploy orchestrator is the moving piece. It owns:

The state machine for “is this deploy in progress, paused, succeeded, failed”.
The batching policy (“update 5% of instances, wait 5 min, evaluate health, then continue”).
The rollback decision logic.
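As a rough illustration of the batching policy, the loop below updates one batch, waits, evaluates health, and then continues or stops. It is a minimal sketch, not the orchestrator itself: `plan_batches`, `run_rollout`, `update_batch`, and `is_healthy` are hypothetical names, and the real system would persist state between steps rather than run in a single process.

```python
import time
from typing import Callable, List, Sequence, Tuple


def plan_batches(instances: Sequence[str], batch_percent: int) -> List[List[str]]:
    """Split the fleet into consecutive batches of about batch_percent each."""
    if not 0 < batch_percent <= 100:
        raise ValueError("batch_percent must be in 1..100")
    # Integer ceiling division, with a floor of one instance per batch.
    size = max(1, -(-len(instances) * batch_percent // 100))
    return [list(instances[i:i + size]) for i in range(0, len(instances), size)]


def run_rollout(
    instances: Sequence[str],
    batch_percent: int,
    update_batch: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
    bake_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Update one batch, wait, evaluate health, then continue or stop.

    Returns (final_state, batches_updated).
    """
    updated = 0
    for batch in plan_batches(instances, batch_percent):
        update_batch(batch)
        updated += 1
        sleep(bake_seconds)  # bake time before judging the batch
        if not is_healthy():
            return ("failed", updated)
    return ("succeeded", updated)
```

Injecting `sleep` and `is_healthy` keeps the loop testable without waiting five minutes per batch.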

Step 5 — Data Model

Pipelines (configuration):

table pipelines
  pipeline_id    uuid     PK
  service_name   string
  source_repo    string
  envs           list<{name, region, cluster}>
  strategy       enum(rolling, blue_green, canary)
  approval_gates list<env>
  health_checks  list<HealthCheck>

Deploys (runtime instances):

table deploys
  deploy_id      uuid       PK
  pipeline_id    uuid
  artifact_id    string     // SHA of the artifact
  current_env    string
  state          enum(pending, in_progress, paused, succeeded, failed, rolled_back)
  started_at     timestamp
  completed_at   timestamp?
  batches        list<Batch>     // each batch's instance count, status
  metrics        json            // health snapshots per batch

Artifacts:

table artifacts
  artifact_id   string     PK   // content hash
  service       string
  git_sha       string
  built_at      timestamp
  size_bytes    bigint
  blob_uri      string         // pointer into object store
  metadata      json            // build env, dependency versions

Health observations:

table deploy_health
  deploy_id     uuid
  batch_idx     int
  ts            timestamp
  metrics       json          // error_rate, p99_latency, saturation
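For readers who prefer typed records, the deploys table could be mirrored in application code roughly as follows. Field names follow the table; the `Batch` fields are an assumption based on the "instance count, status" comment, and the defaults (UTC timestamps, `PENDING` initial state) are illustrative choices.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional
from uuid import UUID, uuid4


class DeployState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    PAUSED = "paused"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    ROLLED_BACK = "rolled_back"


@dataclass
class Batch:
    instance_count: int
    status: str = "pending"  # assumed values: pending, updating, healthy, failed


@dataclass
class Deploy:
    pipeline_id: UUID
    artifact_id: str  # content hash of the artifact
    current_env: str
    deploy_id: UUID = field(default_factory=uuid4)
    state: DeployState = DeployState.PENDING
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    completed_at: Optional[datetime] = None
    batches: List[Batch] = field(default_factory=list)
    metrics: dict = field(default_factory=dict)  # health snapshots per batch
```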

Step 6 — Detailed Design

Deployment strategies

Rolling — replace N instances at a time with the new version. Cheap (no double capacity); slow rollback (must reverse-roll). Good for stateless services.

Blue/Green — stand up an entirely new fleet (green) alongside the old (blue), cut over traffic at the LB, keep blue warm for ~30 min for instant rollback. Expensive (2× capacity transiently); fast rollback (flip the LB back).

A third option: canary. Send a small fraction of traffic (1%, then 5%, then 25%) to the new version, watch metrics, abort or promote. Best blast-radius control; longer total deploy time.

Most production systems use canary inside a rolling deploy — each batch is a small percentage of total capacity, and the first few batches act as canaries before the rest follow.
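One way to picture "canary inside a rolling deploy" is as a ramp of cumulative percentages that is converted into per-step batch sizes. The sketch below assumes the 1% / 5% / 25% steps mentioned above plus a final 100% step; `ramp_batches` is a hypothetical helper, not part of any named tool.

```python
from typing import List, Sequence


def ramp_batches(fleet_size: int, cumulative_percent: Sequence[int]) -> List[int]:
    """Return how many instances each step adds to reach the cumulative targets."""
    batches: List[int] = []
    updated = 0
    for pct in cumulative_percent:
        # Ceiling of fleet_size * pct / 100, at least one instance, at most the fleet.
        target = min(fleet_size, max(1, -(-fleet_size * pct // 100)))
        added = target - updated
        if added > 0:  # skip steps that add nothing on small fleets
            batches.append(added)
            updated = target
    return batches


# 1,000 instances: the first batches are small canaries, the last one is the bulk.
print(ramp_batches(1000, [1, 5, 25, 100]))  # [10, 40, 200, 750]
```

On a small fleet some steps collapse: with 20 instances, the 1% and 5% targets both round to a single instance, so the ramp becomes three batches.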

The deploy state machine

PENDING → BUILDING → ARTIFACT_READY → DEPLOYING_BATCH_1 → HEALTH_OK?
                                                              │
                                                              ├─ yes → DEPLOYING_BATCH_2 → ...
                                                              │
                                                              └─ no → ROLLING_BACK → ROLLED_BACK

Each transition is durable. A crash of the deploy orchestrator must be recoverable — the next instance picks up the deploy at the last persisted state. Implementation: a workflow engine (Temporal, Argo Workflows, or an in-house equivalent).
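To make "each transition is durable" concrete, here is a minimal sketch that validates transitions against a table and commits each one to SQLite before returning. It stands in for what a workflow engine does internally and is not how Temporal or Argo Workflows are implemented. The transition table simplifies the diagram into a single `DEPLOYING` state with a batch counter, and the `SUCCEEDED` terminal state and the `DeployStore` class are assumptions.

```python
import sqlite3
from typing import Tuple

# Allowed transitions, simplified from the diagram above.
TRANSITIONS = {
    "PENDING": {"BUILDING"},
    "BUILDING": {"ARTIFACT_READY"},
    "ARTIFACT_READY": {"DEPLOYING"},
    "DEPLOYING": {"DEPLOYING", "SUCCEEDED", "ROLLING_BACK"},
    "ROLLING_BACK": {"ROLLED_BACK"},
}


class DeployStore:
    """Persists the current state of each deploy so a restart can resume it."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS deploy_state ("
            "deploy_id TEXT PRIMARY KEY, state TEXT NOT NULL, batch_idx INTEGER NOT NULL)"
        )
        self.db.commit()

    def load(self, deploy_id: str) -> Tuple[str, int]:
        row = self.db.execute(
            "SELECT state, batch_idx FROM deploy_state WHERE deploy_id = ?",
            (deploy_id,),
        ).fetchone()
        return row if row else ("PENDING", 0)

    def transition(self, deploy_id: str, new_state: str, batch_idx: int = 0) -> None:
        state, _ = self.load(deploy_id)
        if new_state not in TRANSITIONS.get(state, set()):
            raise ValueError(f"illegal transition {state} -> {new_state}")
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO deploy_state VALUES (?, ?, ?)",
                (deploy_id, new_state, batch_idx),
            )
```

A replacement orchestrator instance would call `load` on startup and continue from the returned state and batch index.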

Health monitoring and auto-rollback

After each batch, the orchestrator queries:

Error rate: 5xx rate per service, compared to pre-deploy baseline.
Latency: p99 latency vs baseline.
Saturation: CPU, memory, connection pool — early signal of capacity regressions.
Business metrics: optional, e.g., conversion rate, signup success.

Each metric has a delta threshold (no more than +20% error rate, no more than +50ms p99). If any check fails for K consecutive samples in a batch, abort and roll back.
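The threshold rule can be sketched as follows. This is one reading of it, with assumptions stated plainly: the +20% error-rate limit is treated as relative to baseline, the +50 ms limit as absolute, and a sample counts as failing if any check breaches. The `Sample` type, the function names, and the default `k=3` are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    error_rate: float  # fraction of requests returning 5xx
    p99_ms: float      # p99 latency in milliseconds


def breaches(
    sample: Sample,
    baseline: Sample,
    max_error_increase: float = 0.20,
    max_p99_increase_ms: float = 50.0,
) -> bool:
    """True if the sample exceeds either delta threshold relative to baseline."""
    error_limit = baseline.error_rate * (1 + max_error_increase)
    latency_limit = baseline.p99_ms + max_p99_increase_ms
    return sample.error_rate > error_limit or sample.p99_ms > latency_limit


def should_roll_back(samples: Iterable[Sample], baseline: Sample, k: int = 3) -> bool:
    """True once k consecutive samples breach a threshold."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if breaches(sample, baseline) else 0
        if streak >= k:
            return True
    return False
```

Requiring consecutive failures filters out single noisy samples. Note that with a relative threshold, a baseline error rate of zero makes any error a breach, so a small absolute floor may be worth adding in practice.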

Rollback path

For rolling: the previous artifact_id is the rollback target. Re-deploy it with the same strategy, in reverse order so the most-recently-updated batches roll first.

For blue/green: flip the LB weight back to the old environment. Already running; just a config change.

For canary: stop the canary, redirect 100% to the old version.
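The three rollback paths can be summarised as a dispatch on strategy. The sketch below only produces a human-readable plan; `rollback_plan` and the step wording are hypothetical, and real steps would be calls into the cluster and load balancer. For canary, it shifts traffic away before stopping instances, which is a small ordering choice on top of the description above.

```python
from typing import List


def rollback_plan(
    strategy: str,
    updated_batches: List[List[str]],
    previous_artifact: str,
) -> List[str]:
    """Return the ordered rollback steps for a deploy, as plain descriptions."""
    if strategy == "rolling":
        # Most recently updated batches are rolled back first.
        return [
            f"deploy {previous_artifact} to {', '.join(batch)}"
            for batch in reversed(updated_batches)
        ]
    if strategy == "blue_green":
        # The old fleet (blue) is still running, so this is only a config change.
        return ["set load balancer weights: blue=100, green=0"]
    if strategy == "canary":
        return ["set canary traffic weight to 0", "stop canary instances"]
    raise ValueError(f"unknown strategy: {strategy}")
```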

Cross-region rollouts

Production usually has ~5-20 regions. Deploys promote through them with a strategy:

1. canary-region (e.g., us-east-2 with 5% of global traffic): 30 min bake.
2. one-per-continent: parallel deploys to us-west-2, eu-west-1, ap-southeast-1.
3. remaining regions: parallel.

The promotion logic is itself a config in the pipeline. The orchestrator enforces inter-region gates.
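A promotion config and its gate enforcement might look like the sketch below. The wave names and the first two region lists come from the strategy above; the regions in the last wave are placeholders, the bake times after the canary wave are assumptions, and `promote`, `deploy_region`, and `gate_passed` are hypothetical names.

```python
from typing import Callable, Dict, List

WAVES: List[Dict] = [
    {"name": "canary-region", "regions": ["us-east-2"], "bake_minutes": 30},
    {
        "name": "one-per-continent",
        "regions": ["us-west-2", "eu-west-1", "ap-southeast-1"],
        "bake_minutes": 0,  # assumption: not specified in the strategy
    },
    {
        "name": "remaining",
        "regions": ["eu-central-1", "ap-northeast-1"],  # placeholder list
        "bake_minutes": 0,
    },
]


def promote(
    waves: List[Dict],
    deploy_region: Callable[[str], None],
    gate_passed: Callable[[str], bool],
) -> List[str]:
    """Deploy wave by wave; stop promoting as soon as a wave's gate fails."""
    deployed: List[str] = []
    for wave in waves:
        # A real orchestrator would run these in parallel; sequential here for clarity.
        for region in wave["regions"]:
            deploy_region(region)
            deployed.append(region)
        if not gate_passed(wave["name"]):
            break
    return deployed
```

The gate callback is where bake time, health evaluation, and any manual approval would be checked before the next wave starts.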

Artifact distribution

For a global deploy, the artifact must be in every region’s container registry / blob store. Push artifacts to a primary store and replicate them to regional caches when the artifact is first built, before any deploy needs it. Deploys pull from the local regional cache.

A naive central registry can saturate during a global rollout (1000 instances × 500 MB artifact = 500 GB of pulls from one origin in ~5 minutes). Regional caches absorb this.
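The saturation argument is easier to see as a sustained rate. The sketch below converts the figures above into origin egress, with and without regional caches. It assumes decimal units, an even spread of pulls over the window, and that each cache fetches the artifact from the origin exactly once; the cache count of 10 is illustrative.

```python
def origin_gbit_per_sec(
    instances: int,
    artifact_mb: float,
    window_seconds: float,
    regional_caches: int = 0,
) -> float:
    """Sustained origin egress in Gbit/s for one rollout.

    With regional caches, only the caches pull from the origin (once each).
    """
    origin_pulls = regional_caches if regional_caches else instances
    total_bits = origin_pulls * artifact_mb * 8 * 1_000_000
    return total_bits / window_seconds / 1e9


direct = origin_gbit_per_sec(1000, 500, 5 * 60)
cached = origin_gbit_per_sec(1000, 500, 5 * 60, regional_caches=10)
print(f"direct: {direct:.1f} Gbit/s, with 10 caches: {cached:.2f} Gbit/s")
```

At roughly 13 Gbit/s sustained for a single 1,000-instance rollout, a lone origin has little headroom once several deploys overlap.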

Multi-service deploys

Microservices have dependencies: deploy A before B before C. The orchestrator reads a service dependency DAG and respects it:

deploy(service A) blocks deploy(service B) where B depends on A.

In practice, contract evolution (API backwards-compat for one version) lets the order be looser, but the strict-DAG case is the worst case.
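The strict-DAG case maps directly onto a topological sort. The sketch below uses Python's standard `graphlib` module (3.9+) to group services into waves that can deploy in parallel; the service names and the `deploy_waves` helper are illustrative.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def deploy_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves; each wave depends only on earlier waves.

    depends_on maps a service to the services that must deploy before it.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# B and D depend on A; C depends on B.
print(deploy_waves({"B": {"A"}, "C": {"B"}, "D": {"A"}}))
# [['A'], ['B', 'D'], ['C']]
```

In a real orchestrator, `done` would be called as each service's deploy succeeds rather than per wave, so independent branches do not wait on each other.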

Audit and reproducibility

Every deploy persists:

git SHA of the source.
artifact_id (content hash) of the binary.
Config snapshot at deploy time.
Deployer (user / service account).
Approval chain (who approved promotion).

Auditors can re-derive “what was running at time T” from this log.
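Answering "what was running at time T" amounts to finding the latest completed deploy at or before T. The sketch below assumes each audit record marks the moment a deploy (or rollback) finished reaching all instances; the `AuditRecord` fields are a subset of the list above, and the type and function names are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuditRecord:
    completed_at: float  # epoch seconds when the deploy finished
    service: str
    git_sha: str
    artifact_id: str
    deployer: str


def running_at(log: List[AuditRecord], service: str, t: float) -> Optional[AuditRecord]:
    """Return the most recent completed deploy of `service` at or before time t."""
    records = sorted(
        (r for r in log if r.service == service),
        key=lambda r: r.completed_at,
    )
    idx = bisect_right([r.completed_at for r in records], t)
    return records[idx - 1] if idx else None
```

During a rollout the fleet runs a mix of versions, so a stricter audit would also consult per-batch timestamps for the interval between a deploy's start and completion.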

Step 7 — Evaluation & Trade-offs

Bottleneck #1: artifact-pull bandwidth during global rollouts. A 500 MB artifact pulled by 10 K instances simultaneously is 5 TB of transfer within minutes. Regional caches are mandatory; per-instance image-layer dedup (Docker-style, where unchanged layers are not pulled again) can reduce per-pull bytes by roughly 10× for incremental updates.

Bottleneck #2: health-check fan-in. The orchestrator polls health for every deploy. 100 active deploys × 10 batches × one query per second = 1,000 queries/sec to the metrics system. That is trivial in isolation, but it grows with every concurrent deploy and every extra metric; push-based health (the cluster signals the orchestrator) scales better than pull.

Bottleneck #3: workflow engine durability. Persisting every state transition for every deploy is significant write load. Use a workflow engine with batched writes (Temporal) or a queue-backed state machine. Don’t persist per-instance updates; per-batch granularity is the finest that should be written.

Alternative I’d push back on: “deploy by SSH to each instance and run a script”. Works at single-team scale; doesn’t scale to multi-region, doesn’t audit, doesn’t auto-rollback. Even small companies need a real CD system from day one; trying to grow into one mid-flight is painful.

What breaks first at 10× scale: the orchestrator’s per-deploy state. At 1,000 concurrent deploys, even durable workflow engines start to need sharded leadership. Per-pipeline sharding is the natural seam; co-locate per-service deploys on a single shard so dependency-DAG enforcement stays simple.
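A simple way to co-locate all deploys of a service on one shard is to hash the service name. The sketch below uses a stable hash so that every orchestrator process computes the same assignment; the shard count and the `shard_for` name are illustrative.

```python
import hashlib


def shard_for(service_name: str, num_shards: int) -> int:
    """Map a service to a shard, identically across processes and restarts."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # hashlib is used instead of hash(), which is salted per process for strings.
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every deploy of "checkout" lands on the same shard.
print(shard_for("checkout", 16))
```

Plain modulo hashing moves most services when the shard count changes; consistent hashing or an explicit assignment table would reduce that churn, at the cost of more machinery.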

Companies this resembles

Spinnaker (Netflix’s open-source CD), Argo CD / Argo Rollouts (Kubernetes-native), GitLab CI/CD, AWS CodeDeploy, internal systems at Google (Rapid), Meta (Conveyor), Stripe (Henson). Cousins: Terraform Cloud / Atlantis for infra deploys, kubectl rollout for raw k8s.

Distributed Task Scheduler — the workflow engine substrate.
Distributed Monitoring — health observations the auto-rollback relies on.
Blob Store — artifact storage substrate.
AWS Kinesis — 2020 us-east-1 outage — postmortem of what happens when control-plane deploys go wrong.