Code Deployment System

Pipelines, artifact stores, environment promotion, canary / blue-green / rolling, rollback.

Companies this resembles: Spinnaker · Argo CD · GitLab · AWS CodeDeploy

Step 1 — Clarify Requirements

Functional

  • A developer pushes code; the system builds an artifact and ships it to production through staged environments.
  • Multiple deployment strategies: rolling, blue/green, canary, feature-flag.
  • Automated rollback on health degradation.
  • Approval gates between environments where required.
  • Multi-service deployments with topological dependency awareness.
  • Out of scope here: the actual build/CI runners; secret management (assumed available).

Non-functional

  • 99.9% availability — outage of the deployment system blocks releases but doesn’t break production. Cross-region active-active is a stretch goal.
  • p99 deploy duration: full production rollout under 30 minutes for a service of ~1000 instances.
  • Rollback within 5 minutes from “decided to roll back” to “all instances on prior version”.
  • Auditability: every deploy must be reproducible from git SHA + config snapshot.

Step 2 — Capacity Estimation

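  • Deploys per day: a large org (say, ~1,000 engineers) ships ~5,000 deploys/day across ~500 services → ~3.5 deploys/min averaged over 24 hours (5,000 / 1,440 min), or ~10/min if concentrated in an 8-hour working day, with bursts budgeted at ~200/min peak.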
  • Artifacts stored: 5,000/day × 1 year retention × ~500 MB each = ~900 TB/year of artifacts.
  • Concurrent active deploys: ~100 at peak; each touches ~50-1000 instances.
  • Instance updates/sec at peak: 100 deploys × 10 batches × 100 instances / 5 min interval (300 s) = ~330 instance starts/sec. (Each is a container pull + start; the artifact CDN matters.)
  • Pipeline state: ~10 KB per pipeline × 100 active = ~1 MB live; historical ~1 TB/year.
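The estimates can be recomputed from the stated inputs. The sketch below does only that arithmetic; the constant names are illustrative, and the instance-start figure follows the formula exactly as written in the list (all batches of all concurrent deploys spread over one 5-minute interval).

```python
# Inputs taken from the capacity estimate; names are illustrative.
DEPLOYS_PER_DAY = 5_000
ARTIFACT_MB = 500
RETENTION_DAYS = 365
CONCURRENT_DEPLOYS = 100
BATCHES_PER_DEPLOY = 10
INSTANCES_PER_BATCH = 100
BATCH_INTERVAL_S = 5 * 60
PIPELINE_STATE_KB = 10


def capacity_estimates() -> dict:
    """Return the back-of-envelope figures derived from the constants above."""
    return {
        # 5,000 deploys spread evenly over 1,440 minutes.
        "avg_deploys_per_min": DEPLOYS_PER_DAY / (24 * 60),
        # MB -> TB using decimal units (1 TB = 1,000,000 MB).
        "artifact_tb_per_year": DEPLOYS_PER_DAY * RETENTION_DAYS * ARTIFACT_MB / 1_000_000,
        "instance_starts_per_s": (
            CONCURRENT_DEPLOYS * BATCHES_PER_DEPLOY * INSTANCES_PER_BATCH / BATCH_INTERVAL_S
        ),
        "live_state_mb": CONCURRENT_DEPLOYS * PIPELINE_STATE_KB / 1_000,
    }


if __name__ == "__main__":
    for name, value in capacity_estimates().items():
        print(f"{name}: {value:,.1f}")
```

The 24-hour average is a lower bound on real load; deploys cluster in working hours, so the peak figure is the one to size for.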

Scale is modest in raw bytes; the hard parts are coordination, correctness, and safety.

Step 3 — System Interface

POST /pipelines { source_repo, target_envs, strategy }
POST /deploys { pipeline_id, artifact_id, strategy_override? }
POST /rollback { deploy_id }
GET /deploys/:id (status, instances, health)
GET /pipelines/:id/runs?limit=20
POST /webhooks/health (called by health-checker; triggers auto-rollback)
POST /webhooks/scm (git push triggers pipeline)
GET /audit?service=X&since=... (full deployment history)

The system is event-driven: SCM pushes trigger pipelines; pipelines emit events; deploys watch events and progress.
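As an illustration of the interface, a client might assemble the body for POST /deploys like this. It is a minimal sketch: the field names come from the endpoint list, while the `DeployRequest` class, the validation rule, and the example IDs are assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Strategy names mirror the enum used in the pipelines table.
STRATEGIES = {"rolling", "blue_green", "canary"}


@dataclass(frozen=True)
class DeployRequest:
    """Hypothetical client-side model of the POST /deploys body."""
    pipeline_id: str
    artifact_id: str
    strategy_override: Optional[str] = None

    def to_json(self) -> str:
        if self.strategy_override is not None and self.strategy_override not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy_override}")
        body = {"pipeline_id": self.pipeline_id, "artifact_id": self.artifact_id}
        # The override is optional, so it is omitted rather than sent as null.
        if self.strategy_override is not None:
            body["strategy_override"] = self.strategy_override
        return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print(DeployRequest("pipeline-123", "sha256:abc", "canary").to_json())
```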

Step 4 — High-Level Design

developer git push
SCM webhook ─→ pipeline orchestrator ─→ build farm ─→ artifact store
                         │                                  │
                         ▼                                  ▼
                deploy orchestrator          artifact CDN (regional caches)
                         │                                  ▲
                         ▼                                  │
     target environments (per-region clusters) ─────────────┘
health monitor ─→ auto-rollback decisions
audit + state store

The deploy orchestrator is the moving piece. It owns:

  • The state machine for “is this deploy in progress, paused, succeeded, failed”.
  • The batching policy (“update 5% of instances, wait 5 min, evaluate health, then continue”).
  • The rollback decision logic.
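The batching policy can be sketched as a small planner that turns "update N% at a time" into concrete batch sizes. This is an assumed helper (`plan_batches` is not part of any named system), shown only to make the policy concrete.

```python
import math


def plan_batches(total_instances: int, batch_pct: float) -> list[int]:
    """Split a fleet into batches of roughly batch_pct percent each."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if not 0 < batch_pct <= 100:
        raise ValueError("batch_pct must be in (0, 100]")
    # Round up so that small fleets still make progress with every batch.
    size = max(1, math.ceil(total_instances * batch_pct / 100))
    batches = []
    remaining = total_instances
    while remaining > 0:
        step = min(size, remaining)
        batches.append(step)
        remaining -= step
    return batches


if __name__ == "__main__":
    batches = plan_batches(1000, 5)
    print(len(batches), "batches of", batches[0])
```

For a 1,000-instance service, 5% batches give 20 batches of 50. With a 5-minute wait after each one, that is at least 100 minutes of bake time, so meeting the 30-minute rollout target from Step 1 would need larger batches or shorter waits once the early batches look healthy.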

Step 5 — Data Model

Pipelines (configuration):

table pipelines
  pipeline_id     uuid PK
  service_name    string
  source_repo     string
  envs            list<{name, region, cluster}>
  strategy        enum(rolling, blue_green, canary)
  approval_gates  list<env>
  health_checks   list<HealthCheck>

Deploys (runtime instances):

table deploys
  deploy_id     uuid PK
  pipeline_id   uuid
  artifact_id   string        // SHA of the artifact
  current_env   string
  state         enum(pending, in_progress, paused, succeeded, failed, rolled_back)
  started_at    timestamp
  completed_at  timestamp?
  batches       list<Batch>   // each batch's instance count, status
  metrics       json          // health snapshots per batch

Artifacts:

table artifacts
  artifact_id  string PK   // content hash
  service      string
  git_sha      string
  built_at     timestamp
  size_bytes   bigint
  blob_uri     string      // pointer into object store
  metadata     json        // build env, dependency versions
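Because artifact_id is a content hash, it can be derived from the artifact bytes alone. The sketch below assumes SHA-256 (the schema does not name an algorithm) and streams the blob in chunks so a ~500 MB artifact never has to fit in memory; `artifact_id_for` is a hypothetical helper.

```python
import hashlib
import io
from typing import BinaryIO


def artifact_id_for(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()  # assumption: SHA-256 as the content hash
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(artifact_id_for(io.BytesIO(b"example artifact bytes")))
```

Two builds that produce identical bytes get the same ID, which is what makes the artifact store deduplicate naturally and makes "redeploy the previous artifact_id" unambiguous.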

Health observations:

table deploy_health
  deploy_id  uuid
  batch_idx  int
  ts         timestamp
  metrics    json        // error_rate, p99_latency, saturation

Step 6 — Detailed Design

Deployment strategies

Rolling — replace N instances at a time with the new version. Cheap (no double capacity); slow rollback (must reverse-roll). Good for stateless services.
Blue/Green — stand up an entirely new fleet (green) alongside the old (blue), cut over traffic at the LB, keep blue warm for ~30 min for instant rollback. Expensive (2× capacity transiently); fast rollback (flip the LB back).

A third option: canary. Send a small fraction of traffic (1%, then 5%, then 25%) to the new version, watch metrics, abort or promote. Best blast-radius control; longer total deploy time.

Most production systems use canary inside a rolling deploy — each batch is a small percentage of total capacity, and the first few batches act as canaries before the rest follow.
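One way to picture canary inside a rolling deploy is a ramp: the cumulative share of the fleet on the new version steps through 1%, 5%, 25%, then 100%. The helper below is an illustrative sketch; `ramp_batches` and the default percentages (taken from the canary description, plus a final 100%) are assumptions.

```python
import math


def ramp_batches(total: int, cumulative_pcts=(1, 5, 25, 100)) -> list[int]:
    """Convert cumulative rollout percentages into per-batch instance counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if list(cumulative_pcts) != sorted(set(cumulative_pcts)) or cumulative_pcts[-1] != 100:
        raise ValueError("percentages must be strictly increasing and end at 100")
    batches, done = [], 0
    for pct in cumulative_pcts:
        # Round up so the first canary step is at least one instance.
        target = min(total, math.ceil(total * pct / 100))
        if target > done:  # skip steps that add nothing on tiny fleets
            batches.append(target - done)
            done = target
    return batches


if __name__ == "__main__":
    print(ramp_batches(1000))
```

The early batches are deliberately small, so a bad release is caught while it serves only a sliver of capacity; the last batch is large because by then the health signal has already been collected.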

The deploy state machine

PENDING → BUILDING → ARTIFACT_READY → DEPLOYING_BATCH_1 → HEALTH_OK?
                                                          ├─ yes → DEPLOYING_BATCH_2 → ...
                                                          └─ no  → ROLLING_BACK → ROLLED_BACK

Each transition is durable. A crash of the deploy orchestrator must be recoverable — the next instance picks up the deploy at the last persisted state. Implementation: a workflow engine (Temporal, Argo Workflows, or an in-house equivalent).
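To make "durable transitions" concrete, here is a toy version that journals each transition to an append-only file before applying it in memory, and replays the journal on startup. It is a sketch under simplifying assumptions, not how Temporal or Argo Workflows work internally: a local file stands in for the state store, the per-batch states are collapsed into one DEPLOYING state with a batch index, and SUCCEEDED is added as a terminal state.

```python
import json
import os

# Allowed transitions; DEPLOYING -> DEPLOYING models moving to the next batch.
TRANSITIONS = {
    "PENDING": {"BUILDING"},
    "BUILDING": {"ARTIFACT_READY"},
    "ARTIFACT_READY": {"DEPLOYING"},
    "DEPLOYING": {"DEPLOYING", "SUCCEEDED", "ROLLING_BACK"},
    "ROLLING_BACK": {"ROLLED_BACK"},
    "SUCCEEDED": set(),
    "ROLLED_BACK": set(),
}


class DurableDeploy:
    def __init__(self, journal_path: str):
        self.journal_path = journal_path
        self.state, self.batch = "PENDING", 0
        # Recovery: replay the journal so a new process resumes at the last state.
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.state, self.batch = record["state"], record["batch"]

    def transition(self, new_state: str, batch=None) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        batch = self.batch if batch is None else batch
        # Persist first, then update memory: a crash in between loses nothing.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps({"state": new_state, "batch": batch}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state, self.batch = new_state, batch
```

A second `DurableDeploy` opened on the same journal path starts in whatever state the first one last persisted, which is the recovery property the orchestrator needs.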

Health monitoring and auto-rollback

After each batch, the orchestrator queries:

  • Error rate: 5xx rate per service, compared to pre-deploy baseline.
  • Latency: p99 latency vs baseline.
  • Saturation: CPU, memory, connection pool — early signal of capacity regressions.
  • Business metrics: optional, e.g., conversion rate, signup success.

Each metric has a delta threshold relative to the pre-deploy baseline (for example, error rate no more than 20% above baseline, p99 no more than 50 ms above baseline). If any check fails for K consecutive samples in a batch, abort and roll back.
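The rollback rule can be sketched as follows. Two assumptions are baked in: the +20% error-rate threshold is read as relative to the baseline, and "K consecutive samples" is tracked per check, so alternating failures of different checks do not trigger a rollback. The names `Sample`, `failed_checks`, and `should_rollback` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    error_rate: float  # fraction of requests returning 5xx
    p99_ms: float      # p99 latency in milliseconds


def failed_checks(sample: Sample, baseline: Sample,
                  max_error_increase: float = 0.20,
                  max_p99_delta_ms: float = 50.0) -> set[str]:
    """Return the names of the checks this sample violates."""
    failed = set()
    if sample.error_rate > baseline.error_rate * (1 + max_error_increase):
        failed.add("error_rate")
    if sample.p99_ms > baseline.p99_ms + max_p99_delta_ms:
        failed.add("p99_latency")
    return failed


def should_rollback(samples: list[Sample], baseline: Sample, k: int = 3) -> bool:
    """True once any single check has failed k samples in a row."""
    streaks: dict[str, int] = {}
    for sample in samples:
        failed = failed_checks(sample, baseline)
        # Checks that passed this sample drop out, which resets their streak.
        streaks = {name: streaks.get(name, 0) + 1 for name in failed}
        if any(count >= k for count in streaks.values()):
            return True
    return False
```

Requiring K consecutive failures trades detection speed for fewer false rollbacks on a single noisy sample. A purely relative error threshold also misbehaves when the baseline error rate is near zero, so a real policy would likely add an absolute floor.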

Rollback path

For rolling: the previous artifact_id is the rollback target. Re-deploy it with the same strategy, in reverse order so the most-recently-updated batches roll first.

For blue/green: flip the LB weight back to the old environment. The old fleet is still running, so rollback is just a config change.

For canary: stop the canary, redirect 100% to the old version.
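The three rollback paths can be expressed as a function that turns a strategy into an ordered list of actions. This is a sketch only: the action names (`redeploy`, `set_lb_weights`, `set_traffic_split`, `stop_canary`) are hypothetical placeholders for whatever the target platform exposes.

```python
def rollback_plan(strategy: str, updated_batches: list[int],
                  previous_artifact_id: str) -> list[tuple]:
    """Return the ordered actions needed to return to the previous version."""
    if strategy == "rolling":
        # Most recently updated batches roll back first.
        return [("redeploy", batch, previous_artifact_id)
                for batch in reversed(updated_batches)]
    if strategy == "blue_green":
        # Blue (old) is still warm, so this is a single weight change.
        return [("set_lb_weights", {"blue": 100, "green": 0})]
    if strategy == "canary":
        # Move traffic away before stopping the canary instances.
        return [("set_traffic_split", {"stable": 100, "canary": 0}),
                ("stop_canary",)]
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for action in rollback_plan("rolling", [1, 2, 3], "sha256:prev"):
        print(action)
```

The length of each plan shows why the 5-minute rollback target is easy for blue/green and canary but tight for rolling, where the work grows with the number of batches already updated.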

Cross-region rollouts

Production usually has ~5-20 regions. Deploys promote through them with a strategy:

1. canary-region (e.g., us-east-2 with 5% of global traffic): 30 min bake.
2. one-per-continent: parallel deploys to us-west-2, eu-west-1, ap-southeast-1.
3. remaining regions: parallel.

The promotion logic is itself a config in the pipeline. The orchestrator enforces inter-region gates.
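A promotion config and its gate might look like the sketch below. The first two waves use the regions named above; the regions in the last wave, the `Wave` type, and `deployable_regions` are assumptions for illustration, and time is plain minutes to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wave:
    name: str
    regions: tuple[str, ...]  # assumed non-empty
    bake_minutes: int = 0


WAVES = [
    Wave("canary-region", ("us-east-2",), bake_minutes=30),
    Wave("one-per-continent", ("us-west-2", "eu-west-1", "ap-southeast-1")),
    Wave("remaining", ("eu-central-1", "ap-northeast-1")),  # illustrative regions
]


def deployable_regions(waves: list[Wave], completed_at: dict[str, float],
                       now: float) -> list[str]:
    """Regions allowed to deploy now; completed_at maps region -> finish minute."""
    for wave in waves:
        pending = [r for r in wave.regions if r not in completed_at]
        if pending:
            # Every earlier wave finished and baked, so this wave is open.
            return pending
        baked_until = max(completed_at[r] for r in wave.regions) + wave.bake_minutes
        if now < baked_until:
            return []  # gate closed: the finished wave is still baking
    return []  # everything is deployed
```

Regions within a wave are returned together, which matches the parallel deploys in steps 2 and 3; a failed health check would be handled by not recording the region as completed.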

Artifact distribution

For a global deploy, the artifact must be in every region’s container registry / blob store. Push artifacts to a primary store and replicate them to regional caches as soon as the build finishes, so that deploys pull from the local regional cache.

A naive central registry can saturate during a global rollout (1000 instances × 500 MB artifact = 500 GB of pulls from one origin in ~5 minutes). Regional caches absorb this.
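The saturation claim is easy to check with arithmetic. The sketch below converts the stated pull volume into sustained origin throughput, then repeats the calculation with regional caches in front, assuming 10 regions that each fetch the artifact from the origin once (the region count is an assumption, within the 5-20 range mentioned for production).

```python
def origin_gbps(pulls_from_origin: int, artifact_mb: float, window_s: float) -> float:
    """Sustained origin throughput in Gbit/s (decimal units)."""
    megabits = pulls_from_origin * artifact_mb * 8
    return megabits / 1_000 / window_s


if __name__ == "__main__":
    window = 5 * 60
    # Every instance pulls straight from the central registry.
    print(f"no caches:   {origin_gbps(1_000, 500, window):.1f} Gbit/s")
    # Assumption: 10 regional caches, each filled once from the origin.
    print(f"with caches: {origin_gbps(10, 500, window):.2f} Gbit/s")
```

Without caches the origin has to sustain roughly 13 Gbit/s for a single service's rollout; with caches the origin load depends on the number of regions rather than the number of instances.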

Multi-service deploys

Microservices have dependencies: deploy A before B before C. The orchestrator reads a service dependency DAG and respects it:

deploy(service A) blocks deploy(service B) where B depends on A.

In practice, contract evolution (API backwards-compat for one version) lets the order be looser, but the strict-DAG case is the worst case.
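Respecting the DAG amounts to a topological sort. A minimal sketch using Python's standard-library `graphlib` (3.9+) groups services into waves that can deploy in parallel; the `deploy_waves` helper and the example graph are illustrative.

```python
from graphlib import CycleError, TopologicalSorter


def deploy_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave depends only on earlier waves.

    depends_on maps a service to the services that must be deployed before it.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # B depends on A, C depends on B, D depends on A.
    print(deploy_waves({"B": {"A"}, "C": {"B"}, "D": {"A"}}))
```

Services in the same wave have no ordering constraint between them, so the strict-DAG case costs as many sequential steps as the longest dependency chain, not the number of services.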

Audit and reproducibility

Every deploy persists:

  • git SHA of the source.
  • artifact_id (content hash) of the binary.
  • Config snapshot at deploy time.
  • Deployer (user / service account).
  • Approval chain (who approved promotion).

Auditors can re-derive “what was running at time T” from this log.
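Answering "what was running at time T" is a lookup for the latest completed deploy at or before T. The sketch assumes the log has already been filtered to one service and environment, is sorted by completion time, and records rollbacks as entries of their own; the field names and sample values are illustrative.

```python
from bisect import bisect_right
from typing import Optional


def running_at(audit_log: list[dict], t: float) -> Optional[dict]:
    """Return the audit entry in effect at time t, or None if none existed yet.

    audit_log must be sorted by 'completed_at' (epoch seconds here).
    """
    times = [entry["completed_at"] for entry in audit_log]
    index = bisect_right(times, t)  # entries completed at or before t
    return audit_log[index - 1] if index > 0 else None


if __name__ == "__main__":
    log = [
        {"completed_at": 100, "artifact_id": "a1", "git_sha": "1111111"},
        {"completed_at": 200, "artifact_id": "a2", "git_sha": "2222222"},
        # A rollback is logged like any other deploy, pointing at the old artifact.
        {"completed_at": 260, "artifact_id": "a1", "git_sha": "1111111"},
    ]
    print(running_at(log, 250)["artifact_id"])
```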

Step 7 — Evaluation & Trade-offs

Bottleneck #1: artifact-pull bandwidth during global rollouts. A 500 MB artifact pulled by 10 K instances simultaneously is 5 TB of transfer in minutes. Regional caches are mandatory; image-layer dedup (Docker-style), where an instance re-downloads only the layers that changed, can cut per-pull bytes by roughly 10× for incremental updates.

Bottleneck #2: health-check fan-in. The orchestrator polls health for every deploy. 100 active deploys × 10 batches × per-second health = 1,000 queries/sec to the metrics system. Trivial in isolation but adds up — push-based health (the cluster signals the orchestrator) scales better than pull.

Bottleneck #3: workflow engine durability. Persisting every state transition for every deploy is significant write load. Use a workflow engine with batched writes (Temporal) or a queue-backed state machine. Don’t persist per-instance updates; per-batch granularity is the finest the state store should record.

Alternative I’d push back on: “deploy by SSH to each instance and run a script”. Works at single-team scale; doesn’t scale to multi-region, doesn’t audit, doesn’t auto-rollback. Even small companies need a real CD system from day one; trying to grow into one mid-flight is painful.

What breaks first at 10× scale: the orchestrator’s per-deploy state. At 1,000 concurrent deploys, even durable workflow engines start to need sharded leadership. Per-pipeline sharding is the natural seam; co-locate per-service deploys on a single shard so dependency-DAG enforcement stays simple.
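Co-locating a service's deploys on one shard only needs a stable mapping from service name to shard. A minimal sketch, assuming a fixed shard count and SHA-256 as the hash (Python's built-in `hash()` is randomized per process, so it is not suitable here); `shard_for` is a hypothetical helper.

```python
import hashlib


def shard_for(service_name: str, num_shards: int) -> int:
    """Map a service to a shard so all of its deploys share one leader."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for service in ("checkout", "search", "payments"):
        print(service, "->", shard_for(service, 16))
```

Plain modulo hashing remaps most services when the shard count changes; consistent hashing or an explicit assignment table would reduce that churn. Services linked by the dependency DAG can still land on different shards, so cross-service ordering needs coordination beyond this mapping.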

Companies this resembles

Spinnaker (Netflix’s open-source CD), Argo CD / Argo Rollouts (Kubernetes-native), GitLab CI/CD, AWS CodeDeploy, internal systems at Google (Rapid), Meta (Conveyor), Stripe (Henson). Cousins: Terraform Cloud / Atlantis for infra deploys, kubectl rollout for raw k8s.
