Code Deployment System
Pipelines, artifact stores, environment promotion, canary / blue-green / rolling, rollback.
Step 1 — Clarify Requirements
Functional
- A developer pushes code; the system builds an artifact and ships it to production through staged environments.
- Multiple deployment strategies: rolling, blue/green, canary, feature-flag.
- Automated rollback on health degradation.
- Approval gates between environments where required.
- Multi-service deployments with topological dependency awareness.
- Out of scope here: the actual build/CI runners; secret management (assumed available).
Non-functional
- 99.9% availability — outage of the deployment system blocks releases but doesn’t break production. Cross-region active-active is a stretch goal.
- p99 deploy duration: full production rollout under 30 minutes for a service of ~1000 instances.
- Rollback within 5 minutes from “decided to roll back” to “all instances on prior version”.
- Auditability: every deploy must be reproducible from git SHA + config snapshot.
Step 2 — Capacity Estimation
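- Deploys per day: a large org (say, ~1,000 engineers) ships ~5,000 deploys/day across ~500 services → ~3.5 deploys/min on average (5,000 / 1,440 min). Deploys cluster in working hours, so assume a peak several times higher, on the order of tens per minute.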
- Artifacts stored: 5,000/day × 1 year retention × ~500 MB each = ~900 TB/year of artifacts.
- Concurrent active deploys: ~100 at peak; each touches ~50-1000 instances.
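- Instance updates/sec at peak: 100 deploys × 10 batches × 100 instances = 100,000 instance starts; packed into a single 5-minute interval that is ~330 instance starts/sec, an upper bound since batches normally run one interval apart. (Each is a container pull + start; the artifact CDN matters.)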
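- Pipeline state: ~10 KB per pipeline × 100 active = ~1 MB live. Historical state is ~18 GB/year at 10 KB per deploy (5,000/day × 365); budget ~1 TB/year if per-batch health snapshots and audit records are kept alongside it.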
Scale is modest in raw bytes; the hard parts are coordination, correctness, and safety.
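The estimates above can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch; the constants are the figures quoted in this section, and the variable names are illustrative.

```python
# Back-of-envelope capacity check; all inputs are the estimates quoted above.
DEPLOYS_PER_DAY = 5_000
ARTIFACT_MB = 500
RETENTION_DAYS = 365
PIPELINE_STATE_KB = 10
ACTIVE_DEPLOYS = 100

MINUTES_PER_DAY = 24 * 60

avg_deploys_per_min = DEPLOYS_PER_DAY / MINUTES_PER_DAY
# MB -> TB using decimal units (1 TB = 1,000,000 MB)
artifact_tb_per_year = DEPLOYS_PER_DAY * RETENTION_DAYS * ARTIFACT_MB / 1_000_000
live_state_mb = ACTIVE_DEPLOYS * PIPELINE_STATE_KB / 1_000

print(f"average deploys/min: {avg_deploys_per_min:.1f}")      # 3.5
print(f"artifact storage:    {artifact_tb_per_year:.1f} TB/year")  # 912.5
print(f"live pipeline state: {live_state_mb:.0f} MB")          # 1
```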
Step 3 — System Interface
POST /pipelines  { source_repo, target_envs, strategy }
POST /deploys    { pipeline_id, artifact_id, strategy_override? }
POST /rollback   { deploy_id }
GET /deploys/:id  (status, instances, health)
GET /pipelines/:id/runs?limit=20
POST /webhooks/health  (called by health-checker; triggers auto-rollback)
POST /webhooks/scm     (git push triggers pipeline)
GET /audit?service=X&since=...  (full deployment history)
The system is event-driven: SCM pushes trigger pipelines; pipelines emit events; deploys watch events and progress.
Step 4 — High-Level Design
developer git push
     │
     ▼
SCM webhook ─→ pipeline orchestrator ─→ build farm ─→ artifact store
                         │                                  │
                         ▼                                  ▼
                deploy orchestrator          artifact CDN (regional caches)
                         │                                  ▲
                         ▼                                  │
     target environments (per-region clusters) ─────────────┘
                         │
                         ▼
                  health monitor ─→ auto-rollback decisions
                         │
                         ▼
                audit + state store
The deploy orchestrator is the moving piece. It owns:
- The state machine for “is this deploy in progress, paused, succeeded, failed”.
- The batching policy (“update 5% of instances, wait 5 min, evaluate health, then continue”).
- The rollback decision logic.
Step 5 — Data Model
Pipelines (configuration):
table pipelines
  pipeline_id     uuid PK
  service_name    string
  source_repo     string
  envs            list<{name, region, cluster}>
  strategy        enum(rolling, blue_green, canary)
  approval_gates  list<env>
  health_checks   list<HealthCheck>
Deploys (runtime instances):
table deploys
  deploy_id     uuid PK
  pipeline_id   uuid
  artifact_id   string        // SHA of the artifact
  current_env   string
  state         enum(pending, in_progress, paused, succeeded, failed, rolled_back)
  started_at    timestamp
  completed_at  timestamp?
  batches       list<Batch>   // each batch's instance count, status
  metrics       json          // health snapshots per batch
Artifacts:
table artifacts
  artifact_id  string PK   // content hash
  service      string
  git_sha      string
  built_at     timestamp
  size_bytes   bigint
  blob_uri     string      // pointer into object store
  metadata     json        // build env, dependency versions
Health observations:
table deploy_health
  deploy_id  uuid
  batch_idx  int
  ts         timestamp
  metrics    json   // error_rate, p99_latency, saturation
Step 6 — Detailed Design
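Deployment strategies
Rolling: replace instances in batches, evaluating health between batches before continuing.
Blue/green: stand up the new version as a complete parallel environment, then flip load-balancer weight to it. The old environment keeps running, so rollback is a config change.
A third option: canary. Send a small fraction of traffic (1%, then 5%, then 25%) to the new version, watch metrics, abort or promote. Best blast-radius control; longer total deploy time.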
Most production systems use canary inside a rolling deploy — each batch is a small percentage of total capacity, and the first few batches act as canaries before the rest follow.
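A minimal sketch of how "canary inside a rolling deploy" might be turned into batch sizes, assuming the policy is expressed as cumulative percentages of the fleet. The function name `plan_batches` and the percentage-list format are illustrative assumptions, not part of any particular tool.

```python
import math

def plan_batches(total_instances: int, cumulative_pcts: list[float]) -> list[int]:
    """Return the number of instances to update in each batch.

    cumulative_pcts lists the share of the fleet that should be on the new
    version after each batch, e.g. [1, 5, 25, 100]. It must be strictly
    increasing and end at 100.
    """
    if not cumulative_pcts or cumulative_pcts[-1] != 100:
        raise ValueError("cumulative_pcts must end at 100")
    if any(b <= a for a, b in zip(cumulative_pcts, cumulative_pcts[1:])):
        raise ValueError("cumulative_pcts must be strictly increasing")
    batches, done = [], 0
    for pct in cumulative_pcts:
        # Round up so a 1% canary on a small fleet still gets one instance.
        target = min(total_instances, math.ceil(total_instances * pct / 100))
        batches.append(target - done)
        done = target
    return batches

print(plan_batches(1000, [1, 5, 25, 100]))  # [10, 40, 200, 750]
```

The first entries are the canary batches; on very small fleets a later batch can come out as zero, which a caller would simply skip.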
The deploy state machine
PENDING → BUILDING → ARTIFACT_READY → DEPLOYING_BATCH_1 → HEALTH_OK?
                                                              │
                                                              ├─ yes → DEPLOYING_BATCH_2 → ...
                                                              │
                                                              └─ no → ROLLING_BACK → ROLLED_BACK
Each transition is durable. A crash of the deploy orchestrator must be recoverable — the next instance picks up the deploy at the last persisted state. Implementation: a workflow engine (Temporal, Argo Workflows, or an in-house equivalent).
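To make "each transition is durable" concrete, here is a toy sketch that validates transitions against a table and persists the record before returning. It is not how Temporal or Argo Workflows work internally; the JSON-file store, the collapsed `DEPLOYING_BATCH` / `HEALTH_CHECK` states, and the `SUCCEEDED` terminal state are all simplifying assumptions.

```python
import json
from pathlib import Path

# Allowed transitions, simplified from the diagram: per-batch states are
# collapsed into DEPLOYING_BATCH / HEALTH_CHECK plus a batch counter.
# SUCCEEDED is an assumed terminal state not shown in the diagram.
TRANSITIONS = {
    "PENDING": {"BUILDING"},
    "BUILDING": {"ARTIFACT_READY"},
    "ARTIFACT_READY": {"DEPLOYING_BATCH"},
    "DEPLOYING_BATCH": {"HEALTH_CHECK"},
    "HEALTH_CHECK": {"DEPLOYING_BATCH", "SUCCEEDED", "ROLLING_BACK"},
    "ROLLING_BACK": {"ROLLED_BACK"},
    "SUCCEEDED": set(),
    "ROLLED_BACK": set(),
}

class DeployStateMachine:
    def __init__(self, deploy_id: str, store_dir: Path):
        self.path = store_dir / f"{deploy_id}.json"
        if self.path.exists():
            # Recovery path: a new orchestrator instance resumes from disk.
            self.record = json.loads(self.path.read_text())
        else:
            self.record = {"deploy_id": deploy_id, "state": "PENDING", "batch_idx": 0}
            self._persist()

    def _persist(self) -> None:
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a truncated record behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.record))
        tmp.replace(self.path)

    def transition(self, new_state: str) -> None:
        current = self.record["state"]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        if new_state == "DEPLOYING_BATCH":
            self.record["batch_idx"] += 1
        self.record["state"] = new_state
        self._persist()
```

Constructing a second `DeployStateMachine` with the same `deploy_id` and directory picks up at the last persisted state, which is the recovery behavior described above.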
Health monitoring and auto-rollback
After each batch, the orchestrator queries:
- Error rate: 5xx rate per service, compared to pre-deploy baseline.
- Latency: p99 latency vs baseline.
- Saturation: CPU, memory, connection pool — early signal of capacity regressions.
- Business metrics: optional, e.g., conversion rate, signup success.
Each metric has a delta threshold (no more than +20% error rate, no more than +50ms p99). If any check fails for K consecutive samples in a batch, abort and roll back.
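The threshold rule can be sketched as follows. This assumes "+20% error rate" means a relative increase over baseline and "+50 ms" an absolute one, and it counts a sample as failing if any check fails; tracking a separate streak per metric is an equally valid reading. `Sample`, `breaches`, and `should_roll_back` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    error_rate: float       # fraction of requests returning 5xx
    p99_latency_ms: float

def breaches(sample: Sample, baseline: Sample,
             max_error_increase: float = 0.20,
             max_p99_increase_ms: float = 50.0) -> bool:
    """True if the sample exceeds either delta threshold relative to baseline."""
    # Assumption: the error threshold is relative, the latency threshold absolute.
    error_limit = baseline.error_rate * (1 + max_error_increase)
    latency_limit = baseline.p99_latency_ms + max_p99_increase_ms
    return sample.error_rate > error_limit or sample.p99_latency_ms > latency_limit

def should_roll_back(samples: list[Sample], baseline: Sample, k: int = 3) -> bool:
    """True once k consecutive samples breach a threshold."""
    streak = 0
    for s in samples:
        streak = streak + 1 if breaches(s, baseline) else 0
        if streak >= k:
            return True
    return False
```

Requiring K consecutive failures trades detection speed for fewer false rollbacks on a single noisy sample.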
Rollback path
For rolling: the previous artifact_id is the rollback target. Re-deploy it with the same strategy, in reverse order so the most-recently-updated batches roll first.
For blue/green: flip the LB weight back to the old environment. It is still running, so this is just a config change.
For canary: stop the canary, redirect 100% to the old version.
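The three rollback paths can be summarized as a small planner that returns ordered steps. The action names (`redeploy_previous`, `set_lb_weights`, and so on) are hypothetical labels for illustration, not calls into any real load balancer or orchestrator API.

```python
def rollback_plan(strategy: str, completed_batches: list[list[str]]) -> list[dict]:
    """Return ordered rollback steps for a deploy.

    completed_batches holds instance IDs per batch, in the order the
    batches were updated. It is only used by the rolling strategy.
    """
    if strategy == "rolling":
        # Reverse order: the most recently updated batches roll back first.
        return [{"action": "redeploy_previous", "instances": batch}
                for batch in reversed(completed_batches)]
    if strategy == "blue_green":
        # The old environment is still running; only traffic weights change.
        return [{"action": "set_lb_weights", "old": 100, "new": 0}]
    if strategy == "canary":
        # This sketch shifts traffic away before stopping the canary so no
        # request is routed to an instance that is shutting down.
        return [{"action": "set_traffic_split", "old": 100, "new": 0},
                {"action": "stop_canary"}]
    raise ValueError(f"unknown strategy: {strategy}")

print(rollback_plan("rolling", [["i-1"], ["i-2", "i-3"]]))
```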
Cross-region rollouts
Production usually has ~5-20 regions. Deploys promote through them with a strategy:
1. canary-region (e.g., us-east-2 with 5% of global traffic): 30 min bake.
2. one-per-continent: parallel deploys to us-west-2, eu-west-1, ap-southeast-1.
3. remaining regions: parallel.
The promotion logic is itself a config in the pipeline. The orchestrator enforces inter-region gates.
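One way the staged promotion could be expressed as config plus a small driver is sketched below. The `STAGES` layout, the regions listed under "remaining", and the `deploy_region` / `gate_passed` callbacks are assumptions for illustration; a real orchestrator would run regions within a stage in parallel and wait out the bake time.

```python
from typing import Callable

# Hypothetical promotion config mirroring the three stages above.
# The "remaining" regions are placeholders, not taken from the text.
STAGES = [
    {"name": "canary-region", "regions": ["us-east-2"], "bake_minutes": 30},
    {"name": "one-per-continent",
     "regions": ["us-west-2", "eu-west-1", "ap-southeast-1"]},
    {"name": "remaining", "regions": ["us-east-1", "eu-central-1"]},
]

def promote(stages: list[dict],
            deploy_region: Callable[[str], bool],
            gate_passed: Callable[[dict], bool]) -> list[str]:
    """Deploy stage by stage; stop at the first failed region or gate.

    Returns the regions that were deployed successfully.
    """
    done: list[str] = []
    for stage in stages:
        # Regions within a stage are independent of each other; they are
        # run sequentially here only to keep the sketch short.
        results = {region: deploy_region(region) for region in stage["regions"]}
        done.extend(region for region, ok in results.items() if ok)
        if not all(results.values()) or not gate_passed(stage):
            break
    return done
```

For example, `promote(STAGES, lambda r: True, lambda s: s["name"] != "canary-region")` stops after `us-east-2` because the first inter-region gate fails.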
Artifact distribution
For a global deploy, the artifact must be in every region’s container registry / blob store. Push artifacts to a primary store; replicate to regional caches on first build. Deploys pull from the local regional cache.
A naive central registry can saturate during a global rollout (1000 instances × 500 MB artifact = 500 GB of pulls from one origin in ~5 minutes). Regional caches absorb this.
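The saturation argument is easy to quantify. The sketch below computes average egress from the central registry for the 1000-instance, 500 MB, 5-minute case, with and without regional caches. The 10-region figure is an assumed example, and the cached case models the origin serving one copy per region, which overstates load if replication already happened at build time.

```python
def origin_gbps(instances: int, artifact_mb: float, window_s: float,
                regions: int = 1, cached: bool = False) -> float:
    """Average egress from the central registry, in gigabits per second.

    Without caches the origin serves one copy per instance; with regional
    caches it serves at most one copy per region.
    """
    copies = regions if cached else instances
    total_bits = copies * artifact_mb * 1e6 * 8
    return total_bits / window_s / 1e9

print(f"{origin_gbps(1000, 500, 300):.2f} Gbps")                           # 13.33 Gbps
print(f"{origin_gbps(1000, 500, 300, regions=10, cached=True):.2f} Gbps")  # 0.13 Gbps
```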
Multi-service deploys
Microservices have dependencies: deploy A before B before C. The orchestrator reads a service dependency DAG and respects it:
deploy(service A) blocks deploy(service B) where B depends on A.
In practice, contract evolution (API backwards-compat for one version) lets the order be looser, but the strict-DAG case is the worst case.
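For the strict-DAG case, Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) can group services into waves that are safe to deploy in parallel. The `deploy_waves` helper and the example graph are illustrative assumptions.

```python
from graphlib import TopologicalSorter

def deploy_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave depends only on earlier waves.

    depends_on maps a service to the set of services it depends on.
    """
    ts = TopologicalSorter(depends_on)
    ts.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # sorted only for stable output
        waves.append(ready)
        ts.done(*ready)
    return waves

# B and D depend on A; C depends on B.
print(deploy_waves({"B": {"A"}, "C": {"B"}, "D": {"A"}}))
# [['A'], ['B', 'D'], ['C']]
```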
Audit and reproducibility
Every deploy persists:
- git SHA of the source.
- artifact_id (content hash) of the binary.
- Config snapshot at deploy time.
- Deployer (user / service account).
- Approval chain (who approved promotion).
Auditors can re-derive “what was running at time T” from this log.
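Answering "what was running at time T" reduces to finding the latest completed deploy at or before T. The sketch below assumes rollbacks are logged as ordinary deploy records of the prior artifact; `AuditRecord` and `running_at` are hypothetical names, and only a subset of the persisted fields is modeled.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class AuditRecord:
    service: str
    env: str
    artifact_id: str   # content hash of the binary
    git_sha: str
    completed_at: datetime

def running_at(log: list[AuditRecord], service: str, env: str,
               t: datetime) -> Optional[AuditRecord]:
    """Latest completed deploy of `service` in `env` at or before time t."""
    candidates = [r for r in log
                  if r.service == service and r.env == env and r.completed_at <= t]
    return max(candidates, key=lambda r: r.completed_at, default=None)
```

If no deploy had completed by time T, the function returns `None` rather than guessing.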
Step 7 — Evaluation & Trade-offs
Bottleneck #1: artifact-pull bandwidth during global rollouts. A 500 MB artifact pulled by 10 K instances simultaneously is 5 TB of transfer in minutes. Regional caches are mandatory; per-instance image-layer dedup (Docker-style) can cut per-pull bytes by roughly 10× for incremental updates.
Bottleneck #2: health-check fan-in. The orchestrator polls health for every deploy. 100 active deploys × 10 batches × per-second health = 1,000 queries/sec to the metrics system. Trivial in isolation but adds up — push-based health (the cluster signals the orchestrator) scales better than pull.
Bottleneck #3: workflow engine durability. Persisting every state transition for every deploy is significant write load. Use a workflow engine with batched writes (Temporal) or a queue-backed state machine. Don’t persist per-instance updates; keep durable state at per-batch granularity at most.
Alternative I’d push back on: “deploy by SSH to each instance and run a script”. Works at single-team scale; doesn’t scale to multi-region, doesn’t audit, doesn’t auto-rollback. Even small companies need a real CD system from day one; trying to grow into one mid-flight is painful.
What breaks first at 10× scale: the orchestrator’s per-deploy state. At 1,000 concurrent deploys, even durable workflow engines start to need sharded leadership. Per-pipeline sharding is the natural seam; co-locate per-service deploys on a single shard so dependency-DAG enforcement stays simple.
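A minimal sketch of the sharding idea, assuming the shard key is the service name so that every deploy of a service lands on the same orchestrator shard. `shard_for` is a hypothetical helper; plain modulo hashing reshuffles most keys when the shard count changes, so a production system might prefer consistent hashing.

```python
import hashlib

def shard_for(service_name: str, num_shards: int) -> int:
    """Stable shard assignment: the same service always maps to the same shard."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so is not stable across orchestrators.
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

for name in ["checkout", "search", "payments"]:
    print(name, shard_for(name, 16))
```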
Companies this resembles
Spinnaker (Netflix’s open-source CD), Argo CD / Argo Rollouts (Kubernetes-native), GitLab CI/CD, AWS CodeDeploy, internal systems at Google (Rapid), Meta (Conveyor), Stripe (Henson). Cousins: Terraform Cloud / Atlantis for infra deploys, kubectl rollout for raw k8s.
Related systems
- Distributed Task Scheduler — the workflow engine substrate.
- Distributed Monitoring — health observations the auto-rollback relies on.
- Blob Store — artifact storage substrate.
- AWS Kinesis — 2020 us-east-1 outage — postmortem of what happens when control-plane deploys go wrong.