Maintainability — System Design · Engineering Playbook

Summary

Maintainability is the cost of keeping a system running, evolving, and on-call-survivable over years. It is the non-functional requirement (NFR) with the longest feedback loop: a maintainability mistake in week one shows up as a 3 AM page in year three. It is also the NFR most often skipped in interviews, which is exactly why surfacing it is a senior signal.

Why it matters

Every architectural choice has an operational tail. A new microservice adds a deploy pipeline, a dashboard, an on-call rotation, a set of alerts, and a section of the runbook. A new datastore adds backup tooling, schema-migration mechanics, monitoring integration, and an “is the DBA on PTO?” question for every incident. These costs aren’t visible at design review; they’re visible at year-end on-call retros.

The seniority signal is treating maintainability as an input to the design rather than an afterthought. A senior engineer counts new services the way a CFO counts new vendors — every one needs justification.

How it works

Three sub-dimensions, each with its own pressure:

Operability

How easy is it to run? Includes: monitoring, alerting, dashboards, runbooks, on-call tooling, deploy mechanics, rollback story, configuration management. Bad operability shows up as long incidents (no one can find the metric) and high false-positive page rates (alerts on the wrong thing).

The test: a new on-call engineer rotates onto your service Friday at 5 PM. By Monday morning, were the pages they received useful or noisy? Could they diagnose each one without paging you?
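One way to make the "useful or noisy" question measurable is to tag each page as actionable or not and look for alert rules that mostly produce noise. The sketch below is illustrative only: the `Page` record, the alert names, and the thresholds (`max_noise_ratio`, `min_pages`) are assumptions, not part of any particular paging tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert: str        # name of the alert rule that fired (hypothetical field)
    actionable: bool  # did the responder actually have to do something?


def noisy_alerts(pages, max_noise_ratio=0.5, min_pages=3):
    """Return alert names whose share of non-actionable pages exceeds the threshold."""
    total = Counter(p.alert for p in pages)
    noise = Counter(p.alert for p in pages if not p.actionable)
    return sorted(
        name
        for name, count in total.items()
        # Ignore alerts with too few pages to judge.
        if count >= min_pages and noise[name] / count > max_noise_ratio
    )


pages = [
    Page("disk_usage_high", False),
    Page("disk_usage_high", False),
    Page("disk_usage_high", True),
    Page("checkout_error_rate", True),
    Page("checkout_error_rate", True),
    Page("checkout_error_rate", True),
]
print(noisy_alerts(pages))  # ['disk_usage_high']
```

Alerts that land on this list are candidates for retuning, downgrading to a ticket, or deletion.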

Simplicity (manageable complexity)

How much accidental complexity does the design carry? Includes: number of services, number of distinct datastores, number of programming languages, number of “exceptions to the pattern”. Simple isn’t easy — it’s expensive to keep a system simple over years of feature growth — but every avoided complexity pays compound interest.

The classic test: can a new hire describe the entire system on a whiteboard in 20 minutes after their first month? If not, complexity has accumulated.
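The counts named above (services, distinct datastores, languages) can be tracked over time from a service catalog. The following is a minimal sketch assuming a hypothetical catalog format, a list of dicts with `name`, `language`, and `datastores` keys; the service names are made up for illustration.

```python
def complexity_inventory(catalog):
    """Count services, distinct datastores, and distinct languages in a catalog."""
    return {
        "services": len(catalog),
        "datastores": len({d for svc in catalog for d in svc["datastores"]}),
        "languages": len({svc["language"] for svc in catalog}),
    }


# Hypothetical catalog entries, for illustration only.
catalog = [
    {"name": "checkout", "language": "go", "datastores": ["postgres", "redis"]},
    {"name": "search", "language": "python", "datastores": ["postgres"]},
    {"name": "notifier", "language": "go", "datastores": []},
]
print(complexity_inventory(catalog))
# {'services': 3, 'datastores': 2, 'languages': 2}
```

The absolute numbers matter less than the trend: a count that only ever goes up is the accumulation the whiteboard test is meant to catch.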

Evolvability

How cheap is the next change? Includes: API versioning, schema migrations, backward compatibility windows, deployability of one service without coordinating with others, blast radius of a refactor. Evolvability is what lets a team ship features in week three of a quarter without burning the first two weeks on coordination.
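One common technique for deploying a service without coordinating with its neighbours is a tolerant reader: the consumer ignores fields it does not know and supplies defaults for fields that older producers omit. The sketch below assumes a hypothetical `OrderEvent` payload; the field names are illustrative and not taken from any real API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderEvent:
    order_id: str
    amount_cents: int
    currency: str = "USD"  # added later; the default keeps old producers valid


def parse_order_event(payload: dict) -> OrderEvent:
    """Build an OrderEvent, ignoring fields this consumer does not know about."""
    known = {f.name for f in fields(OrderEvent)}
    return OrderEvent(**{k: v for k, v in payload.items() if k in known})


# An old producer omits `currency`; a newer one adds a field we have never seen.
old_producer = {"order_id": "o-1", "amount_cents": 1200}
new_producer = {"order_id": "o-2", "amount_cents": 900, "currency": "EUR", "coupon": "X"}

print(parse_order_event(old_producer))
print(parse_order_event(new_producer))
```

With this shape, producers can add optional fields and consumers can be upgraded in any order, which keeps the blast radius of a change to one service.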

Variants and trade-offs

Boring stack (one language, one database, one queue, one deploy pipeline) — fewer operational surfaces, easier on-call, easier hiring. Caps the optimisation ceiling — your one database isn’t the best for every workload.

Best-fit stack (Postgres for OLTP, ClickHouse for analytics, Redis for cache, Kafka for events) — each component optimal for its job. Operational surface is N× — every component is a separate runbook, monitoring story, upgrade cadence.

The observability triangle is where most maintainability investment lands:

Logs. Easy to add, expensive at scale, hard to query at scale unless paired with a real log search system. Best for: post-hoc debugging of specific requests.
Metrics. Cheap, fast, structured, low-cardinality. Best for: dashboards, alerts, SLOs. Worst for: “what happened to this one user”.
Traces. Mid-cost; expensive when sampled at 100%, cheap when sampled at 1%. Best for: cross-service request flow. Worst for: tail-latency causes under low-rate up-front sampling (since you usually didn’t sample the slow request).

Most systems need all three at different levels. The maintainability cost is not just the storage bill — it’s the cognitive load of three separate query languages and three separate UIs to correlate during an incident.
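The sampling caveat on traces has a well-known mitigation, tail-based sampling: decide whether to keep a trace after the request finishes, so slow and failed requests are always retained and only the uninteresting ones are sampled down. The sketch below is an assumption-laden illustration, not a tracing library's API; `keep_trace`, the 1000 ms threshold, and the 1% base rate are all hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, duration_ms: float, is_error: bool,
               slow_threshold_ms: float = 1000.0, base_rate: float = 0.01) -> bool:
    """Decide, after the request finishes, whether to store its trace."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True  # always keep the requests an incident will ask about
    # Hash the trace id so the decision is reproducible for a given trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Use the top 53 bits so the result is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < base_rate


print(keep_trace("trace-a", duration_ms=2500, is_error=False))  # True (slow)
print(keep_trace("trace-b", duration_ms=12, is_error=True))     # True (error)
```

The trade-off is that spans must be buffered until the request completes, which adds its own operational surface.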

Schema migrations and API versioning are the evolvability sub-questions that get drilled in real reviews:

Backward-compatible schema changes only. Add nullable columns; never rename or drop a column in one step. Phased deletes: stop writing, wait, stop reading, then drop.
API versioning. Bump major version, run two versions in parallel during the deprecation window, monitor the old version’s usage to zero, then remove. Window measured in months, not weeks.
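Both rules can be encoded as checks that run in review or CI. The sketch below assumes a hypothetical `Op` record describing one migration step and a list of daily call counts for the old API version. Neither corresponds to a real migration tool or metrics API, and the 30-day quiet period is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One step of a schema migration (hypothetical shape, not a real tool's)."""
    kind: str                     # "add_column", "rename_column" or "drop_column"
    column: str
    nullable: bool = True         # only meaningful for add_column
    unused_by_code: bool = False  # drop_column: no deployed code reads or writes it


def review(op: Op) -> list[str]:
    """Return reasons the step is not backward compatible (empty list means fine)."""
    if op.kind == "add_column" and not op.nullable:
        return [f"{op.column}: non-nullable add can break writers that do not set it"]
    if op.kind == "rename_column":
        return [f"{op.column}: one-step rename breaks code still using the old name"]
    if op.kind == "drop_column" and not op.unused_by_code:
        return [f"{op.column}: still read or written; retire usage before dropping"]
    return []


def can_remove_api_version(daily_calls: list[int], quiet_days: int = 30) -> bool:
    """True once the old version has seen zero calls for the last `quiet_days` days."""
    recent = daily_calls[-quiet_days:]
    return len(recent) == quiet_days and sum(recent) == 0


print(review(Op("add_column", "nickname")))            # []
print(review(Op("rename_column", "email")))            # one problem reported
print(can_remove_api_version([5, 0] + [0] * 29))       # True
```

Gating removal on observed usage rather than on a calendar date is what makes the deprecation window safe to close.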

Why on-call load is the canary metric

If a team’s on-call rotation averages more than 1–2 action-required pages per week, the system is not maintainable. The team burns out, attrition spikes, alert fatigue silences real alarms, and the next outage is the bad one. Maintainability targets are often expressed in terms of this metric: “no rotation gets more than X pages per week” is a more honest SLO than any uptime number.
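A page budget like this is easy to check from paging history. The sketch below assumes the input is a list of dates on which action-required pages fired for one rotation; the function name and the default budget of 2 are illustrative choices, with the budget standing in for the "X" in the target above.

```python
from collections import Counter
from datetime import date


def weeks_over_budget(page_dates, budget_per_week=2):
    """Map ISO (year, week) to page count, for weeks that exceed the budget."""
    per_week = Counter(tuple(d.isocalendar())[:2] for d in page_dates)
    return {week: n for week, n in sorted(per_week.items()) if n > budget_per_week}


# Hypothetical paging history: three pages in one week, one in the next.
page_dates = [
    date(2024, 3, 4),
    date(2024, 3, 5),
    date(2024, 3, 7),
    date(2024, 3, 12),
]
print(weeks_over_budget(page_dates))  # {(2024, 10): 3}
```

Reviewing this output in a retro turns "on-call feels bad" into a number that can be tracked against the target.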

When this is asked in interviews

Two reliable triggers:

Step 7 (evaluate). “What’s hardest to operate in this design?” The expected answer names a specific component — usually the one with the most state, the most cross-service contracts, or the most cache-warm dependencies — and proposes a mitigation.
Implicitly throughout. Every time the candidate adds a service or a datastore, the interviewer is silently counting them. Past 5 or 6 in a 45-minute design, the implicit question becomes “are all of these actually justified?”

More heavily weighted at:

Senior+ infrastructure / SRE / platform interviews — maintainability is half the actual job.
Companies known for operational excellence (Stripe, Cloudflare, Datadog, Google SRE).
Smaller-team interviews (startup, FAANG L4 and below) where the candidate will share an on-call rotation with the interviewer.

Common follow-ups:

“Walk me through what happens when this service pages your on-call at 3 AM.”
“How would a new hire learn this system in their first month?”
“Which of these microservices would you collapse into a monolith if you had the chance?”
“What’s your deploy-and-rollback story? Can you ship a fix in 10 minutes?”