GitLab 2017 — The Database Outage — Operating Systems

What happened

On January 31, 2017, a GitLab.com engineer responding to a database replication issue ran sudo rm -rf on what he thought was a secondary PostgreSQL database directory. It was actually the primary. Within seconds, ~300 GB of production data — issues, merge requests, comments, users, snippets — was gone.

The engineer caught it within ~2 seconds and Ctrl-C’d, but by then the directory was almost entirely empty. What followed over the next ~18 hours is the part that turned a recoverable incident into a public lesson: of the five backup-and-replication mechanisms GitLab had in place, none worked at the moment they were needed. The team eventually restored from a 6-hour-old snapshot taken by a staging server, losing 6 hours of writes — 5,000+ projects, 5,000+ comments, 700+ new users. Throughout the recovery, the team live-streamed the operation on YouTube and published a meticulous public postmortem.

Context

GitLab.com at the time ran on a small PostgreSQL setup with one primary, one streaming replica, and a constellation of backup mechanisms. The week leading up to the incident had been rough:

An influx of spam comments was driving replication lag on the secondary.
A long-running pg_basebackup to seed the secondary was failing repeatedly because the WAL stream couldn’t keep up.
Engineers had been fighting the lag through the day; the on-call had been awake for many hours.

The state at the moment of the incident: the secondary was lagging and its re-seed kept failing, the primary was overloaded, the engineer was tired, and the engineer’s terminal had two SSH sessions open with similar-looking prompts, one to db1 (primary) and one to db2 (secondary). He ran the destructive command in the wrong one.

This isn’t a story about a bad engineer. It’s a story about an operations design that depended on the engineer not making a class of mistake that fatigued humans reliably make.

Trigger and propagation

The immediate trigger was the rm -rf itself. The propagation was instant. On Linux, an open file does not block deletion: rm removed the directory entries, and the filesystem freed the blocks of every file that no process still held open. The running Postgres process carried on for a moment using cached pages, but it could not survive a checkpoint.
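The claim that an open file does not protect against deletion is easy to check on a POSIX system. The sketch below is illustrative only: it is not PostgreSQL code, and the function name is made up for this example. It deletes a file while a handle to it is still open, then reads the data back through that handle.

```python
import os
import tempfile


def unlink_while_open() -> tuple[bool, bytes]:
    """Delete a file that is still open, then read it through the open handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(b"row data")
        handle.flush()
        os.unlink(path)                      # succeeds on POSIX even though the file is open
        still_listed = os.path.exists(path)  # the name is gone from the directory
        handle.seek(0)
        contents = handle.read()             # the open descriptor can still read the data
    return still_listed, contents


if __name__ == "__main__":
    print(unlink_while_open())
```

On Linux this returns (False, b'row data'): the name disappears at once, while the data stays readable only until the last open handle closes. That matches a database process that keeps running briefly after its directory has been emptied.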

What was supposed to break the fall:

The streaming replica (db2). Out of sync at the moment of the incident — the very replication lag the engineer was fighting meant the secondary was hours behind. Promoting it would have lost hours of data anyway, and the on-call wasn’t sure of its actual state.
pg_dump logical backups, supposedly daily. Investigation found the cron job had been failing for years; the output files were empty. Nobody had checked.
LVM snapshots. Were configured but the snapshot volume had filled and snapshots had silently stopped.
Azure disk snapshots. Were configured for db2 only, not for the primary db1. (The team had assumed they covered both.)
S3 backups of WAL archives. The S3 bucket was empty — the credentials had been rotated and the upload process had been failing silently.

Five mechanisms; five failures.
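Several of these failures share a shape: a scheduled job ran, produced nothing useful, and told no one. A minimal sketch of a wrapper that refuses to fail quietly follows. The function name `run_backup` and its parameters are hypothetical, and this is not GitLab's tooling. It runs an arbitrary backup command and raises unless the command exits cleanly and leaves a non-empty output file.

```python
import os
import subprocess


def run_backup(command: list, output_path: str, min_bytes: int = 1) -> int:
    """Run a backup command; raise unless it exits 0 and leaves a non-empty file.

    Returns the size of the output file in bytes on success.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"backup command exited {result.returncode}: {result.stderr.strip()}"
        )
    # A zero exit code is not enough: an empty dump is still a failed backup.
    size = os.path.getsize(output_path) if os.path.exists(output_path) else 0
    if size < min_bytes:
        raise RuntimeError(
            f"backup output {output_path} is {size} bytes, expected at least {min_bytes}"
        )
    return size
```

Raising an exception only helps if something is listening. Under cron, the caller would still need to turn that exception into a page or an alert rather than an email nobody reads.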

Detection and response

Detection was immediate — the engineer realised within seconds. Response, however, was constrained by the absence of any working backup. The recovery path the team followed:

A staging server had taken a snapshot ~6 hours earlier. This was the only intact copy of the database that existed anywhere.
The team copied that snapshot from staging to a recovery host, restored it into a fresh PostgreSQL instance, and brought GitLab.com back up against the recovered copy.
The interval between the staging snapshot and the deletion — about 6 hours — was permanently lost. Users who had created issues, merge requests, comments, or accounts in that window had to recreate them.

Total downtime was about 18 hours. The team made the unusual choice to live-stream the recovery on YouTube — viewers watched, in real time, as the team navigated the restoration. The transparency was widely praised and helped maintain customer trust through a recovery that, for many SaaS companies, would have triggered a far worse public-relations response.

Root cause

Asking “what was the root cause” lets you find as many causes as you have time to dig. GitLab’s published postmortem identifies several layers:

Lack of safety on the destructive command. No confirmation prompt before deleting. No staging delay. No mv to a quarantine instead of rm. The command, once issued, was immediate.
Human factors. Fatigued on-call. Two similar-looking SSH sessions. A naming convention (db1, db2) that gave the engineer no terminal-visible cue about which environment he was in.
Untested backups. All five backup mechanisms were in some state of disrepair. None had been validated by attempting a restore. Several had been broken for years without anyone noticing.
No monitoring on backup success. The failing cron jobs raised no signal anywhere: no Slack ping, no PagerDuty incident, no dashboard with a freshness indicator.
A culture that treated backups as configured-once, never-tested. The implicit assumption was that configured = working.

The “root cause” is the intersection. Fix any one of these and the incident would have been either prevented or recoverable; missing all of them produced data loss.
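To make the first layer concrete, here is a minimal sketch of the "mv to a quarantine instead of rm" idea. The helper name `quarantine` and the directory layout are assumptions for illustration, not something taken from GitLab's runbooks.

```python
import os
import shutil
import time


def quarantine(path: str, quarantine_root: str) -> str:
    """Move `path` under `quarantine_root` instead of deleting it; return the new location."""
    os.makedirs(quarantine_root, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    name = os.path.basename(os.path.normpath(path))  # normpath drops any trailing slash
    target = os.path.join(quarantine_root, f"{name}.{stamp}")
    if os.path.exists(target):
        # Refuse to overwrite or nest inside an earlier quarantined copy.
        raise FileExistsError(target)
    shutil.move(path, target)
    return target
```

The design choice that matters is keeping the quarantine directory on the same filesystem as the data. There the move is a rename, so it is fast and undoing it is another rename. Across filesystems, `shutil.move` falls back to copy-then-delete, which is slow for a large data directory and needs the free space to hold a second copy.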

Lessons and changes

GitLab’s published remediation list ran to several pages. The highlights:

Validate every backup by restoring it. A backup that hasn’t been restored doesn’t exist. GitLab moved to automated periodic restore-and-test for every mechanism.
Monitor backup success, not just configuration. Every mechanism gained a Prometheus exporter for last-success-time. Alerts fired if that metric stopped advancing.
Standardise on a small number of mechanisms. Five was too many — the team consolidated on logical (pg_dump-based) plus physical (WAL archiving + base backups) with the rest deprecated.
Make destructive commands harder. Tools like safe-rm, trash-put, and policy-enforcing wrappers were introduced for production hosts.
Make terminal context visible. Production prompts gained colour-coded backgrounds and prominent hostname/environment labels.
Run “GameDay” exercises. Quarterly drills where the team practiced full database restoration against the production runbook.
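The last-success-time idea can be sketched in a few lines. The metric name `backup_last_success_timestamp_seconds`, the `mechanism` label, and both function names below are hypothetical and are not GitLab's actual exporter. The first function renders timestamps in the Prometheus text exposition format; the second applies the same staleness rule an alert would.

```python
import time


def render_metric(last_success: dict) -> str:
    """Render last-success Unix timestamps in the Prometheus text exposition format."""
    lines = ["# TYPE backup_last_success_timestamp_seconds gauge"]
    for mechanism, ts in sorted(last_success.items()):
        lines.append(
            f'backup_last_success_timestamp_seconds{{mechanism="{mechanism}"}} {ts}'
        )
    return "\n".join(lines) + "\n"


def stale_mechanisms(last_success: dict, max_age_seconds: float, now=None) -> list:
    """Return the mechanisms whose last success is older than max_age_seconds."""
    now = time.time() if now is None else now
    return sorted(m for m, ts in last_success.items() if now - ts > max_age_seconds)


if __name__ == "__main__":
    observed = {"pg_dump": 1000, "wal_archive": 5000}
    print(render_metric(observed), end="")
    print(stale_mechanisms(observed, max_age_seconds=3600, now=6000))  # ['pg_dump']
```

In Prometheus itself the staleness check would normally live in an alerting rule that compares `time()` with the metric, not in application code. The Python version only shows the logic: the alert keys on how old the last success is, so a job that stops running altogether still trips it.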

The runbook for replacing a primary, in particular, was rewritten to make “wrong target” much harder.
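One way a runbook step can resist "wrong target" is a guard that makes the operator type the hostname and then checks it against the machine the command is really running on. The sketch below is an assumption about what such a guard could look like. The `roles` map and the function name are hypothetical, and the source does not describe GitLab's rewritten runbook in this detail.

```python
import socket


def may_wipe_data_dir(typed_hostname: str, roles: dict, actual_hostname=None) -> bool:
    """Allow a data-directory wipe only on a host the operator named and that is a secondary."""
    actual = socket.gethostname() if actual_hostname is None else actual_hostname
    if typed_hostname != actual:
        # The operator believes they are on a different machine: stop here.
        return False
    # Unknown hosts and primaries are both refused.
    return roles.get(actual) == "secondary"


if __name__ == "__main__":
    roles = {"db1": "primary", "db2": "secondary"}  # hypothetical inventory
    print(may_wipe_data_dir("db2", roles, actual_hostname="db1"))  # False: wrong session
    print(may_wipe_data_dir("db2", roles, actual_hostname="db2"))  # True
```

A static role map can go stale after a failover, so a real guard would more plausibly ask the database itself, for example with PostgreSQL's `pg_is_in_recovery()`, which returns true on a standby. The map keeps this sketch self-contained.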

What it teaches in general

GitLab 2017 is the canonical postmortem for any team operating a stateful production database. The structural lessons generalise:

A backup you haven’t restored isn’t a backup. This is true of every backup, every snapshot, every replication setup. The only validated backup is one you’ve successfully restored end-to-end against a test environment.
“It’s configured” is not “it’s working”. Cron jobs fail silently. Disk volumes fill. Credentials rotate. Without monitoring the output, the success metric, the freshness timestamp, you don’t actually know.
Fatigue is part of the threat model. Production runbooks must be safe for tired humans operating at 2am. Visual cues, type-the-hostname-to-confirm guards, and undo-able destructive commands are not paranoia — they’re load-bearing.
Public postmortems are a moat. GitLab’s transparency through this incident turned what could have been a brand catastrophe into a marketing event. The lesson there isn’t to court disaster; it’s that openness about real failures earns more trust than a sanitised “all systems normal” stance.

What I look for now when reviewing a backup strategy

Three questions, in order. (1) When was the last time you restored the most-recent backup into a fresh environment and confirmed the restored data was complete? (2) What is the dashboard showing the freshness timestamp of every backup mechanism in production, and what’s the alert if that timestamp stops advancing? (3) What’s the runbook for a full recovery, and when was the last time someone unfamiliar with the system followed it end-to-end? If any of those answers is “never” or “I don’t know”, you have GitLab 2017 in your future.
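Question (1) can be automated as a restore drill. The sketch below uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; a PostgreSQL drill would restore a real dump or base backup instead. The function names are hypothetical. It records per-table row counts at backup time, restores into a fresh database, and reports every table whose count differs.

```python
import sqlite3


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {n: conn.execute(f'SELECT COUNT(*) FROM "{n}"').fetchone()[0] for n in names}


def verify_restore(expected: dict, restored: sqlite3.Connection) -> dict:
    """Return {table: (expected_count, actual_count)} for every mismatching table."""
    actual = table_counts(restored)
    return {
        t: (expected.get(t), actual.get(t))
        for t in set(expected) | set(actual)
        if expected.get(t) != actual.get(t)
    }


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
    source.executemany("INSERT INTO issues (title) VALUES (?)", [("a",), ("b",)])
    source.commit()
    expected = table_counts(source)

    fresh = sqlite3.connect(":memory:")  # the "fresh environment"
    source.backup(fresh)                 # stand-in for restoring a backup
    print(verify_restore(expected, fresh))  # {} means every table matched
```

Row counts are a weak completeness check: they catch an empty or truncated restore, not corrupted contents. A fuller drill would also run application-level queries against the restored copy.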