GitLab 2017 — The Database Outage

A mistaken `rm -rf` on the primary; five backup mechanisms that all failed; the public postmortem everyone should read.

What happened

On the evening of 31 January 2017, a GitLab.com site-reliability engineer ran rm -rf /var/opt/gitlab/postgresql/data on what he believed was a misbehaving secondary database server. It was the primary. Within seconds, ~300 GB of production PostgreSQL data was being deleted. He noticed and aborted — but only ~4.5 GB remained.

GitLab then discovered, over the next six hours, that none of their five backup or replication mechanisms had actually worked. Streaming replication had been broken for days. The daily pg_dump was producing empty files because of a Postgres version mismatch. The disk snapshots were configured but not running. The S3 backup bucket was empty. Azure disk snapshots existed for the file servers but not the database server. The only usable copy was a six-hour-old staging snapshot a different engineer had happened to take that afternoon — not as part of the backup process, but as an ad-hoc operation.

GitLab restored from the staging snapshot. Six hours of data — 5,037 projects, 5,000 comments, 707 user records, plus webhooks and CI/CD events created in that window — were permanently lost. The site was down for ~18 hours; data restoration took longer. GitLab published a detailed live-blogged postmortem during the outage that became, and remains, one of the most-read incident write-ups in the industry.

Context

GitLab in early 2017 was a fast-growing competitor to GitHub: ~1.5 million users, a self-hosted product plus a hosted SaaS (GitLab.com), the database on PostgreSQL 9.6. The team operating GitLab.com was small (a handful of SREs, no dedicated DBA) and ran a setup that had grown organically.

The database tier:

  • One primary PostgreSQL 9.6 server, write-heavy.
  • One standby via streaming replication, read-mostly.
  • Daily pg_dump written to disk and shipped offsite.
  • Periodic Azure disk snapshots (LVM-based) of the data volume.
  • Replication slot logs streamed to a separate offsite location.
  • S3 bucket configured as a backup target via a separate scheduled job.

Five backup paths, in industry-best-practice fashion. None had been end-to-end-tested in production by restoring from them — the textbook gap in every “we have backups” claim.

Before the outage, the on-call SRE had been responding to a spam attack: a flood of malicious user signups creating fake projects, which had caused replication to lag behind the primary by 4 GB. The standby was struggling to catch up; the SRE was trying to fix it. He attempted to clear a stuck PostgreSQL process on the standby, then re-initialise the standby’s data directory by deleting the existing files and re-pulling from the primary. He ran rm -rf in the standby’s data directory.

He was logged into the primary.

Trigger and propagation

The single command — rm -rf /var/opt/gitlab/postgresql/data on the primary — was the trigger. It propagated as:

  1. Immediate: rm started deleting files. The SRE noticed quickly (within ~2 seconds) and aborted with Ctrl+C, but the recursive delete had already removed most of the PostgreSQL data directory. The cluster could no longer serve queries.

  2. Within minutes: GitLab.com went read-only. The primary’s PostgreSQL process held some files open (active WAL writes), so a small amount of state was still in memory and could be flushed — but the data files themselves were gone. The standby had not yet caught up to the most recent writes; the team chose not to fail over because the spam attack had left the standby in an inconsistent state and they couldn’t tell how far behind it actually was.

  3. Within an hour: the team began checking backups. The streaming replication slot was found broken (since 24 January, after a deployment change had silently disabled it). The pg_dump files were 0 bytes: pg_dump 9.2 cannot dump a 9.6 cluster, and nobody was watching the cron job’s logs. Azure disk snapshots were not configured for the database server (only for the file servers). The S3 bucket was empty because a permissions misconfiguration had silently caused every upload to fail.

  4. Within ~6 hours: A different engineer had, earlier that day, taken an ad-hoc LVM snapshot of the staging environment for an unrelated test. That snapshot — created at 17:20 UTC, six hours before the deletion — was the most recent usable copy of production data. It was promoted to be the recovery source.

The failure was not one event but a cascade: the spam attack caused replication lag, the lag triggered the recovery attempt, the recovery attempt ran on the wrong server, and the wrong-server delete became unrecoverable because of the broken backups. Any one of the five backup mechanisms working would have bounded the data loss to seconds or minutes instead of six hours.
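The pg_dump failure in step 3 is the kind of mismatch a preflight check can catch before the nightly job runs. The sketch below is illustrative only and is not GitLab’s tooling: the function names are hypothetical, and it assumes the two version strings have already been collected (for example from `pg_dump --version` and from the server’s `SHOW server_version`).

```python
import re


def major_version(version_text: str) -> tuple[int, ...]:
    """Extract the PostgreSQL major version from text such as
    'pg_dump (PostgreSQL) 9.2.18' or '9.6.1'.

    Before PostgreSQL 10 the major version was the first two numbers (9.6);
    from 10 onward it is the first number only.
    """
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    first, second = int(match.group(1)), int(match.group(2))
    return (first,) if first >= 10 else (first, second)


def dump_client_is_compatible(pg_dump_version: str, server_version: str) -> bool:
    """pg_dump refuses to dump a server newer than itself,
    so the client's major version must be >= the server's."""
    return major_version(pg_dump_version) >= major_version(server_version)
```

With the versions from this incident, `dump_client_is_compatible("pg_dump (PostgreSQL) 9.2.18", "9.6.1")` is false, which is the condition that should fail the job loudly instead of leaving an empty file behind.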

Detection and response

The deletion itself was self-detected: the SRE realised the command was running in the wrong terminal and aborted within seconds. Monitoring on the PostgreSQL service fired immediately (no data files = no service).

The response was published live: GitLab opened a Google Doc, then a Twitter live-blog, then a YouTube live stream during the outage. The team narrated the recovery in real time, including the discoveries that each backup mechanism had failed. This level of transparency was unusual then and remains unusual now — most companies publish a postmortem days or weeks after the fact, cleaned up.

The recovery itself:

  1. Stop all writes — site already in read-only state.
  2. Identify the freshest usable copy of data — the staging LVM snapshot from 17:20 UTC.
  3. Copy it to the production server, then restore PostgreSQL atop it.
  4. Replay any WAL still on disk to bring the cluster as close to current as possible.
  5. Restart application services pointing at the recovered database.

Total wall-clock time from rm to site back online: ~18 hours. Data lost: ~6 hours of writes, including ~700 user accounts created in that window.

Root cause

The proximate cause was the operator command: rm -rf run on the primary by mistake. A root-cause analysis that ends there stops short. The actual chain:

  1. No prompt distinguishing primary from standby. The shells looked identical; no visual or textual indication of “you are on PRIMARY”. Even a small distinguishing cue in the prompt, such as a colour or a banner, could have caught the operator error before the command ran.
  2. No two-person rule for destructive operations. rm -rf on a database data directory has no undo; it should require either a second engineer’s confirmation or an explicit “I have verified this is the standby” challenge.
  3. Backups were configured but never tested. The single biggest lesson of the outage. Every backup mechanism was in place; none had been validated by restoring from it. “We have backups” is a hypothesis until proven by a successful restore.
  4. Monitoring of the backup pipeline itself was absent. The pg_dump job was producing 0-byte files for days; nothing alerted. The S3 upload was failing for weeks; nothing alerted. Backup jobs were monitored (“did cron run?”); backup outputs weren’t (“did the file contain a usable database?”).
  5. Replication slot health wasn’t monitored. The replication slot had been broken since 24 January — a week before the outage. The team noticed only when looking for it during recovery.
  6. No runbook for “primary lost”. The on-call had no documented sequence to follow; recovery was improvised. Every action had to be reasoned about from scratch under stress.

The deepest root cause is the divergence between “the system has redundancy” and “the redundancy works”. GitLab had textbook redundancy. None of it could be relied on at the moment it mattered.
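Item 4 in the chain above distinguishes monitoring the backup job from monitoring its output. A minimal sketch of an output check follows; the function name, thresholds, and message wording are assumptions made for illustration, not something taken from GitLab’s postmortem.

```python
import os
import time


def check_backup_artifact(path, min_bytes, max_age_seconds, now=None):
    """Return a list of problems with a backup file; an empty list means it passed.

    This checks the artifact itself (exists, big enough, recent enough),
    not whether the job that should have produced it ran.
    """
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return [f"{path}: backup file is missing"]

    problems = []
    if info.st_size < min_bytes:
        problems.append(
            f"{path}: only {info.st_size} bytes, expected at least {min_bytes}"
        )
    age = now - info.st_mtime
    if age > max_age_seconds:
        problems.append(
            f"{path}: last written {age:.0f}s ago, limit is {max_age_seconds}s"
        )
    return problems
```

A 0-byte dump fails the size check on the first night it appears. Size and age are only a cheap first line of defence, though: a file can be large and recent and still not restore, which is why a restore test remains the real proof.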

Lessons and changes

GitLab’s post-incident work, published in detail:

  • Rebuilt the backup infrastructure end-to-end. New tooling for automated backup verification — restore-test runs that take a backup file, restore it into an isolated environment, run sanity queries, and alert on failure.
  • Daily restore exercises. A scheduled job restores the latest backup into a parallel environment every 24 hours; success/failure is a first-class alert.
  • Distinguished prompts. Production shells became visually distinct from staging and from standbys; the prompt explicitly labels the host’s role.
  • Removed direct write access. Most database operations moved behind tooling that requires a ticket and a peer review.
  • Multiple backup paths confirmed working. Streaming replication, periodic logical dumps (with the correct pg_dump version), filesystem snapshots, and S3 archival — every path with its own monitoring and verification.
  • Postmortem culture. GitLab open-sourced the incident write-up itself; the blameless postmortem template became part of GitLab’s documented operations.
  • PostgreSQL upgrades. Migrated to versions with better replication observability and faster recovery.

The internal cultural shift was as important as the technical one: backups went from “infrastructure that exists” to “infrastructure that’s continuously tested under realistic restore conditions”.
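A restore exercise of the kind listed above can be sketched in a few lines. This is a hedged illustration, not GitLab’s implementation: it assumes a custom-format dump (as produced by `pg_dump -Fc`), that the standard PostgreSQL client tools `dropdb`, `createdb`, `pg_restore`, and `psql` are on the PATH, and that connection settings come from the environment. The function name, the scratch database name, and the sanity query are all hypothetical.

```python
import subprocess


def run_restore_drill(dump_path, scratch_db, sanity_sql, min_rows, run=subprocess.run):
    """Restore dump_path into a throwaway database and run one sanity query.

    Returns True only if the restore succeeds and the query yields an
    integer >= min_rows. `run` is injectable so the drill can be tested
    without a live PostgreSQL server.
    """
    # Start from a clean scratch database so earlier drills cannot mask a failure.
    run(["dropdb", "--if-exists", scratch_db], check=True)
    run(["createdb", scratch_db], check=True)
    try:
        run(["pg_restore", "--no-owner", "--dbname", scratch_db, dump_path], check=True)
        result = run(
            ["psql", "--dbname", scratch_db, "--no-align", "--tuples-only",
             "--command", sanity_sql],
            check=True, capture_output=True, text=True,
        )
        return int(result.stdout.strip()) >= min_rows
    except (subprocess.CalledProcessError, ValueError):
        # A failed restore, a failed query, or non-numeric output all count as failure.
        return False
    finally:
        # Best-effort cleanup; do not let a cleanup error hide the drill result.
        run(["dropdb", "--if-exists", scratch_db], check=False)
```

A scheduler would call something like `run_restore_drill("latest.dump", "restore_drill", "SELECT count(*) FROM projects", 1)` once a day and alert on a false result. The table name here is a placeholder; the point is that the alert is driven by a restored, queryable database and not by the existence of a file.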

What it teaches in general

The shape of this incident generalises far past GitLab:

  • Backups must be measured by successful restores, not by successful writes. A pipeline that produces backup files but cannot be used to restore is a pipeline that exists only to look reassuring.
  • Replication is not a backup. A streaming standby faithfully replays mistakes made through the database, such as a DROP TABLE or a bad DELETE, within seconds. A filesystem-level rm on the primary is not shipped through the WAL, but in this case the standby was lagging and in an unknown state, so it offered no safe failover target either. Backups must include some form of point-in-time recovery, not just continuous mirroring.
  • The operator-error scenario is the most important scenario. Hardware fails rarely; software fails predictably; human error happens daily. Systems designed for hardware failure but not for human error are systems that have not done the math on actual failure rates.
  • Visual distinction prevents whole classes of bugs. Different prompts, different colours, different backgrounds — small visible cues catch the kind of operator error this incident embodied.
  • Transparent postmortems beat polished ones. GitLab’s live narration of the recovery, including the embarrassing discoveries, built trust faster than a polished after-the-fact document would have. The industry’s adoption of public postmortems accelerated visibly after this incident.
  • Two-person rule for destructive ops. rm -rf on a data directory, DROP TABLE, truncate on a production database — operations with no recovery path should require a second pair of eyes or a strong typed confirmation (“yes, I want to delete PRIMARY”).
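
The typed confirmation in the last point can be sketched as a challenge phrase that embeds the host’s role. This is a minimal illustration with hypothetical function names and hostnames; it assumes the role is looked up from a trusted source (for PostgreSQL, `SELECT pg_is_in_recovery()` returns true on a standby) and not from the operator’s memory.

```python
def challenge_phrase(action: str, host: str, role: str) -> str:
    """Build the exact phrase the operator must type before a destructive action."""
    return f"yes, I want to {action} {role.upper()} {host}"


def confirm_destructive(action: str, host: str, role: str, typed: str) -> bool:
    """Return True only if the operator typed the challenge phrase exactly.

    Because the phrase contains the real role, an operator who believes
    they are on a standby has to type PRIMARY to proceed on the primary.
    """
    return typed.strip() == challenge_phrase(action, host, role)
```

An operator who expects a standby and types “yes, I want to delete STANDBY db1.example.com” on a host whose real role is primary is refused, and the mismatch between the expected and displayed phrase is itself the warning.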

Before GitLab 2017: most teams considered “we configured backups” sufficient evidence of resilience. Backup verification was an aspirational best practice.

After GitLab 2017: restore drills became table stakes for serious infrastructure teams. The “untested backup is no backup” framing is now industry-standard language. Tools like PostgreSQL’s pg_verifybackup, repository-level restore tests, and dedicated DR exercises became normalised.

The single line that's most quoted from this incident

From the GitLab postmortem: “Out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.” That sentence — that five mechanisms existed and none worked — is the line that got the incident remembered. It’s the line that crystallised the truth that having backups configured is not the same as having backups. Every SRE team reads it eventually; many quote it in their own internal docs to justify restore drills.
