Why Not Files? The Case for a DBMS

The limits of file-based systems that drove the move to databases — redundancy, integrity, concurrency, recovery.

Concept Foundational
6 min read
foundations dbms file-systems history

Summary#

Before databases, every application defined its own file formats — fixed-width records, CSV, custom binary blobs — and every program reading or writing those files reimplemented the same handful of services: redundancy control, integrity checking, concurrency mediation, crash recovery, and access control. The 1970s case for a DBMS is that those services are not application logic. They’re plumbing common to every program that touches shared persistent data, and writing them once inside a dedicated system is strictly better than writing them ad hoc in every program.

The case has only gotten stronger. Modern relational and NoSQL systems carry decades of compounded investment in those exact services. Reaching for files past a single-process toy is a decision to re-implement that work, and that work is genuinely hard.

Why it matters#

The motivating story shows up in every introductory DBMS course because it makes the abstractions concrete. A 1970s payroll system stored employee records in one file and tax tables in another; the HR app and the payroll app each opened those files directly. Five recurring problems followed every team using this pattern, and they’re the same five problems any modern engineer will hit if they try to skip the DBMS.

In interviews, the question often comes back as “why not just use S3 / a flat JSON file / a CSV?” — particularly in early-stage system-design discussions. The expected answer is the five-property list below, framed as “what would I have to rebuild?”.

How it works#

The pre-DBMS pain points decompose into specific failure modes, each of which a modern DBMS addresses with a named mechanism:

  • Redundancy and inconsistency. The customer address lives in the orders file, the shipping file, and the billing file. A customer moves; one file gets updated, two don’t. The database answer is normalisation — store the address once, reference it via a foreign key from every consumer.
  • Difficulty of access. Every new question (“which customers in Maharashtra ordered more than $500 last quarter?”) requires a new program. The database answer is a declarative query language — SQL — and a query planner that turns the question into a plan.
  • Data isolation. Files have different formats, encodings, and conventions. Aggregating across them is a series of one-off parsers. The database answer is a uniform data model plus a catalogue describing it.
  • Integrity problems. Business rules (account balance >= 0, every order has a customer) are enforced by whichever program happens to write the file — and forgotten by the next program that writes it. The database answer is declared constraintsNOT NULL, CHECK, FOREIGN KEY — enforced by the DBMS on every write.
  • Atomicity problems. Transferring money between accounts touches two records. A crash between the two writes leaves the system in an invalid state. The database answer is transactions and the WAL — the multi-step write either commits as a unit or doesn’t happen at all.
  • Concurrency problems. Two clerks update the same record at once; the later write clobbers the earlier without seeing it. The database answer is isolation — locks, MVCC, or both — giving each transaction the illusion of running alone.
  • Security and access control. File-system permissions are coarse (read or write the whole file). Per-row or per-column access is hand-rolled in the application, and easy to bypass. The database answer is row-level security and GRANT — fine-grained, declarative permissions.

A program reading a file sits at the top of a tall stack the developer is responsible for. A program reading a database delegates almost all of that stack to the DBMS:

application code
───────────────── DBMS boundary ─────────────────
parser → planner → executor
concurrency control · constraints · access control
buffer pool ↔ storage (heap, indexes) · WAL · replicas

Variants and trade-offs#

The “files” side has two legitimate variants worth taking seriously before dismissing it.

Plain files — a CSV, a JSON file, a binary log. Cheap, no operational overhead, perfect for build artefacts, exports, single-process configuration. Fails the moment two writers exist or the file becomes large enough to need an index.
Embedded databases (SQLite, RocksDB, LMDB) — file-like in deployment (one or a few files on disk, no daemon) but DBMS-like in semantics (ACID transactions, indexes, query language). The right answer for mobile apps, CLIs, single-process services, browser storage.

Other places where “files” can still be the right answer:

  • Append-only event logs for analytics ingestion — Kafka topics, S3 with a manifest. Schema lives in a registry; queries happen later in a warehouse. The DBMS still wins eventually, just downstream.
  • Object stores for blobs — images, videos, model checkpoints. Storing these as rows wastes the row-oriented overhead; an object store with a small metadata table in a DBMS is the standard pattern.
  • Configuration and code artefacts — YAML, TOML, JSON checked into version control. The DBMS overhead would dwarf the data.

The rule of thumb: as soon as the data is shared between writers, mutable at row granularity, and queryable by predicates beyond “the whole file”, reach for a DBMS. Below that bar, files are fine.

Why 'just lock the file' isn't a fix

Whole-file locks serialise all writers — fine for one or two, catastrophic at any meaningful concurrency. Row-level locking inside a file requires the writer to understand the file’s internal structure, parse and rewrite parts of it under a lock, and handle crash recovery if it dies mid-rewrite. By the time you’ve added a lock manager, a partial-write recovery protocol, an index for fast lookups, and a schema validator, you’ve built a DBMS — and almost certainly one with subtle bugs the published implementations don’t have.

When this is asked in interviews#

Two main shapes. The first is the literal “why use a database instead of files?” — usually as the second question after “what is a database?”. The answer walks the five problems and names the DBMS mechanism that solves each. Bonus points for distinguishing the embedded-vs-server variant, which signals you’ve thought about SQLite specifically.

The second is implicit in system design. The candidate proposes “store the user records as JSON files in S3”; the interviewer asks “what happens when two services update the same user concurrently?” — which is the concurrency leg of the same argument. The expected pivot is “ok, we need a DBMS for the writable record; S3 is fine for derived blobs”.

A common trap: candidates dismiss files entirely, then propose Redis or Elasticsearch for primary storage. Those have their own redundancy / integrity / durability stories — citing the DBMS argument doesn’t free you from explaining theirs.

Search ESC

Keyboard shortcuts

Shortcuts are disabled while typing in inputs.