What Is a Database? — DBMS · Engineering Playbook

Summary

A database is a shared, structured, queryable, persistent collection of data. A database management system (DBMS) is the software that gives those four words operational meaning — enforcing structure with schemas, exposing a query language, mediating concurrent access, and surviving crashes without losing committed work. Saying “the database” in conversation conflates the two, which is fine in casual speech and lethal in an interview answer where the next question is “what does the DBMS actually do for you?”.

The four properties are the smallest vocabulary that distinguishes a database from a pile of files, a single in-memory dictionary, a CSV on someone’s laptop, or a Google Sheet. Drop any one and you’ve got something else: a cache (drops persistence), a file system (drops structure), a private working set (drops sharing), or a blob store (drops queryability).

Why it matters

Every database question in an interview eventually traces back to one of those four properties. “How do you handle two users updating the same row?” is the sharing property. “What happens when the process crashes mid-write?” is persistence plus the durability guarantee that hangs off it. “How do you find all orders over $100 from last week?” is queryability. “Why can’t I just put a JSON field there?” is structure.

There’s also a career-shaped reason it matters: the industry has fifty years of investment compounded behind relational databases — query planners, index types, replication protocols, recovery algorithms, formal semantics. Reaching for files or a homegrown store usually means re-implementing a subset of that work, badly. The right default is “use a database”; the interesting question is which one and how.

How it works

The four properties decompose into concrete mechanisms inside any production DBMS:

Structured. A schema declares the shape of the data — tables and columns in a relational DBMS, collections and validators in a document store, key types in a wide-column store. The DBMS rejects writes that violate the schema. This is what catches typos and bad code at insert time instead of at read time, ten thousand rows later.
Queryable. A declarative language (SQL, MongoDB's query language, CQL) describes what data you want; the DBMS figures out how to get it. The query planner reads statistics, picks join orders and index strategies, and produces a physical plan. The user never writes the plan, which is the whole point.
Shared. Concurrent transactions read and write the same data without corrupting it. The DBMS uses locks, MVCC snapshots, or both to give each transaction the illusion of running alone. Isolation levels are the user-visible knob on how strong that illusion is.
Persistent. Committed writes survive crashes, power loss, and process restarts. This is achieved through a write-ahead log (WAL) that is flushed to disk before the write is acknowledged, plus replication to survive disk and node failure.
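The first two properties can be seen in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the `orders` table, its columns, and the sample rows are made up for illustration. It shows a schema rejecting a bad write at insert time, and a declarative query whose physical plan is chosen by the engine rather than the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk

# Structured: the schema declares the shape and the rules (hypothetical table).
conn.execute(
    """
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        total_usd  REAL NOT NULL CHECK (total_usd >= 0),
        created_at TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_orders_total ON orders (total_usd)")

conn.executemany(
    "INSERT INTO orders (customer, total_usd, created_at) VALUES (?, ?, ?)",
    [
        ("alice", 250.0, "2024-05-01"),
        ("bob", 40.0, "2024-05-02"),
        ("carol", 120.0, "2024-05-03"),
    ],
)

# A write that violates the schema is rejected at insert time.
try:
    conn.execute(
        "INSERT INTO orders (customer, total_usd, created_at) VALUES (?, ?, ?)",
        ("dave", -5.0, "2024-05-04"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Queryable: state what you want; the engine decides how to fetch it.
big_orders = conn.execute(
    "SELECT customer, total_usd FROM orders WHERE total_usd > ? ORDER BY customer",
    (100,),
).fetchall()

# Ask the planner which plan it picked (the text varies by SQLite version).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT customer FROM orders WHERE total_usd > ?",
    (100,),
).fetchall()
```

The caller never says "use `idx_orders_total`"; whether the index is used is the planner's decision, and `plan` only reports it.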

A request walks through the system roughly like this:

client ──> network ──> connection / session
                          │
                          ▼
                  parser → planner → executor
                          │
                          ▼
                buffer pool  ←→  storage (heap, indexes)
                          │
                          ▼
                       WAL (durability)
                          │
                          ▼
                    replicas (availability)

Each stage in the diagram is a chapter in the rest of the workbook. The point right now is that all of it exists to deliver the four words: structured, queryable, shared, persistent.
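The WAL step in that walk-through is the easiest one to sketch. What follows is a toy illustration of the log-first idea, not how any production DBMS implements it: real logs work on pages or records with checksums, sequence numbers, and checkpoints. The class name `TinyWalStore` and the JSON-lines record format are assumptions made up for this example.

```python
import json
import os
import tempfile


class TinyWalStore:
    """Toy key-value store: log first, apply second, replay on startup."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        # Recovery: rebuild in-memory state from every complete log record.
        if not os.path.exists(self.wal_path):
            return
        good_bytes = 0
        with open(self.wal_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # torn final record from a crash mid-write
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.data[record["key"]] = record["value"]
                good_bytes += len(raw)
        # Drop any torn tail so new appends start on a clean record boundary.
        with open(self.wal_path, "r+b") as f:
            f.truncate(good_bytes)

    def put(self, key, value):
        # 1. Append the change to the log.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Force it to stable storage before acknowledging.
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 3. Only now apply it in memory and return (the "acknowledgement").
        self.data[key] = value

    def close(self):
        self._wal.close()


# Usage: write, drop the in-memory state, then recover purely from the log.
wal_file = os.path.join(tempfile.mkdtemp(), "store.wal")

store = TinyWalStore(wal_file)
store.put("balance:alice", 100)
store.put("balance:alice", 75)
store.close()  # stands in for a crash: the in-memory dict is discarded

recovered = TinyWalStore(wal_file)
```

The ordering inside `put` is the part that matters: a write that has been acknowledged is already on disk, so replaying the log after a restart reproduces every committed change.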

Variants and trade-offs

Different DBMS families weight the four properties differently:

Relational (PostgreSQL, MySQL, Oracle, SQL Server) — strongest on structure (rigid schemas, foreign keys, CHECK constraints) and queryability (SQL with a closed algebra and decades of optimiser work). Strong sharing (ACID transactions). Persistent via WAL and replication. The right default for anything financial, account-centric, or schema-evolving.
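The combination of CHECK constraints and ACID transactions is what makes the relational default safe for account-style data. Here is a small sketch with Python's built-in `sqlite3`; the `accounts` table and the `transfer` helper are hypothetical names for illustration. The credit is applied first on purpose, so that a failing debit has something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw: CHECK fails
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the second call, bob's balance is unchanged even though his credit ran before the failing debit: the rollback undid it. That all-or-nothing behaviour is the atomicity part of ACID.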

NoSQL families (document, wide-column, key-value, graph) — relax structure (flexible or absent schemas), specialise queryability (often weaker than SQL), and may relax sharing (eventual consistency instead of ACID) to win on write throughput or horizontal scale. Right for specific workloads where one of those weights pays off.

Three secondary axes show up in interview discussion:

OLTP vs OLAP. OLTP (transactional) is many small reads and writes per second with strict latency. OLAP (analytical) is few but huge queries scanning gigabytes. Row-stores favour OLTP; column-stores favour OLAP. A single system rarely excels at both; the modern pattern is “OLTP database plus a separate analytics warehouse fed by change data capture (CDC)”.
Embedded vs server. SQLite runs in your process; Postgres runs in its own. Embedded gives single-writer simplicity; server gives the sharing property properly. Mobile apps and CLIs are SQLite territory; multi-user systems are server territory.
Self-hosted vs managed. Services such as RDS, Aurora, Cloud SQL, Spanner, and DynamoDB hand you a database without the operational tax. Cost is higher per hour but lower per engineer-day. For early-stage systems, managed almost always wins.
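The row-store versus column-store point from the OLTP vs OLAP axis comes down to how the same data is laid out. A toy illustration in plain Python, with made-up field names and no real storage engine involved:

```python
# The same three orders, laid out two ways (toy data, hypothetical fields).
rows = [
    {"id": 1, "customer": "alice", "total_usd": 250.0},
    {"id": 2, "customer": "bob", "total_usd": 40.0},
    {"id": 3, "customer": "carol", "total_usd": 120.0},
]

# Column layout: one contiguous list per attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# OLTP-style access: fetch every field of one record.
# The row layout already keeps those fields together.
order_2 = next(r for r in rows if r["id"] == 2)

# OLAP-style access: aggregate one field across all records.
# The column layout reads just that one list and skips the rest.
revenue = sum(columns["total_usd"])
```

On disk the same trade-off shows up as I/O: a row-store reads whole records to answer the aggregate, while a column-store has to stitch several columns back together to return one record.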

Why 'use a flat file' is rarely the right answer past 100 users

Once two processes can write to the same file, you need locking (sharing). Once a write can be interrupted, you need atomic rename or a journal (persistence). Once you want to find one row out of millions, you need an index (queryability). Once business rules evolve, you need a way to migrate without rewriting every reader (structure). At that point you’ve built a DBMS, badly. The break-even usually arrives by the time you have hit any two of those four needs.
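To make the "atomic rename" step concrete, here is roughly what a careful flat-file writer has to do for just one of the four needs. It is a sketch under assumptions: `atomic_write_json` is a hypothetical helper, it relies on `os.replace` being atomic within a single filesystem, and it does not fsync the containing directory or coordinate concurrent writers.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Replace `path` so readers see the old file or the new one, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: overwrite the same file twice, then read it back.
target = os.path.join(tempfile.mkdtemp(), "settings.json")
atomic_write_json(target, {"version": 1})
atomic_write_json(target, {"version": 2})
with open(target, encoding="utf-8") as f:
    current = json.load(f)
```

That is about twenty lines for crash-safe replacement of a single file, with locking, indexing, and migration still unsolved, which is the argument the paragraph above is making.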

When this is asked in interviews

This comes up in the opening minutes of any DBMS-track interview, often as “what’s a database, in your words?”. The answer interviewers want is the four-word definition plus a sentence on each word. Bonus points for separating the database from the DBMS in the same breath.

It also turns up as a setup for harder questions: “given your definition, where does Redis fit?” (mostly a cache, but persistence-as-an-option makes it a database too) or “why is a Google Sheet not a database?” (it is, sort of, but the DBMS is so weak that the sharing and structure properties barely hold). The interviewer is testing whether you can apply the definition, not recite it.

System-design loops use the same vocabulary at a higher level: “which database family?” is really “which weighting of the four properties does this system need?”.