Data Representation and Efficient Communication — API Design

Summary#

Every API call is, at some layer, bytes on a wire. Before those bytes leave your process they were a Python dict, a Go struct, a Java object; on the other side they will be parsed back into whatever shape that runtime prefers. The format in between — JSON, XML, YAML, Protobuf, MessagePack, Avro — is the wire representation, and choosing it is one of the load-bearing decisions in API design.

The choice runs along two axes. Textual versus binary — is the payload human-readable when you tcpdump the connection, or is it a stream of bytes only a decoder can interpret? Schema-on-wire versus schema-out-of-band — does the message carry its own field names, or does the decoder need the schema separately to make sense of it? These two axes give you four quadrants and almost every popular format sits cleanly in one.

A well-designed system uses two formats, not one. Public APIs, browser-facing endpoints, debug logs, and configuration get a textual format (almost always JSON). Internal high-throughput RPC, event streams, and storage formats get a binary format with a schema (Protobuf, Avro). The temptation to unify on one format is real and almost always wrong — the requirements for “developer types curl and reads the response” and “100k QPS between two services in the same VPC” are not the same.

Why it matters#

Three reasons the wire format is worth caring about explicitly:

It is irreversible at the contract level. You can rewrite the service behind your /orders endpoint. You cannot quietly switch from JSON to MessagePack — every client breaks. The wire format is part of the public contract.
It dominates CPU and bandwidth at scale. A feed-style payload encoded as JSON is often around 3x the size of the same data in Protobuf, and JSON parsing is commonly several times slower than Protobuf decoding (5-10x is a typical figure, but it varies by language and library). At small scale this is invisible; at LinkedIn or Meta scale it can be the difference between two data centres and three.
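It shapes how the API evolves. Adding a field is free in both JSON and Protobuf, but for different reasons (JSON consumers ignore unknown keys by convention; Protobuf's evolution rules are formal). Removing or renaming a field breaks JSON clients, and with no schema nothing catches the mistake before runtime. In Protobuf a rename leaves the binary encoding untouched, because only field numbers go on the wire, and any code still using the old name fails at build time against the regenerated stubs. Removing a field still breaks every reader that depends on it.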

For an interview, the fluency signal is: when asked “how does your service talk to the client?”, you name a format and explain why — not just “JSON because everyone uses it”.

How it works#

A wire format does four things. Every format trades them off differently.

1. Encode in-memory structures to bytes#

The serializer walks an object graph and emits a byte sequence. JSON walks dicts and lists and emits printable characters. Protobuf walks structs and emits a sequence of tagged fields: each field starts with a key that packs a small integer (the field number) together with a wire type, followed by the value, which is length-prefixed for strings, bytes, and nested messages. MessagePack walks the same structure as JSON but emits a binary stream with single-byte type prefixes.
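The three encodings described above can be compared on the one-field message {"id": 42}. This is a minimal sketch using only the standard library: the helper names are hypothetical, and the MessagePack and Protobuf encoders are hand-rolled for a tiny subset (small non-negative integers, short string keys). They are not replacements for the real libraries.

```python
import json


def encode_varint(n: int) -> bytes:
    """Protobuf-style base-128 varint for a non-negative integer."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def proto_varint_field(field_number: int, value: int) -> bytes:
    """Key = (field number << 3) | wire type; wire type 0 means varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def msgpack_tiny(obj: dict) -> bytes:
    """MessagePack subset: maps of up to 15 short str keys to ints 0..127."""
    if len(obj) > 15:
        raise ValueError("sketch only handles fixmap (<= 15 entries)")
    out = bytearray([0x80 | len(obj)])        # fixmap prefix
    for key, value in obj.items():
        raw = key.encode("utf-8")
        if len(raw) > 31 or not 0 <= value <= 127:
            raise ValueError("outside the subset this sketch supports")
        out.append(0xA0 | len(raw))           # fixstr prefix
        out += raw
        out.append(value)                     # positive fixint
    return bytes(out)


message = {"id": 42}
json_bytes = json.dumps(message).encode("utf-8")  # 10 bytes of text
msgpack_bytes = msgpack_tiny(message)             # hex 81a269642a
proto_bytes = proto_varint_field(1, 42)           # hex 082a

for name, data in [("json", json_bytes), ("msgpack", msgpack_bytes), ("protobuf", proto_bytes)]:
    print(f"{name:9} {len(data):2} bytes  {data.hex()}")
```

The key "id" is visible in both the JSON and the MessagePack bytes; the Protobuf bytes carry only the field number.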

2. Carry enough metadata for the decoder#

A JSON message says {"id": 42} — the decoder learns the field name (id), the type (number), and the value (42) from the bytes themselves. A Protobuf message says 08 2A (tag 1 with varint 42) — the decoder must already know that field 1 is called id and is of type int32. The schema is out of band.
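To make the out-of-band dependency concrete, here is a small decoding sketch. It assumes a message made only of varint fields, and the SCHEMA mapping is a hypothetical stand-in for what a generated Protobuf class would know from the .proto file.

```python
def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:       # high bit clear: last byte of the varint
            return result, pos
        shift += 7


# Hypothetical schema: field number -> field name. Not present in the bytes.
SCHEMA = {1: "id"}


def decode_message(buf: bytes, schema: dict[int, str]) -> dict[str, int]:
    """Decode a message containing only varint fields (wire type 0)."""
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only understands varint fields")
        value, pos = decode_varint(buf, pos)
        name = schema.get(field_number)
        if name is not None:      # unknown field numbers are skipped
            out[name] = value
    return out


wire = bytes.fromhex("082a")
print(decode_message(wire, SCHEMA))  # {'id': 42}
print(decode_message(wire, {}))      # {} : without the schema the field has no name
```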

Avro splits the difference: the schema travels with the data in batch files (Avro Object Container Files) but is negotiated at the start of an RPC stream so individual messages stay compact.

3. Define an evolution model#

What happens when the sender’s schema is newer than the receiver’s? JSON’s convention is “unknown fields are ignored, missing fields take the default.” Protobuf is formal: field numbers must never be reused, removed fields must be reserved, all new fields must be optional. Avro is even more explicit: it ships a writer schema and a reader schema and defines a compatibility check between them.

The evolution model determines whether you can deploy services independently. Without a clear model, a v2 service starts sending a new field and a v1 service hard-fails decoding — a self-inflicted outage during rollout.
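The JSON convention of ignoring unknown fields and defaulting missing ones can be sketched as a tolerant reader. The OrderV1 type and its fields are hypothetical and exist only for illustration; a real service would more likely use a validation library.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class OrderV1:
    """What a v1 receiver knows about an order (hypothetical fields)."""
    id: int
    status: str = "pending"   # default used when the sender omits the field


def parse_order(payload: str) -> OrderV1:
    raw = json.loads(payload)
    known = {f.name for f in fields(OrderV1)}
    # Drop keys this version does not know instead of failing on them.
    return OrderV1(**{k: v for k, v in raw.items() if k in known})


# A v2 sender adds "coupon"; the v1 receiver keeps working.
print(parse_order('{"id": 7, "status": "paid", "coupon": "SPRING"}'))
# An older sender omits "status"; the default fills the gap.
print(parse_order('{"id": 7}'))
```

A strict reader that passed every key straight to the constructor would raise a TypeError on the first payload, which is the rollout failure described above.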

4. Handle the language-mapping problem#

Bytes have to land in a typed runtime. Every format defines how its types map to Python int, Go int64, Java long. JSON’s number type is famously underspecified — 9007199254740993 (2^53 + 1) survives a round-trip through Python’s int but loses a bit through JavaScript’s Number. Protobuf is precise: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64. The cost is that the precision comes from the schema, not the wire bytes.
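The 2^53 + 1 case can be reproduced in Python alone, because Python's float is the same IEEE 754 double that JavaScript uses for Number. A short sketch follows; sending the value as a string is one common mitigation, shown here as an assumption rather than something the formats mandate.

```python
import json

n = 2**53 + 1  # 9007199254740993

# Python's json module maps JSON integers to arbitrary-precision int.
py_round_trip = json.loads(json.dumps({"id": n}))["id"]

# Simulate a decoder that stores every number as a double, as JavaScript does.
double_round_trip = json.loads(json.dumps({"id": n}), parse_int=float)["id"]

# Mitigation sketch: carry large identifiers as strings on the wire.
string_round_trip = int(json.loads(json.dumps({"id": str(n)}))["id"])

print(py_round_trip == n)       # True
print(int(double_round_trip))   # 9007199254740992, off by one
print(string_round_trip == n)   # True
```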

Variants and trade-offs#

The two quadrants that carry most API traffic, with representative formats in each:

Textual, schema-on-wire (JSON, XML, YAML). The bytes describe themselves. A human can curl the endpoint and read the response. Every language has a parser in the standard library. Cost: 2-5x the size of binary, slower to parse, no compile-time validation of message shape.

Binary, schema-out-of-band (Protobuf, FlatBuffers). A compact binary layout: a stream of tagged fields in Protobuf, offset tables in FlatBuffers. The schema is a separate file (.proto for Protobuf) shared between sender and receiver. Code generation produces typed stubs. Cost: opaque on the wire (tcpdump shows hex), tooling burden, and the need to distribute schemas, often through schema-registry infrastructure.

The other two quadrants:

Textual, schema-out-of-band — rare in API contexts. CSV is the canonical example: the column names live in the header row or out-of-band entirely. Used for data export, not RPC.
Binary, schema-on-wire — MessagePack, BSON, CBOR. The bytes carry both field names and values; the format is compact-binary but you still pay for repeated string keys. The middle ground: smaller than JSON, looser than Protobuf.

A more concrete comparison on a representative payload — a single order record with 8 fields:

Format	Size (bytes)	Parse time (relative to JSON)	Human-readable?	Schema required?
JSON	280	1.0x	Yes	No
XML	420	1.5x	Yes	No
YAML	240	2.0x	Yes	No
MessagePack	180	0.6x	No	No
Protobuf	90	0.2x	No	Yes
Avro (batched)	70	0.3x	No	Yes

Numbers are illustrative — real workloads vary by an order of magnitude based on string content, repeated fields, and decoder implementation.
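Because the figures vary so much, it is worth measuring your own payload. The sketch below compares compact JSON with a fixed binary layout built with the standard-library struct module, which stands in for a schema-out-of-band format. The order record, its field names, and the layout string are all hypothetical, and the sizes it prints will not match the table.

```python
import json
import struct

# Hypothetical 8-field order record; names and values are made up.
order = {
    "id": 1234567890,
    "customer_id": 987654321,
    "amount_cents": 15999,
    "currency": "USD",
    "quantity": 3,
    "status": 2,
    "created_at": 1700000000,
    "express": True,
}

# Schema-on-wire: field names travel with every message.
json_bytes = json.dumps(order, separators=(",", ":")).encode("utf-8")

# Schema-out-of-band: the layout and field order live in code, not in the bytes.
# "<" = little-endian, no padding; Q=u64, I=u32, 3s=3 bytes, H=u16, B=u8, ?=bool
ORDER_LAYOUT = struct.Struct("<QQI3sHBQ?")
FIELD_ORDER = tuple(order)                   # dicts preserve insertion order
CURRENCY_INDEX = FIELD_ORDER.index("currency")


def pack_order(o: dict) -> bytes:
    values = [o[name] for name in FIELD_ORDER]
    values[CURRENCY_INDEX] = values[CURRENCY_INDEX].encode("ascii")
    return ORDER_LAYOUT.pack(*values)


def unpack_order(buf: bytes) -> dict:
    values = list(ORDER_LAYOUT.unpack(buf))
    values[CURRENCY_INDEX] = values[CURRENCY_INDEX].decode("ascii")
    return dict(zip(FIELD_ORDER, values))


binary_bytes = pack_order(order)
print("json bytes:  ", len(json_bytes))
print("binary bytes:", len(binary_bytes))
print("round trip ok:", unpack_order(binary_bytes) == order)
```

The binary form is smaller because names and types are gone from the wire, and for the same reason it is unreadable without the layout definition.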

Pick by audience, not by fashion#

Public partner API: JSON. Non-negotiable. Every integrator can debug a JSON response with curl; nobody wants to install a Protobuf compiler to make their first call.
Internal RPC mesh: Protobuf over gRPC. Code-generated clients, fast parsing, formal evolution.
Browser front-end: JSON. Browsers parse it natively; debugging in DevTools is one click.
Event streams, record-oriented storage: Avro or Protobuf. Nobody reads the wire by hand, and storage cost dominates. (Avro is row-oriented; truly columnar storage usually means a columnar format such as Parquet.)
Configuration files: YAML or TOML. Humans write these by hand; readability matters far more than parse speed.

When this is asked in interviews#

The data-format question shows up in three shapes:

“How do you serialize the payload?” — The expected answer is a format plus a reason. “JSON because the consumer is a browser” or “Protobuf because we have a polyglot internal mesh and care about p99 parse latency.” The wrong answer is “JSON” without context.
“What happens when the schema changes?” — The interviewer is probing the evolution model. JSON tolerates unknown fields; Protobuf has formal field-number stability rules; Avro has reader/writer compatibility. Naming any one of these confidently is the senior signal.
“Why not just use one format everywhere?” — The expected answer: public surfaces optimise for developer experience (JSON), internal surfaces optimise for cost (Protobuf), and the cost of running both is small compared to the cost of mis-optimising either.

A common follow-up: “we have a Python client and a Go service, and the message is 50KB and called 10k times per second. What format?” The right reasoning chain is roughly: 50KB * 10k/s = 500MB/s of payload; at that rate JSON parsing alone can tie up multiple CPU cores in most runtimes, so a binary format is justified; both languages have first-class Protobuf support, so Protobuf is the natural pick; if readers touch only a few fields of each message and the message rate grows further, consider FlatBuffers for zero-copy access.
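The back-of-envelope arithmetic in that follow-up can be written down as a short sketch. The per-core parse throughputs are placeholder assumptions, not measurements; substitute numbers from your own benchmarks before drawing conclusions.

```python
import math


def payload_rate(message_bytes: int, messages_per_second: int) -> int:
    """Bytes per second that must be encoded and decoded."""
    return message_bytes * messages_per_second


def cores_needed(rate_bytes_per_s: int, parse_bytes_per_s_per_core: int) -> int:
    """Whole cores required if parsing were the only work being done."""
    return math.ceil(rate_bytes_per_s / parse_bytes_per_s_per_core)


rate = payload_rate(50_000, 10_000)   # 50KB * 10k/s = 500_000_000 B/s

# ASSUMED throughputs, for illustration only.
ASSUMED_JSON_BYTES_PER_S = 100_000_000    # hypothetical JSON parser, per core
ASSUMED_PROTO_BYTES_PER_S = 500_000_000   # hypothetical Protobuf decoder, per core

print("payload rate (MB/s):", rate / 1_000_000)
print("cores for JSON:     ", cores_needed(rate, ASSUMED_JSON_BYTES_PER_S))
print("cores for Protobuf: ", cores_needed(rate, ASSUMED_PROTO_BYTES_PER_S))
```

The exact core counts matter less than the shape of the argument: state the payload rate, state an assumed parse throughput, and let the ratio justify the format.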

Textual Data Formats — JSON, XML, YAML — JSON, XML, YAML in detail. The trio every API designer must read fluently.
Binary Data Formats — Protobuf, MessagePack, Avro — Protobuf, MessagePack, Avro. Schema-first vs schema-less binary.
HTTP — The Foundational Protocol for APIs — the transport layer that carries these bytes.
Remote Procedure Calls (RPCs) — where schema-first binary formats earn their keep.
REST vs GraphQL vs gRPC — Comparison — the broader REST/GraphQL/gRPC comparison that data-format choice feeds into.