Binary Data Formats — Protobuf, MessagePack, Avro
When the wire matters more than the diff. Schema-first vs schema-less, evolution semantics, the size-vs-debuggability trade-off.
What it is#
A binary data format encodes structured data as a compact byte stream rather than human-readable text. The wire is opaque — you cannot curl an endpoint and read the response — but the bytes are smaller, parsing is faster, and the schema (when present) catches mistakes at compile time that JSON would only catch in production.
Five binary formats matter in practice in 2026:
- Protocol Buffers (Protobuf) — Google, 2008 (public). Schema-first via `.proto` files; tag-length-value encoding with small-integer field numbers. The default wire format for gRPC and a common internal RPC format at large polyglot companies.
- MessagePack — open-source, 2008. JSON-compatible binary: same data model (objects, arrays, primitives) but a compact binary encoding. No schema required. Used by Redis (`MSGPACK` module), Pinterest, Treasure Data.
- Apache Avro — Hadoop ecosystem, 2009. Schema-with-data: the schema travels with the file (object container format) or is negotiated at the start of an RPC stream. JSON-defined schemas, dynamic typing at decode time. Standard for Kafka payloads via Confluent Schema Registry.
- Cap’n Proto — Sandstorm, 2013. Zero-copy decoding — the on-disk layout is the in-memory layout, so reading a field is a pointer arithmetic operation, not a parse. Used where parse latency is dominant.
- FlatBuffers — Google, 2014. Same zero-copy property as Cap’n Proto, originally built for games (Android) where parse-time GC pauses ruin frame rates.
The three that come up most in API-design interviews are Protobuf, MessagePack, and Avro — and the differences between them are the substance of the topic.
When to use it#
- Reach for Protobuf for internal RPC where you control both sides of the wire, where a schema-registry exists or can be added, and where polyglot code generation is a win. Default for gRPC, the natural pick for microservice meshes.
- Reach for MessagePack when you want a binary format with JSON semantics — no schema, dynamic decoding, drop-in replacement for `json.dumps`. Caches (Redis), local IPC, embedded contexts, places where introducing a `.proto` file is overkill.
- Reach for Avro when the data is streamed or stored rather than RPC’d — Kafka topics, Hadoop files, row-oriented data-lake files. The schema-with-data model fits batch and stream workloads where the consumer learns the schema at read time, not link time.
- Reach for FlatBuffers / Cap’n Proto when decode latency dominates — high-frequency trading, game engines, mobile UIs decoding large payloads. The zero-copy property lets you read a single field without parsing the whole message.
- Avoid binary formats for public APIs, browser-facing endpoints, debug logs, and configuration. The cost of “you can’t `curl` the response” outweighs the size win when humans are in the loop.
How it works#
The same logical message — a user record with three fields — across the three primary formats.
Protobuf: schema-first, field-numbered#
    syntax = "proto3";
    package example.v1;

    message User {
      string id = 1;
      string email = 2;
      int32 created_unix = 3;
    }

The `.proto` file is compiled to language-specific stubs (`protoc --python_out=. --go_out=. --js_out=. user.proto`). On the wire, a `User { id: "u_42", email: "a@b.co", created_unix: 1717000000 }` becomes these 20 bytes:

    0a 04 75 5f 34 32           // field 1 (id), length 4, "u_42"
    12 06 61 40 62 2e 63 6f     // field 2 (email), length 6, "a@b.co"
    18 c0 ae dd b2 06           // field 3 (created_unix), varint 1717000000

Field numbers are the wire identity. Renaming `email` to `user_email` in the `.proto` file is free; renumbering field 2 to field 4 silently breaks reads of every existing message on disk, because the old bytes still carry tag 2 and new readers no longer map it to `email`. Field-number stability is the first rule of Protobuf evolution.
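To make the byte layout concrete, here is a minimal sketch of the encoding rules in plain Python, with no Protobuf library involved. The helper names (`encode_varint`, `encode_tag`, `encode_user`) are invented for this illustration, and the sketch only covers the two wire types the `User` message needs (varint and length-delimited). Real code would use the generated stubs instead.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Wire type 2 (length-delimited): tag, byte length, UTF-8 payload."""
    payload = text.encode("utf-8")
    return encode_tag(field_number, 2) + encode_varint(len(payload)) + payload


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Wire type 0 (varint): tag followed by the varint value."""
    return encode_tag(field_number, 0) + encode_varint(value)


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    # Field numbers 1, 2, 3 mirror the User message defined above.
    return (
        encode_string_field(1, user_id)
        + encode_string_field(2, email)
        + encode_varint_field(3, created_unix)
    )


wire = encode_user("u_42", "a@b.co", 1717000000)
print(wire.hex(" "))
print(len(wire), "bytes")
```

Each field is emitted as an independent tag-plus-value chunk, which is why the field name never appears in the output and why the number in the tag is the only thing a reader can use to identify a field.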
MessagePack: schema-less, JSON-compatible#
    import msgpack

    data = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
    wire = msgpack.packb(data)
    # 40 bytes:
    # 83                                                      (3-element map)
    # a2 69 64  a4 75 5f 34 32                                ("id", "u_42")
    # a5 65 6d 61 69 6c  a6 61 40 62 2e 63 6f                 ("email", "a@b.co")
    # ac 63 72 65 61 74 65 64 5f 75 6e 69 78  ce 66 57 57 40  ("created_unix", 1717000000)

Field names appear in every message (the `id`, `email`, and `created_unix` strings are all there). No schema is required, but the cost is repeated string keys: for messages with many small repeated objects, MessagePack approaches JSON’s overhead.
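The key overhead can be measured directly. The following is a hand-rolled sketch of the three MessagePack type families this record uses (fixmap, fixstr, and unsigned integers), written in plain Python so that it runs without the `msgpack` package. The function names are made up for the example, and the integer helper is deliberately simplified: a real encoder would choose the shorter uint 8 or uint 16 forms for mid-sized values.

```python
import struct


def pack_str(text: str) -> bytes:
    """fixstr: one header byte 0xa0 | length, valid for up to 31 UTF-8 bytes."""
    payload = text.encode("utf-8")
    if len(payload) > 31:
        raise ValueError("this sketch only covers fixstr (<= 31 bytes)")
    return bytes([0xA0 | len(payload)]) + payload


def pack_uint(value: int) -> bytes:
    """Positive fixint for 0..127, otherwise uint 32 (0xce + 4 big-endian bytes)."""
    if 0 <= value <= 0x7F:
        return bytes([value])
    if 0 <= value <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", value)
    raise ValueError("this sketch only covers values that fit in uint 32")


def pack_map(record: dict) -> bytes:
    """fixmap: one header byte 0x80 | entry count, valid for up to 15 entries."""
    if len(record) > 15:
        raise ValueError("this sketch only covers fixmap (<= 15 entries)")
    out = bytearray([0x80 | len(record)])
    for key, value in record.items():
        out += pack_str(key)  # the key text is written into every message
        out += pack_str(value) if isinstance(value, str) else pack_uint(value)
    return bytes(out)


data = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
wire = pack_map(data)

# One header byte plus the key text, for each key in the map.
key_bytes = sum(1 + len(k.encode("utf-8")) for k in data)
print(len(wire), "bytes total,", key_bytes, "of them spent on keys")
```

On this record, more than half of the encoded bytes are key text, and that share is paid again for every message sent.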
Avro: schema-with-data, dynamic decoding#
    {
      "type": "record",
      "name": "User",
      "namespace": "example.v1",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "created_unix", "type": "int"}
      ]
    }

Avro encodes the values in schema order with no field tags — the bytes are just `len(id) | id | len(email) | email | varint(created_unix)`, where lengths and integers are zigzag-encoded varints. The decoder must have the writer schema to make sense of them.
In an Avro object container file (`.avro`), the schema is stored in the file header. In an Avro RPC stream, the schemas are exchanged in a handshake when the connection opens. In Kafka via Confluent Schema Registry, a small schema ID is embedded in each message and the registry holds the schema. The result: messages are about as compact as Protobuf, but no compile-time stub generation is required, because the decoder uses the schema dynamically.
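As an illustration of how little ends up on the wire, the sketch below encodes the same `User` record with Avro's binary rules in plain Python: strings are a length followed by UTF-8 bytes, and both lengths and `int` values are zigzag-encoded varints. The function names are invented for this example, and the schema ID `42` is a placeholder. The framing helper follows the Confluent wire format (a zero magic byte, then a 4-byte big-endian schema ID, then the Avro body).

```python
import struct


def zigzag_varint(value: int) -> bytes:
    """Avro int/long: fold the sign into bit 0 (zigzag), then base-128 varint."""
    zz = (value << 1) ^ (value >> 63)  # assumes value fits in a signed 64-bit range
    out = bytearray()
    while True:
        low7 = zz & 0x7F
        zz >>= 7
        if zz:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def avro_string(text: str) -> bytes:
    """Avro string: byte length as a zigzag varint, then the UTF-8 payload."""
    payload = text.encode("utf-8")
    return zigzag_varint(len(payload)) + payload


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    # Values are written in schema order (id, email, created_unix).
    # No field names and no tags appear in the output.
    return avro_string(user_id) + avro_string(email) + zigzag_varint(created_unix)


def confluent_frame(schema_id: int, avro_body: bytes) -> bytes:
    """Prefix the body with the magic byte 0 and a 4-byte big-endian schema ID."""
    return b"\x00" + struct.pack(">I", schema_id) + avro_body


body = encode_user("u_42", "a@b.co", 1717000000)
framed = confluent_frame(42, body)  # 42 is a hypothetical registry ID
print(body.hex(" "))
print(len(body), "bytes of Avro,", len(framed), "bytes once framed")
```

Without the schema, nothing in `body` says where one field ends and the next begins or what type each value has, which is why the writer schema (or an ID that resolves to it) has to travel alongside the data.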
Three-language serialize/deserialize with Protobuf#
The shape of a real RPC handler: deserialize a request, do something, serialize a response.
Python:

    # pip install protobuf
    # protoc --python_out=. user.proto  →  user_pb2.py
    from user_pb2 import User


    def encode(u_id: str, email: str, ts: int) -> bytes:
        u = User(id=u_id, email=email, created_unix=ts)
        return u.SerializeToString()  # 20 bytes for the example record


    def decode(wire: bytes) -> dict:
        u = User()
        u.ParseFromString(wire)
        return {"id": u.id, "email": u.email, "created_unix": u.created_unix}

Go:

    // protoc --go_out=. user.proto  →  user.pb.go
    package main

    import (
        "google.golang.org/protobuf/proto"

        pb "example.com/gen/example/v1"
    )

    // encode builds a User message and serializes it to wire bytes.
    func encode(id, email string, ts int32) ([]byte, error) {
        u := &pb.User{Id: id, Email: email, CreatedUnix: ts}
        return proto.Marshal(u)
    }

    // decode parses wire bytes back into a User message.
    func decode(wire []byte) (*pb.User, error) {
        u := &pb.User{}
        err := proto.Unmarshal(wire, u)
        return u, err
    }

JavaScript (Node.js):

    // npm install protobufjs
    // Or generate stubs with `protoc --js_out=...`
    const protobuf = require("protobufjs");

    // Loads the schema at runtime; no generated stubs are needed.
    async function load() {
      const root = await protobuf.load("user.proto");
      return root.lookupType("example.v1.User");
    }

    async function encode(id, email, ts) {
      const User = await load();
      // protobufjs exposes created_unix as camelCase createdUnix by default.
      const msg = User.create({ id, email, createdUnix: ts });
      return User.encode(msg).finish(); // Uint8Array
    }

    async function decode(wire) {
      const User = await load();
      return User.decode(wire);
    }

Schema evolution#
Each format has its own rules for what you can and cannot change without breaking deployed clients.
| Change | Protobuf | Avro | MessagePack |
|---|---|---|---|
| Add an optional field | Safe (assign new field number; old decoders ignore) | Safe (add to schema, provide a default) | Safe (new key in map; old decoders ignore) |
| Remove a field | Safe if you reserve the field number | Safe if readers that still expect the field declare a default for it | Safe (the key is just missing) |
| Rename a field | Safe on the binary wire (the name is metadata; the wire uses the number) | Breaking unless the reader schema lists the old name as an alias (names are not on the wire, but schema resolution matches fields by name) | Breaking (the name is on the wire) |
| Renumber a field | Breaking | N/A (no numbers) | N/A |
| Change a field’s type | Limited — int32 → int64 is safe, string → bytes is safe when the bytes are valid UTF-8; most others are breaking | Limited — only certain promotions are compatible (for example int → long, float → double) | Safe at the wire level; breaks the consumer’s expectations |
| Make a required field optional | Safe in proto3 (all fields are implicitly optional) | Safe (add a default) | N/A |
Protobuf’s superpower is the field-number stability rule. Decoders are forward and backward compatible by construction as long as you never reuse a field number. Avro’s superpower is reader-writer schema resolution — the same byte stream can be read by a decoder with a slightly different schema, and Avro defines exactly which differences are compatible.
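The reason an old Protobuf decoder can ignore a field it has never heard of is mechanical: every tag carries a wire type, and the wire type says how many bytes to step over. The sketch below shows that skip logic in plain Python. It is an illustration rather than how any particular Protobuf runtime is implemented: the `KNOWN_FIELDS` table stands in for a reader built against the original three-field `User`, field 4 (`display_name`) is a hypothetical later addition, and only the varint and length-delimited wire types are handled.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# What the older reader knows: fields 1-3 of User, nothing else.
KNOWN_FIELDS = {1: "id", 2: "email", 3: "created_unix"}


def decode_user(buf: bytes) -> dict:
    user = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = KNOWN_FIELDS.get(field_number)
        if name is None:
            continue  # unknown field number: already stepped over, keep going
        user[name] = value.decode("utf-8") if wire_type == 2 else value
    return user


# Bytes from a newer writer that also sets the hypothetical field 4.
v2_wire = bytes.fromhex(
    "0a04755f3432"      # field 1: "u_42"
    "12066140622e636f"  # field 2: "a@b.co"
    "18c0aeddb206"      # field 3: 1717000000
    "2203416461"        # field 4: "Ada" (unknown to this reader)
)
print(decode_user(v2_wire))
```

The same loop also shows why reusing a number is dangerous: the reader trusts the number alone, so bytes written for a retired field are decoded as whatever now owns that number.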
Variants#
Beyond the big three, five formats fill specific niches:
| Format | Origin | Niche |
|---|---|---|
| Cap’n Proto | Kenton Varda (ex-Protobuf), 2013 | Zero-copy decoding for ultra-low-latency RPC. |
| FlatBuffers | Google, 2014 | Zero-copy for games/mobile. Used by Android Wear, Cocos2d. |
| BSON | MongoDB, 2009 | Binary JSON with extra types (Date, ObjectId). Internal to MongoDB. |
| CBOR | RFC 7049 (2013), revised as RFC 8949 (2020) | “Binary JSON” with IETF backing. Used in COSE, CoAP, WebAuthn. |
| Thrift | Facebook, 2007 | Schema-first like Protobuf, paired with the Thrift RPC framework. Largely superseded by gRPC + Protobuf. |
Trade-offs#
What binary formats buy you:
- 2-5x smaller payloads on typical records (more on records with many small fields).
- 3-10x faster parsing in most languages. Protobuf and FlatBuffers typically benchmark faster than `encoding/json` in Go and `JSON.parse` in V8; against SIMD-accelerated parsers such as `simdjson` in C++ the margin is smaller and depends on the workload.
- Compile-time schema validation (Protobuf, Thrift, FlatBuffers, Cap’n Proto). A typo in a field name is a build error, not a runtime `KeyError`.
- Polyglot code generation. One `.proto` file, stubs in twelve languages.
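The payload-size claim is easy to check on the `User` record used throughout this page. The snippet below is a rough, single-record comparison rather than a benchmark: it serializes the record as compact JSON with the standard library and compares it against the Protobuf bytes for the same values, which are written out as a hex literal so the example has no third-party dependencies.

```python
import json

record = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}

# Compact JSON: no whitespace after the separators.
json_wire = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Protobuf encoding of the same record (User message, fields 1-3).
protobuf_wire = bytes.fromhex(
    "0a04755f3432"      # field 1 (id): "u_42"
    "12066140622e636f"  # field 2 (email): "a@b.co"
    "18c0aeddb206"      # field 3 (created_unix): varint 1717000000
)

ratio = len(json_wire) / len(protobuf_wire)
print(f"JSON {len(json_wire)} B, Protobuf {len(protobuf_wire)} B, ratio {ratio:.1f}x")
```

The ratio moves with the message shape: long string values shrink the gap because both formats store them verbatim, while many short numeric fields widen it.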
What they cost:
- Opacity on the wire. Debugging requires `protoc --decode_raw` or a Wireshark plugin. The on-call experience is worse.
- Schema-registry infrastructure. Avro effectively requires a schema registry in production (Confluent’s is standard). Protobuf doesn’t require one, but a `.proto` repository is its informal equivalent.
- Tooling weight. A new language must add Protobuf support before it can join the mesh. JSON works in every language for free.
- Evolution discipline. Field-number stability rules require team discipline and code review. A `.proto` workflow that lets anyone delete a field without marking it `reserved` invites a silent data-corruption bug the first time the number is reused.
Common pitfalls#
- Reusing a Protobuf field number. Field 5 used to be `legacy_email`; now you’ve assigned it to `flags`. Old messages on disk become catastrophically wrong. Always use `reserved` when deleting fields: `reserved 5; reserved "legacy_email";`.
- Treating Protobuf scalar defaults as semantically meaningful. In proto3, a missing `int32` and an `int32 = 0` are indistinguishable on the wire. If you need “explicitly zero” vs “absent”, use `optional` (Protobuf 3.15+) or a wrapper message.
- Storing Avro without the schema. Object container files include the schema; raw Avro frames do not. If you store raw frames without registering the schema, the data is effectively unrecoverable.
- Versioning the `.proto` file like code. A breaking change in a Protobuf message can sit deployed for months before a client tries to read an old message. Treat `.proto` files like database migrations — additive only, with explicit deprecation timelines.
- MessagePack for high-cardinality keys. Field names are on the wire — a million tiny records each with a 30-character key name burn 30MB just on key strings. Protobuf wins this case by roughly 30x on key overhead, since each field costs a one-byte tag instead of the key text.
- Comparing benchmarks without schemas. “JSON is 3x slower than Protobuf” depends on the message shape. Tiny messages with one or two fields: JSON and Protobuf are within `~1.5x`. Large nested messages with repeated fields: the gap widens to `~10x`.
- Forgetting endianness. Most binary formats specify endianness explicitly. Custom binary formats that don’t are landmines waiting for a client with a different byte order.
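The endianness pitfall is worth seeing once, because the failure mode is a plausible-looking wrong number rather than an error. This small sketch uses Python's `struct` module to write the example timestamp as a 4-byte unsigned integer in both byte orders and then reads the big-endian bytes back with the wrong assumption. The variable names are illustrative only.

```python
import struct

created_unix = 1717000000

big = struct.pack(">I", created_unix)     # big-endian ("network order")
little = struct.pack("<I", created_unix)  # little-endian

print(big.hex(" "), "|", little.hex(" "))

# A reader that assumes the wrong byte order still gets a number back.
# Nothing in the bytes signals the mismatch.
(misread,) = struct.unpack("<I", big)
print(misread, "instead of", created_unix)
```

Spelling out `>` or `<` in every format string, instead of relying on the platform's native order, is the simplest way for a custom format to avoid this.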
Related building blocks#
- Data Representation and Efficient Communication — the textual-vs-binary trade-off these formats sit on the binary side of.
- Textual Data Formats — JSON, XML, YAML — the other side of the trade-off.
- gRPC — Protobuf over HTTP/2 — gRPC is Protobuf’s natural RPC framework.
- Remote Procedure Calls (RPCs) — the wider RPC family that binary formats earn their keep inside.
- REST vs GraphQL vs gRPC — Comparison — where Protobuf-over-gRPC slots into the REST/GraphQL/gRPC comparison.