Binary Data Formats — Protobuf, MessagePack, Avro

When the wire matters more than the diff. Schema-first vs schema-less, evolution semantics, the size-vs-debuggability trade-off.

Building Block Intermediate
10 min read
protobuf messagepack avro serialization binary

What it is#

A binary data format encodes structured data as a compact byte stream rather than human-readable text. The wire is opaque — you cannot curl an endpoint and read the response — but the bytes are smaller, parsing is faster, and the schema (when present) catches mistakes at compile time that JSON would only catch in production.

Five binary formats matter in practice in 2026:

  • Protocol Buffers (Protobuf): Google, 2008 (public). Schema-first via .proto files; tag-length-value encoding with small-integer field numbers. The default wire format for gRPC and a common internal RPC format at large polyglot companies.
  • MessagePack: open-source, 2008. JSON-compatible binary: same data model (objects, arrays, primitives) but a compact binary encoding. No schema required. Used by Redis (the cmsgpack library available to Lua scripts), Pinterest, Treasure Data.
  • Apache Avro: Hadoop ecosystem, 2009. Schema-with-data: the schema travels with the file (object container format) or is negotiated at the start of an RPC stream. JSON-defined schemas, dynamic typing at decode time. Standard for Kafka payloads via Confluent Schema Registry.
  • Cap’n Proto: Sandstorm, 2013. Zero-copy decoding: the on-disk layout is the in-memory layout, so reading a field is a pointer arithmetic operation, not a parse. Used where parse latency is dominant.
  • FlatBuffers: Google, 2014. Same zero-copy property as Cap’n Proto, originally built for games (Android) where parse-time GC pauses ruin frame rates.

The three that come up most in API-design interviews are Protobuf, MessagePack, and Avro — and the differences between them are the substance of the topic.

When to use it#

  • Reach for Protobuf for internal RPC where you control both sides of the wire, where a schema-registry exists or can be added, and where polyglot code generation is a win. Default for gRPC, the natural pick for microservice meshes.
  • Reach for MessagePack when you want a binary format with JSON semantics — no schema, dynamic decoding, drop-in replacement for json.dumps. Caches (Redis), local IPC, embedded contexts, places where introducing a .proto file is overkill.
  • Reach for Avro when the data is streamed or stored rather than RPC’d — Kafka topics, Hadoop files, columnar data lakes. The schema-with-data model fits batch and stream workloads where the consumer learns the schema at read time, not link time.
  • Reach for FlatBuffers / Cap’n Proto when decode latency dominates — high-frequency trading, game engines, mobile UIs decoding large payloads. The zero-copy property lets you read a single field without parsing the whole message.
  • Avoid binary formats for public APIs, browser-facing endpoints, debug logs, and configuration. The cost of “you can’t curl the response” outweighs the size win when humans are in the loop.

How it works#

The same logical message — a user record with three fields — across the three primary formats.

Protobuf: schema-first, field-numbered#

user.proto
syntax = "proto3";

package example.v1;

message User {
  string id = 1;
  string email = 2;
  int32 created_unix = 3;
}

The .proto file is compiled to language-specific stubs (for example protoc --python_out=. user.proto; Go and JavaScript output additionally need the protoc-gen-go and protoc-gen-js plugins installed). On the wire, a User { id: "u_42", email: "a@b.co", created_unix: 1717000000 } becomes:

0a 04 75 5f 34 32 // field 1 (id), length 4, "u_42"
12 06 61 40 62 2e 63 6f // field 2 (email), length 6, "a@b.co"
18 c0 ae dd b2 06 // field 3 (created_unix), varint 1717000000

Field numbers are the wire identity. Renaming email to user_email in the .proto file is free; renumbering field 2 to field 4 silently corrupts every existing message on disk. Field-number stability is the first rule of Protobuf evolution.
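To make the tag and varint layout concrete, here is a minimal sketch that encodes the example record by hand, without the protobuf library. The helper names are illustrative, not part of any real API, and the sketch only handles non-negative integers and strings; it also skips proto3's rule of omitting default values.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """A field key is (field_number << 3) | wire_type, itself a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, value: str) -> bytes:
    data = value.encode("utf-8")
    # Wire type 2 = length-delimited: tag, byte length, then the bytes.
    return tag(field_number, 2) + encode_varint(len(data)) + data


def encode_varint_field(field_number: int, value: int) -> bytes:
    # Wire type 0 = varint: tag, then the value.
    return tag(field_number, 0) + encode_varint(value)


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    return (
        encode_string_field(1, user_id)
        + encode_string_field(2, email)
        + encode_varint_field(3, created_unix)
    )


wire = encode_user("u_42", "a@b.co", 1717000000)
print(wire.hex(" "))  # 0a 04 75 5f 34 32 12 06 61 40 62 2e 63 6f 18 c0 ae dd b2 06
print(len(wire))      # 20
```

Note that the field names never appear in the output: only the numbers 1, 2, and 3 do, packed into the tag bytes 0a, 12, and 18.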

MessagePack: schema-less, JSON-compatible#

MessagePack — same record
import msgpack
data = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
wire = msgpack.packb(data)
# 40 bytes:
# 83 (3-element map)
# a2 69 64 a4 75 5f 34 32 ("id", "u_42")
# a5 65 6d 61 69 6c a6 61 40 62 2e 63 6f ("email", "a@b.co")
# ac 63 72 65 61 74 65 64 5f 75 6e 69 78 ce 66 57 57 40 ("created_unix", 1717000000 as uint 32)

Field names appear on every message (the id and email strings are there). No schema is required, but the cost is repeated string keys — for messages with many small repeated objects, MessagePack approaches JSON’s overhead.
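The cost of repeated keys can be measured directly. The following is a rough sketch, assuming a hand-rolled encoder for a small subset of MessagePack (short strings, non-negative integers, small maps) so that it runs without the msgpack package; the function name pack is illustrative and is not the library's API.

```python
import json
import struct


def pack(obj) -> bytes:
    """Pack a small MessagePack subset: fixstr, unsigned ints, fixmap."""
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) > 31:
            raise ValueError("sketch only handles fixstr (<= 31 bytes)")
        return bytes([0xA0 | len(data)]) + data      # fixstr: 101xxxxx
    if isinstance(obj, int) and not isinstance(obj, bool):
        if obj < 0:
            raise ValueError("sketch only handles non-negative ints")
        if obj <= 0x7F:
            return bytes([obj])                      # positive fixint
        if obj <= 0xFF:
            return b"\xcc" + struct.pack(">B", obj)  # uint 8
        if obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16
        if obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)  # uint 32
        return b"\xcf" + struct.pack(">Q", obj)      # uint 64
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("sketch only handles fixmap (<= 15 entries)")
        out = bytes([0x80 | len(obj)])               # fixmap: 1000xxxx
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


data = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
wire = pack(data)
as_json = json.dumps(data, separators=(",", ":")).encode("utf-8")
key_bytes = sum(len(pack(k)) for k in data)  # bytes spent on field names

print(len(wire), len(as_json), key_bytes)  # 40 56 22
```

For this record, more than half of the MessagePack bytes are field names, which is why the saving over compact JSON is modest.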

Avro: schema-with-data, dynamic decoding#

user.avsc — Avro schema
{
  "type": "record",
  "name": "User",
  "namespace": "example.v1",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "created_unix", "type": "int"}
  ]
}

Avro encodes the values in schema order with no field tags: the bytes are just len(id) | id | len(email) | email | created_unix, where lengths and integers are zigzag-encoded varints. The decoder must have the writer schema to make sense of them.

In an Avro object container file (.avro), the schema is stored in the file header. In Avro RPC, the schemas are exchanged during the connection handshake. In Kafka with Confluent Schema Registry, each message carries a small schema ID (a magic byte plus a 4-byte ID) and the registry holds the schema. The result: messages are about as compact as Protobuf, but no compile-time stub generation is required; the decoder uses the schema dynamically.
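As a sketch of what "no field tags" means in bytes, the following encodes the example record by hand and then wraps it in the Confluent framing (magic byte 0 followed by a 4-byte big-endian schema ID). The helper names and the schema ID 7 are made up for illustration; a real producer would use an Avro library and the ID returned by the registry.

```python
import struct


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag first, then a base-128 varint."""
    z = zigzag(n)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    # Values are written in schema order; no tags, no names.
    return encode_string(user_id) + encode_string(email) + encode_long(created_unix)


def confluent_frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an Avro body with magic byte 0 and a 4-byte big-endian schema ID."""
    return b"\x00" + struct.pack(">I", schema_id) + payload


body = encode_user("u_42", "a@b.co", 1717000000)
framed = confluent_frame(7, body)  # 7 is a hypothetical schema ID

print(body.hex(" "))            # 08 75 5f 34 32 0c 61 40 62 2e 63 6f 80 dd ba e5 0c
print(len(body), len(framed))   # 17 22
```

The body alone carries no hint of which field is which; swap id and email in the schema and the same bytes decode to a different record.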

Three-language serialize/deserialize with Protobuf#

The shape of a real RPC handler: deserialize a request, do something, serialize a response.

Protobuf — Python
# pip install protobuf
# protoc --python_out=. user.proto → user_pb2.py
from user_pb2 import User


def encode(u_id: str, email: str, ts: int) -> bytes:
    """Build a User message and serialize it to the Protobuf wire format."""
    u = User(id=u_id, email=email, created_unix=ts)
    return u.SerializeToString()  # 20 bytes for the example record


def decode(wire: bytes) -> dict:
    """Parse wire bytes into a User and expose its fields as a plain dict."""
    u = User()
    u.ParseFromString(wire)
    return {"id": u.id, "email": u.email, "created_unix": u.created_unix}

Schema evolution#

Each format has its own rules for what you can and cannot change without breaking deployed clients.

| Change | Protobuf | Avro | MessagePack |
| --- | --- | --- | --- |
| Add an optional field | Safe (assign new field number; old decoders ignore) | Safe (add to schema, provide a default) | Safe (new key in map; old decoders ignore) |
| Remove a field | Safe if you reserve the field number | Safe if the field has a default in the reader schema (used when the writer no longer sends it) | Safe (the key is just missing) |
| Rename a field | Safe (the name is metadata; the wire uses the number) | Breaking unless the reader schema declares an alias (fields are matched by name during schema resolution) | Breaking (the name is on the wire) |
| Renumber a field | Breaking | N/A (no numbers) | N/A |
| Change a field’s type | Limited: int32 → int64 is wire-compatible, string → bytes is compatible when the bytes are valid UTF-8; most others are breaking | Limited: only the promotions the spec lists are compatible (e.g. int → long → float → double, string ↔ bytes) | Safe at the wire level; breaks the consumer’s expectations |
| Make a required field optional | Safe in proto3 (there is no required; every field may be absent) | Safe (add a default) | N/A |

Protobuf’s superpower is the field-number stability rule. Decoders are forward and backward compatible by construction as long as you never reuse a field number. Avro’s superpower is reader-writer schema resolution — the same byte stream can be read by a decoder with a slightly different schema, and Avro defines exactly which differences are compatible.
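Forward compatibility in Protobuf falls out of the wire format: every field announces its wire type, so a decoder can skip fields whose numbers it does not recognize. A minimal sketch of an "old" decoder, assuming it only knows fields 1 to 3 of User and receives bytes from a newer producer that added a hypothetical string field 4:

```python
def read_varint(buf: bytes, pos: int) -> tuple:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Hypothetical old decoder: it predates field 4.
KNOWN_FIELDS = {1: "id", 2: "email", 3: "created_unix"}


def decode_user(buf: bytes) -> dict:
    out = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 1:    # fixed 64-bit
            value = buf[pos:pos + 8]
            pos += 8
        elif wire_type == 5:    # fixed 32-bit
            value = buf[pos:pos + 4]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        name = KNOWN_FIELDS.get(field_number)
        if name is None:
            continue            # unknown field number: already skipped, move on
        out[name] = value.decode("utf-8") if wire_type == 2 else value
    return out


old_part = bytes.fromhex("0a04755f343212066140622e636f18c0aeddb206")
new_field = bytes.fromhex("220370726f")  # tag 0x22 = field 4, wire type 2; length 3; "pro"
decoded = decode_user(old_part + new_field)
print(decoded)  # {'id': 'u_42', 'email': 'a@b.co', 'created_unix': 1717000000}
```

The same mechanism explains why reusing a number is dangerous: the decoder has no way to tell that field 4 now means something else, so it would interpret the bytes under the old definition.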

Variants#

Beyond the big three, five formats fill specific niches:

| Format | Origin | Niche |
| --- | --- | --- |
| Cap’n Proto | Kenton Varda (ex-Protobuf), 2013 | Zero-copy decoding for ultra-low-latency RPC. |
| FlatBuffers | Google, 2014 | Zero-copy for games/mobile. Used by Android Wear, Cocos2d. |
| BSON | MongoDB, 2009 | Binary JSON with extra types (Date, ObjectId). Internal to MongoDB. |
| CBOR | IETF, 2013 (RFC 7049, revised as RFC 8949 in 2020) | “Binary JSON” with IETF backing. Used in COSE, CoAP, WebAuthn. |
| Thrift | Facebook, 2007 | Schema-first like Protobuf, paired with the Thrift RPC framework. Largely superseded by gRPC + Protobuf. |

Trade-offs#

What binary formats buy you:

  • 2-5x smaller payloads on typical records (more on records with many small fields).
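  • 3-10x faster parsing in many languages. Protobuf and FlatBuffers typically benchmark faster than general-purpose JSON parsers such as encoding/json in Go; SIMD-accelerated parsers like simdjson in C++ and the heavily optimized JSON.parse in V8 narrow the gap considerably.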
  • Compile-time schema validation (Protobuf, Thrift, FlatBuffers, Cap’n Proto). A typo in a field name is a build error, not a runtime KeyError.
  • Polyglot code generation. One .proto file, stubs in twelve languages.

What they cost:

  • Opacity on the wire. Debugging requires protoc --decode_raw or a Wireshark plugin. The on-call experience is worse.
  • Schema-registry infrastructure. Avro effectively requires a schema registry in production (Confluent’s is standard). Protobuf doesn’t require one, but a .proto repository is its informal equivalent.
  • Tooling weight. A new language must add Protobuf support before it can join the mesh. JSON works in every language for free.
  • Evolution discipline. Field-number stability requires team discipline and code review. A workflow that lets anyone delete a field without reserving its number invites a silent data-corruption bug when that number is later reused.
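Part of that evolution discipline can be automated. The following is a deliberately simple sketch of a review-time check, assuming each schema version has already been reduced to a dict of {field_number: (name, type)}; that representation and the function name are assumptions for illustration, not the output of any real tool. It is conservative: it flags every type change, including ones that are wire-compatible.

```python
def find_violations(old_fields: dict, new_fields: dict, reserved: set) -> list:
    """Compare two versions of one message and list risky changes.

    old_fields / new_fields: {field_number: (name, type)}
    reserved: field numbers marked reserved in the new version
    """
    problems = []
    for number, (name, ftype) in sorted(old_fields.items()):
        if number not in new_fields:
            if number not in reserved:
                problems.append(
                    f"field {number} ({name}) was deleted without being reserved"
                )
            continue
        new_name, new_type = new_fields[number]
        if new_type != ftype:
            # Covers number reuse (legacy_email:string -> flags:int32) as well
            # as plain type edits; a rename alone is not flagged.
            problems.append(
                f"field {number} changed from {name}:{ftype} to {new_name}:{new_type}"
            )
    for number in sorted(new_fields):
        if number in reserved:
            problems.append(f"field {number} is both defined and reserved")
    return problems


old = {1: ("id", "string"), 2: ("email", "string"), 5: ("legacy_email", "string")}
reused = {1: ("id", "string"), 2: ("email", "string"), 5: ("flags", "int32")}
trimmed = {1: ("id", "string"), 2: ("email", "string")}

print(find_violations(old, reused, set()))   # one problem: field 5 changed type
print(find_violations(old, trimmed, set()))  # one problem: deleted, not reserved
print(find_violations(old, trimmed, {5}))    # []
```

A check like this cannot catch a number reused with the same type under a new meaning; that still needs a human reviewer.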

Common pitfalls#

  • Reusing a Protobuf field number. Field 5 used to be legacy_email; now you’ve assigned it to flags. Old messages on disk become catastrophically wrong. Always use reserved when deleting fields: reserved 5; reserved "legacy_email";.
  • Treating Protobuf scalar defaults as semantically meaningful. In proto3, a missing int32 and an int32 = 0 are indistinguishable on the wire. If you need “explicitly zero” vs “absent”, use optional (available by default since protobuf 3.15) or a wrapper message.
  • Storing Avro without the schema. Object container files include the schema; raw Avro frames do not. If you store raw frames without registering the schema, the data is unrecoverable.
  • Versioning the .proto file like code. A breaking change to a Protobuf message can sit in production for months before a client reads a message written under the old schema. Treat .proto files like database migrations: additive only, with explicit deprecation timelines.
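  • MessagePack for many small records with long keys. Field names are on the wire: a million tiny records, each with a 30-character key name, burn about 30 MB on key strings alone. Protobuf replaces each key with a one-byte tag (for field numbers 1 to 15), roughly a 30x saving on key overhead.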
  • Comparing benchmarks without schemas. “JSON is 3x slower than Protobuf” depends on the message shape. Tiny messages with one or two fields: JSON and Protobuf are within ~1.5x. Large nested messages with repeated fields: the gap widens to ~10x.
  • Forgetting endianness. Most binary formats specify endianness explicitly. Custom binary formats that don’t are landmines waiting for an ARM client.
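The key-overhead pitfall above is easy to check with back-of-the-envelope arithmetic. A small sketch, assuming one key per record and counting only the bytes that identify the field (the MessagePack string header plus key bytes, versus the Protobuf tag); the function names are illustrative.

```python
def msgpack_key_bytes(key: str) -> int:
    """Bytes MessagePack spends on one string key: header plus UTF-8 bytes."""
    n = len(key.encode("utf-8"))
    if n <= 31:
        return 1 + n   # fixstr
    if n <= 0xFF:
        return 2 + n   # str 8
    if n <= 0xFFFF:
        return 3 + n   # str 16
    return 5 + n       # str 32


def protobuf_tag_bytes(field_number: int) -> int:
    """Bytes Protobuf spends on one field tag, which is a varint of (number << 3 | type)."""
    key = field_number << 3  # the 3 wire-type bits do not change the varint length
    size = 1
    while key >= 0x80:
        key >>= 7
        size += 1
    return size


records = 1_000_000
key = "k" * 30  # a 30-character field name

msgpack_total = records * msgpack_key_bytes(key)
protobuf_total = records * protobuf_tag_bytes(1)

print(msgpack_total)                    # 31000000
print(protobuf_total)                   # 1000000
print(msgpack_total // protobuf_total)  # 31
```

Field numbers 16 and above need a two-byte tag, which is one reason to give the most frequently sent fields the numbers 1 to 15.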