Binary Data Formats — Protobuf, MessagePack, Avro — API Design

What it is#

A binary data format encodes structured data as a compact byte stream rather than human-readable text. The wire is opaque — you cannot curl an endpoint and read the response — but the bytes are smaller, parsing is faster, and the schema (when present) catches mistakes at compile time that JSON would only catch in production.

Five binary formats matter in practice in 2026:

Protocol Buffers (Protobuf) — Google, 2008 (public). Schema-first via .proto files; tag-length-value encoding with small-integer field numbers. The default wire format for gRPC and a common internal RPC format at large polyglot companies.
MessagePack — open-source, 2008. JSON-compatible binary: same data model (objects, arrays, primitives) but a compact binary encoding. No schema required. Used by Redis (the cmsgpack library available to Lua scripts), Pinterest, Treasure Data.
Apache Avro — Hadoop ecosystem, 2009. Schema-with-data: the schema travels with the file (object container format) or is negotiated at the start of an RPC stream. JSON-defined schemas, dynamic typing at decode time. Standard for Kafka payloads via Confluent Schema Registry.
Cap’n Proto — Sandstorm, 2013. Zero-copy decoding — the on-disk layout is the in-memory layout, so reading a field is a pointer arithmetic operation, not a parse. Used where parse latency is dominant.
FlatBuffers — Google, 2014. Same zero-copy property as Cap’n Proto, originally built for games (Android) where parse-time GC pauses ruin frame rates.

The three that come up most in API-design interviews are Protobuf, MessagePack, and Avro — and the differences between them are the substance of the topic.

When to use it#

Reach for Protobuf for internal RPC where you control both sides of the wire, where a schema-registry exists or can be added, and where polyglot code generation is a win. Default for gRPC, the natural pick for microservice meshes.
Reach for MessagePack when you want a binary format with JSON semantics — no schema, dynamic decoding, drop-in replacement for json.dumps. Caches (Redis), local IPC, embedded contexts, places where introducing a .proto file is overkill.
Reach for Avro when the data is streamed or stored rather than RPC’d: Kafka topics, Hadoop files, row-oriented files landing in a data lake. The schema-with-data model fits batch and stream workloads where the consumer learns the schema at read time, not link time.
Reach for FlatBuffers / Cap’n Proto when decode latency dominates — high-frequency trading, game engines, mobile UIs decoding large payloads. The zero-copy property lets you read a single field without parsing the whole message.
Avoid binary formats for public APIs, browser-facing endpoints, debug logs, and configuration. The cost of “you can’t curl the response” outweighs the size win when humans are in the loop.
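The zero-copy idea behind the FlatBuffers / Cap’n Proto recommendation above can be illustrated with a toy layout. This is only a sketch: the fixed header (`HEADER`) and the helper names are assumptions made up for illustration, and the layout is not the real FlatBuffers or Cap’n Proto format. It shows one field being read at a known offset without decoding the rest of the buffer.

```python
import struct

# Hypothetical layout (little-endian):
#   u32 created_unix | u16 id_len | u16 email_len | id bytes | email bytes
HEADER = struct.Struct("<IHH")


def build(user_id: str, email: str, created_unix: int) -> bytes:
    """Lay the record out so every field sits at a computable offset."""
    id_raw = user_id.encode("utf-8")
    email_raw = email.encode("utf-8")
    return HEADER.pack(created_unix, len(id_raw), len(email_raw)) + id_raw + email_raw


def read_created_unix(buf: memoryview) -> int:
    """One fixed-offset read; no other part of the buffer is touched."""
    return struct.unpack_from("<I", buf, 0)[0]


def read_email(buf: memoryview) -> memoryview:
    """Return a view into the original buffer rather than a copy."""
    _, id_len, email_len = HEADER.unpack_from(buf, 0)
    start = HEADER.size + id_len
    return buf[start:start + email_len]


record = memoryview(build("u_42", "a@b.co", 1717000000))
print(read_created_unix(record))
print(bytes(read_email(record)).decode("utf-8"))
```

A tag-based format such as Protobuf has to walk the preceding fields to find the one it wants; with offsets known up front, the cost of reading one field does not grow with message size.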

How it works#

The same logical message — a user record with three fields — across the three primary formats.

Protobuf: schema-first, field-numbered#

syntax = "proto3";
package example.v1;

message User {
  string id = 1;
  string email = 2;
  int32 created_unix = 3;
}

The .proto file is compiled to language-specific stubs (protoc --python_out=. --go_out=. --js_out=. user.proto). On the wire, a User { id: "u_42", email: "a@b.co", created_unix: 1717000000 } becomes roughly:

0a 04 75 5f 34 32        // field 1 (id), length 4, "u_42"
12 06 61 40 62 2e 63 6f  // field 2 (email), length 6, "a@b.co"
18 c0 ae dd b2 06        // field 3 (created_unix), varint 1717000000

Field numbers are the wire identity. Renaming email to user_email in the .proto file is free; renumbering field 2 to field 4 silently corrupts every existing message on disk. Field-number stability is the first rule of Protobuf evolution.
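To make the tag and varint mechanics concrete, here is a minimal hand-rolled encoder for the example record. It is a sketch for illustration, not how you would encode Protobuf in practice (use the generated stubs): the function names are made up, it handles only non-negative integers and strings, and unlike a real proto3 encoder it does not omit fields that hold their default value.

```python
WIRE_VARINT = 0  # wire type for int32, int64, bool, enum
WIRE_LEN = 2     # wire type for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag is the field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    raw = text.encode("utf-8")
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(raw)) + raw


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    return (
        encode_string_field(1, user_id)
        + encode_string_field(2, email)
        + encode_tag(3, WIRE_VARINT)
        + encode_varint(created_unix)
    )


wire = encode_user("u_42", "a@b.co", 1717000000)
print(wire.hex(" "))
print(len(wire), "bytes")
```

Note that the field names never reach the output; only the numbers 1, 2 and 3 do, which is why a rename is free and a renumber is not.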

MessagePack: schema-less, JSON-compatible#

import msgpack
data = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
wire = msgpack.packb(data)
# 40 bytes:
# 83  (3-element map)
# a2 69 64 a4 75 5f 34 32  ("id", "u_42")
# a5 65 6d 61 69 6c a6 61 40 62 2e 63 6f  ("email", "a@b.co")
# ac 63 72 65 61 74 65 64 5f 75 6e 69 78 ce 66 57 57 40  ("created_unix", 1717000000)

Field names appear on every message (the id and email strings are there). No schema is required, but the cost is repeated string keys — for messages with many small repeated objects, MessagePack approaches JSON’s overhead.
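The cost of repeated keys is easy to measure. The sketch below is a tiny standard-library encoder covering only the MessagePack types this record needs (fixmap, fixstr, and unsigned integers up to 32 bits); the function names are assumptions for illustration, and real code would call msgpack.packb instead.

```python
import struct


def pack_str(text: str) -> bytes:
    raw = text.encode("utf-8")
    if len(raw) > 31:
        raise ValueError("sketch only covers fixstr (up to 31 bytes)")
    return bytes([0xA0 | len(raw)]) + raw  # fixstr: 101xxxxx, then the bytes


def pack_uint(value: int) -> bytes:
    if value < 0:
        raise ValueError("sketch only covers unsigned integers")
    if value <= 0x7F:
        return bytes([value])                    # positive fixint
    if value <= 0xFF:
        return b"\xcc" + struct.pack(">B", value)  # uint 8
    if value <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", value)  # uint 16
    if value <= 0xFFFFFFFF:
        return b"\xce" + struct.pack(">I", value)  # uint 32
    raise ValueError("sketch stops at 32-bit integers")


def pack_map(record: dict) -> bytes:
    if len(record) > 15:
        raise ValueError("sketch only covers fixmap (up to 15 entries)")
    out = bytearray([0x80 | len(record)])  # fixmap: 1000xxxx
    for key, value in record.items():
        out += pack_str(key)
        out += pack_str(value) if isinstance(value, str) else pack_uint(value)
    return bytes(out)


record = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}
wire = pack_map(record)
key_bytes = sum(len(pack_str(key)) for key in record)
print(len(wire), "bytes total,", key_bytes, "of them spent on keys")
```

For this record more than half of the encoded size is key strings, and that share is paid again on every message.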

Avro: schema-with-data, dynamic decoding#

{
  "type": "record",
  "name": "User",
  "namespace": "example.v1",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "created_unix", "type": "int"}
  ]
}

Avro encodes the values in schema order with no field tags or names: the bytes are just len(id) | id | len(email) | email | created_unix, where the lengths and the int are zigzag-encoded varints. The decoder must have the writer schema to make sense of them.

In an Avro object container file (.avro), the schema is stored in the file header. In Kafka with Confluent Schema Registry, each message carries a small schema ID (a magic byte plus a 4-byte ID) and the registry holds the schema; Avro RPC instead exchanges schemas in a handshake when the connection starts. The result: messages are about as compact as Protobuf, but no compile-time stub generation is required because the decoder uses the schema dynamically.
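A hand-rolled sketch of the Avro body for the example record, plus the Confluent-style framing, shows how little ends up on the wire. The function names and the schema ID 7 are hypothetical; production code would use an Avro library and a registry client rather than this.

```python
import struct


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Avro int and long share one encoding: zigzag, then a base-128 varint."""
    value = zigzag(n)
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string(text: str) -> bytes:
    raw = text.encode("utf-8")
    return encode_long(len(raw)) + raw


def encode_user(user_id: str, email: str, created_unix: int) -> bytes:
    # Values only, in schema field order. No tags, no names.
    return encode_string(user_id) + encode_string(email) + encode_long(created_unix)


def frame_confluent(schema_id: int, payload: bytes) -> bytes:
    # Magic byte 0, then a 4-byte big-endian schema ID, then the Avro body.
    return struct.pack(">bI", 0, schema_id) + payload


body = encode_user("u_42", "a@b.co", 1717000000)
framed = frame_confluent(7, body)  # 7 is a made-up schema ID
print(body.hex(" "))
print(len(body), "byte body,", len(framed), "bytes framed")
```

Without the schema, the body is just a run of bytes: nothing in it says where one field ends and the next begins, or which field is which.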

Three-language serialize/deserialize with Protobuf#

The shape of a real RPC handler: deserialize a request, do something, serialize a response.

# pip install protobuf
# protoc --python_out=. user.proto  → user_pb2.py
from user_pb2 import User

def encode(u_id: str, email: str, ts: int) -> bytes:
    u = User(id=u_id, email=email, created_unix=ts)
    return u.SerializeToString()  # 20 bytes for the example record

def decode(wire: bytes) -> dict:
    u = User()
    u.ParseFromString(wire)
    return {"id": u.id, "email": u.email, "created_unix": u.created_unix}

// protoc --go_out=. user.proto  → user.pb.go
package main

import (
    "google.golang.org/protobuf/proto"
    pb "example.com/gen/example/v1"
)

func encode(id, email string, ts int32) ([]byte, error) {
    u := &pb.User{Id: id, Email: email, CreatedUnix: ts}
    return proto.Marshal(u)
}

func decode(wire []byte) (*pb.User, error) {
    u := &pb.User{}
    err := proto.Unmarshal(wire, u)
    return u, err
}

// npm install protobufjs
// Or generate stubs with `protoc --js_out=...`
const protobuf = require("protobufjs");

// Parse user.proto once and cache the promise; every call reuses it.
let userType;

function load() {
  if (!userType) {
    userType = protobuf
      .load("user.proto")
      .then((root) => root.lookupType("example.v1.User"));
  }
  return userType;
}

async function encode(id, email, ts) {
  const User = await load();
  // protobufjs exposes created_unix as createdUnix (camelCase) by default.
  const msg = User.create({ id, email, createdUnix: ts });
  return User.encode(msg).finish(); // Uint8Array
}

async function decode(wire) {
  const User = await load();
  return User.decode(wire);
}

Schema evolution#

Each format has its own rules for what you can and cannot change without breaking deployed clients.

Change	Protobuf	Avro	MessagePack
Add an optional field	Safe (assign new field number; old decoders ignore)	Safe (add to schema, provide a default)	Safe (new key in map; old decoders ignore)
Remove a field	Safe if you mark the field number `reserved`	Safe if the reader schema supplies a default for the now-missing field	Safe (the key is just missing)
Rename a field	Safe (the name is metadata; the wire uses the number)	Breaking unless the reader schema lists the old name as an alias (names are not on the wire, but fields are matched by name during schema resolution)	Breaking (the name is on the wire)
Renumber a field	Breaking	N/A (no numbers)	N/A
Change a field’s type	Limited — `int32 → int64` is safe, `string → bytes` is safe; most others are breaking	Limited — only certain promotions are compatible	Safe at the wire level; breaks the consumer’s expectations
Make a required field optional	Safe in proto3 (all fields are implicitly optional)	Safe (add a default)	N/A

Protobuf’s superpower is the field-number stability rule. Decoders are forward and backward compatible by construction as long as you never reuse a field number. Avro’s superpower is reader-writer schema resolution — the same byte stream can be read by a decoder with a slightly different schema, and Avro defines exactly which differences are compatible.
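The forward-compatibility half of that rule comes from one behaviour: a decoder skips any field number it does not recognize. The sketch below is a minimal stand-in for a generated parser, handling only varint and length-delimited wire types; KNOWN_FIELDS and the appended field 4 (an imagined display_name) are assumptions for illustration.

```python
from typing import Tuple

# Field numbers this (older) reader was built with.
KNOWN_FIELDS = {1: "id", 2: "email", 3: "created_unix"}


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one base-128 varint; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_user(buf: bytes) -> dict:
    user = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = KNOWN_FIELDS.get(field_number)
        if name is not None:
            user[name] = value
        # Unknown field numbers fall through: already consumed, simply not stored.
    return user


old_message = bytes.fromhex("0a04755f343212066140622e636f18c0aeddb206")
# A newer writer appends field 4 as a string: tag 0x22, length 3, "Ada".
new_message = old_message + bytes.fromhex("2203416461")

print(decode_user(old_message))
print(decode_user(new_message))
```

Both calls return the same three fields. If field 4 were later deleted and its number handed to a different field, this same skipping logic would stop protecting old readers, which is why numbers are never reused.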

Variants#

Beyond the big three, five formats fill specific niches:

Format	Origin	Niche
Cap’n Proto	Kenton Varda (ex-Protobuf), 2013	Zero-copy decoding for ultra-low-latency RPC.
FlatBuffers	Google, 2014	Zero-copy for games/mobile. Used by Android Wear, Cocos2d.
BSON	MongoDB, 2009	Binary JSON with extra types (Date, ObjectId). Internal to MongoDB.
CBOR	RFC 7049 (2013), revised as RFC 8949 (2020)	“Binary JSON” with IETF backing. Used in COSE, CoAP, WebAuthn.
Thrift	Facebook, 2007	Schema-first like Protobuf, paired with the Thrift RPC framework. Largely superseded by gRPC + Protobuf.

Trade-offs#

What binary formats buy you:

2-5x smaller payloads on typical records (more on records with many small fields).
3-10x faster parsing in most languages. Protobuf and FlatBuffers usually benchmark faster than encoding/json in Go and JSON.parse in V8; SIMD-accelerated parsers such as simdjson in C++ narrow the gap.
Compile-time schema validation (Protobuf, Thrift, FlatBuffers, Cap’n Proto). A typo in a field name is a build error, not a runtime KeyError.
Polyglot code generation. One .proto file, stubs in twelve languages.
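As a rough check on the size claim, the snippet below compares compact JSON with the binary encodings of the article's three-field user record. The hex constants are hand-derived for that record under the schemas shown earlier and should be read as illustrative; a single tiny record says little about real workloads, and the Avro figure excludes the schema or schema ID that must travel separately.

```python
import json

record = {"id": "u_42", "email": "a@b.co", "created_unix": 1717000000}

encodings = {
    "JSON (compact)": json.dumps(record, separators=(",", ":")).encode("utf-8"),
    "MessagePack": bytes.fromhex(
        "83a26964a4755f3432a5656d61696ca66140622e636f"
        "ac637265617465645f756e6978ce66575740"
    ),
    "Protobuf": bytes.fromhex("0a04755f343212066140622e636f18c0aeddb206"),
    "Avro (body only)": bytes.fromhex("08755f34320c6140622e636f80ddbae50c"),
}

json_size = len(encodings["JSON (compact)"])
for name, wire in encodings.items():
    ratio = json_size / len(wire)
    print(f"{name:18} {len(wire):3d} bytes  {ratio:.1f}x smaller than JSON")
```

The schema-less format saves the least because it still carries the key strings; the schema-based formats drop them entirely.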

What they cost:

Opacity on the wire. Debugging requires protoc --decode_raw or a Wireshark plugin. The on-call experience is worse.
Schema-registry infrastructure. Avro effectively requires a schema registry in production (Confluent’s is standard). Protobuf doesn’t require one, but a .proto repository is its informal equivalent.
Tooling weight. A new language must add Protobuf support before it can join the mesh. JSON works in every language for free.
Evolution discipline. Field-number stability rules require team discipline and code review. A .proto file that lets anyone delete a field without a reserved statement invites a silent data-corruption bug the first time that number is reused.

Common pitfalls#

Reusing a Protobuf field number. Field 5 used to be legacy_email; now you’ve assigned it to flags. Old messages on disk become catastrophically wrong. Always use reserved when deleting fields: reserved 5; reserved "legacy_email";.
Treating Protobuf scalar defaults as semantically meaningful. In proto3, a missing int32 and an int32 set to 0 are indistinguishable on the wire. If you need “explicitly zero” vs “absent”, mark the field optional (enabled by default since protobuf release 3.15) or use a wrapper message.
Storing Avro without the schema. Object container files include the schema; raw Avro frames do not. If you store raw frames without registering the schema, the data is unrecoverable.
Versioning the .proto file like code. A breaking change to a Protobuf message can sit in production for months before a client finally reads bytes written under the old schema. Treat .proto files like database migrations: additive only, with explicit deprecation timelines.
MessagePack for high-volume small records. Field names are on the wire, so a million tiny records that each carry a 30-character key burn about 30 MB on key strings alone. Protobuf replaces each key with a one- or two-byte tag, cutting that overhead by roughly 15-30x.
Comparing benchmarks without schemas. “JSON is 3x slower than Protobuf” depends on the message shape. Tiny messages with one or two fields: JSON and Protobuf are within ~1.5x. Large nested messages with repeated fields: the gap widens to ~10x.
Forgetting endianness. Most binary formats specify byte order explicitly. Custom binary formats that don’t are landmines waiting for the first big-endian peer.
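The scalar-defaults pitfall in the list above is easiest to see at the byte level. This sketch contrasts the two presence modes for a single integer field; the helper names are made up, only non-negative values are handled, and real code would rely on the generated stubs.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Base-128 varint for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_int_implicit(field_number: int, value: int) -> bytes:
    """proto3 implicit presence: a field holding the default (0) is not written."""
    if value == 0:
        return b""
    return encode_varint(field_number << 3) + encode_varint(value)


def encode_int_explicit(field_number: int, value: Optional[int]) -> bytes:
    """Explicit presence (`optional`): written whenever it was set, even to 0."""
    if value is None:
        return b""
    return encode_varint(field_number << 3) + encode_varint(value)


print(repr(encode_int_implicit(3, 0)))     # nothing on the wire
print(repr(encode_int_explicit(3, 0)))     # tag plus a zero byte
print(repr(encode_int_explicit(3, None)))  # nothing on the wire
```

With implicit presence the first and third cases produce identical bytes, so a reader cannot tell "set to zero" from "never set"; explicit presence spends two bytes to keep the distinction.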

Data Representation and Efficient Communication — the textual-vs-binary trade-off these formats sit on the binary side of.
Textual Data Formats — JSON, XML, YAML — the other side of the trade-off.
gRPC — Protobuf over HTTP/2 — gRPC is Protobuf’s natural RPC framework.
Remote Procedure Calls (RPCs) — the wider RPC family that binary formats earn their keep inside.
REST vs GraphQL vs gRPC — Comparison — where Protobuf-over-gRPC slots into the REST/GraphQL/gRPC comparison.