Textual Data Formats — JSON, XML, YAML
The trio every API designer must read fluently. Why JSON won, where XML still lives, what YAML costs in subtle bugs.
What it is
The three textual wire formats every API designer reads fluently are JSON, XML, and YAML. All three are human-readable. All three carry their schema in the bytes (field names appear on every message). All three have been used to ship public APIs at scale. They differ in what they make easy, what they make hard, and what they make dangerous.
- JSON — JavaScript Object Notation, standardised as RFC 8259. Born inside JavaScript runtimes around 2001, popularised by Douglas Crockford, and the lingua franca of the public-facing web by 2010. Four primitives (string, number, boolean, null), two collections (object, array), and a grammar that fits on a postcard.
- XML — eXtensible Markup Language, a W3C recommendation from 1998. Tags, attributes, namespaces, schemas (XSD), transformations (XSLT), queries (XPath), validation. A complete typed-document ecosystem; the heaviest of the three by far.
- YAML — YAML Ain’t Markup Language, version 1.2 since 2009. A superset of JSON with whitespace-significant syntax, anchors and aliases, multi-document streams, and a famously rich set of implicit type coercions. Dominates configuration files (Kubernetes, GitHub Actions, Docker Compose); not used as an RPC wire format.
The three formats compete on the same axes: size, parse speed, schema sophistication, and how surprising the parser is. JSON optimises for size and parser predictability. XML optimises for schema sophistication. YAML optimises for hand-authoring. Each trade-off has a cost; the cost of YAML’s hand-authoring optimisation is what gets it banned from many production payloads.
When to use it
- Reach for JSON for public REST and HTTP APIs, browser-facing endpoints, mobile API responses, logging, and any payload a developer will read in a terminal. JSON is the unambiguous default for new external APIs in 2026.
- Reach for XML when integrating with legacy enterprise systems (SOAP, financial-services APIs, healthcare HL7, government data exchanges), when you genuinely need namespaces (multiple vocabularies mixed in one document), or when the toolchain is XSD-validated end-to-end. Don’t ship a new public REST API in XML in 2026 unless a partner mandates it.
- Reach for YAML for configuration files humans will edit by hand — Kubernetes manifests, CI pipelines, infrastructure-as-code. Avoid YAML as an API wire format; the implicit-typing gotchas are too easily weaponised.
A useful rule: JSON for machine-to-machine, YAML for human-to-machine, XML when there’s no choice.
How it works
The same logical record — an order with two line items — in all three formats:
```json
{
  "order_id": "ord_a3f9c2",
  "status": "confirmed",
  "country": "NO",
  "amount": { "currency": "USD", "value_minor": 4999 },
  "items": [
    { "sku": "SKU-X42", "qty": 1 },
    { "sku": "SKU-Y19", "qty": 2 }
  ],
  "created_at": "2026-05-30T08:14:23Z"
}
```

```xml
<order>
  <order_id>ord_a3f9c2</order_id>
  <status>confirmed</status>
  <country>NO</country>
  <amount currency="USD" value_minor="4999"/>
  <items>
    <item sku="SKU-X42" qty="1"/>
    <item sku="SKU-Y19" qty="2"/>
  </items>
  <created_at>2026-05-30T08:14:23Z</created_at>
</order>
```

```yaml
order_id: ord_a3f9c2
status: confirmed
country: NO          # <-- danger: implicit-cast to false in YAML 1.1
amount:
  currency: USD
  value_minor: 4999
items:
  - sku: SKU-X42
    qty: 1
  - sku: SKU-Y19
    qty: 2
created_at: 2026-05-30T08:14:23Z
```

All three convey the same data. The XML version is ~30% larger by character count; the YAML version is ~10% smaller. JSON sits in the middle and typically parses fastest in mainstream runtimes.
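As a rough check that the JSON and XML documents above carry the same record, the sketch below parses both with Python's standard library and compares the results. The helper name `order_from_xml` and the attribute-to-field mapping are assumptions made for this illustration; XML has no built-in number type, so `qty` and `value_minor` are converted by hand.

```python
import json
import xml.etree.ElementTree as ET

JSON_DOC = """{"order_id": "ord_a3f9c2", "status": "confirmed", "country": "NO",
 "amount": {"currency": "USD", "value_minor": 4999},
 "items": [{"sku": "SKU-X42", "qty": 1}, {"sku": "SKU-Y19", "qty": 2}],
 "created_at": "2026-05-30T08:14:23Z"}"""

XML_DOC = """<order>
  <order_id>ord_a3f9c2</order_id>
  <status>confirmed</status>
  <country>NO</country>
  <amount currency="USD" value_minor="4999"/>
  <items>
    <item sku="SKU-X42" qty="1"/>
    <item sku="SKU-Y19" qty="2"/>
  </items>
  <created_at>2026-05-30T08:14:23Z</created_at>
</order>"""


def order_from_xml(text):
    """Map the XML order onto the same dict shape the JSON parser yields."""
    root = ET.fromstring(text)
    amount = root.find("amount")
    return {
        "order_id": root.findtext("order_id"),
        "status": root.findtext("status"),
        "country": root.findtext("country"),
        "amount": {
            "currency": amount.get("currency"),
            # XML attributes are always strings; calling int() is our choice.
            "value_minor": int(amount.get("value_minor")),
        },
        "items": [
            {"sku": item.get("sku"), "qty": int(item.get("qty"))}
            for item in root.iter("item")
        ],
        "created_at": root.findtext("created_at"),
    }


from_json = json.loads(JSON_DOC)
from_xml = order_from_xml(XML_DOC)
print(from_json == from_xml)
```

The JSON side needs no mapping code because its types line up with the host language's; the XML side needs a schema or hand-written conversion to recover the integers.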
Round-tripping in three languages
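Reading and writing JSON in three languages APIs are commonly consumed from: Python, Go, and JavaScript. The snippets are deliberately minimal: only the Go version propagates decode errors, and none of them performs network I/O, so timeouts are out of scope here.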
```python
import json
from datetime import datetime, timezone

# Decode
raw = '{"order_id":"ord_a3f9c2","amount":{"value_minor":4999}}'
obj = json.loads(raw)
print(obj["amount"]["value_minor"])  # 4999

# Encode
out = json.dumps({
    "order_id": "ord_a3f9c2",
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
    "created_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "items": [{"sku": "SKU-X42", "qty": 1}],
}, separators=(",", ":"))  # compact for the wire
```

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type Amount struct {
	Currency   string `json:"currency"`
	ValueMinor int64  `json:"value_minor"`
}

type Order struct {
	OrderID   string    `json:"order_id"`
	Amount    Amount    `json:"amount"`
	CreatedAt time.Time `json:"created_at"`
}

// roundtrip decodes raw into an Order and re-encodes it.
func roundtrip(raw []byte) ([]byte, error) {
	var o Order
	if err := json.Unmarshal(raw, &o); err != nil {
		return nil, err
	}
	return json.Marshal(o)
}

func main() {
	out, err := roundtrip([]byte(`{"order_id":"ord_a3f9c2","amount":{"currency":"USD","value_minor":4999},"created_at":"2026-05-30T08:14:23Z"}`))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

```javascript
function roundtrip(raw) {
  const obj = JSON.parse(raw);
  // Precision warning: integers above 2^53 - 1 lose precision through JSON.parse.
  // Twitter's API ships 64-bit IDs as strings (id_str) for this reason.
  return JSON.stringify(obj);
}

const out = JSON.stringify({
  order_id: "ord_a3f9c2",
  created_at: new Date().toISOString(),
  items: [{ sku: "SKU-X42", qty: 1 }],
});
```

JSON: the format that won
JSON’s win is structural, not aesthetic. It is JavaScript-native (every browser parses it without a library), minimal (the grammar fits on one page), and agnostic to namespace problems (no <foo:bar xmlns:foo="..."> ceremony). It pays for these wins with three real costs:
- No native date type. ISO-8601 strings are the convention; the parser will not enforce them.
- No native binary type. Base64-encoded strings are the convention; payloads grow ~33%.
- Number precision is implementation-defined: JavaScript clients silently round integers above 2^53 - 1 to the nearest representable double.
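The three costs can be observed with nothing but Python's standard library. This is a minimal sketch: the 300-byte blob is arbitrary, and `parse_int=float` is used here only to imitate a client that stores every number as an IEEE-754 double, which is how JavaScript's `JSON.parse` behaves.

```python
import base64
import json
from datetime import datetime

# 1. Dates: the parser hands back a plain string; the caller must convert.
doc = json.loads('{"created_at": "2026-05-30T08:14:23Z"}')
created_at = datetime.fromisoformat(doc["created_at"].replace("Z", "+00:00"))

# 2. Binary: base64 turns every 3 bytes into 4 characters.
blob = bytes(range(256)) + bytes(44)  # 300 arbitrary bytes
encoded = base64.b64encode(blob).decode("ascii")

# 3. Precision: decode the same integer exactly, then as a double.
raw = '{"id": 9007199254740993}'
exact = json.loads(raw)["id"]
as_double = int(json.loads(raw, parse_int=float)["id"])

print(type(doc["created_at"]).__name__, created_at.year)
print(len(blob), len(encoded))
print(exact, as_double)
```

Python keeps the integer exact because its `int` is arbitrary-precision; the double-based decode lands on the neighbouring even value, which is the same drift a browser client would see.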
JSON Schema (a separate spec, not part of RFC 8259) adds optional validation. OpenAPI 3.0 uses a modified subset of JSON Schema to describe API payloads at the contract level; OpenAPI 3.1 aligns with the JSON Schema 2020-12 dialect.
XML: still standard in enterprise
XML’s superpowers are namespaces (mix two vocabularies in one document without collision), XSD (a typed schema language richer than JSON Schema), and XSLT/XPath (transformations and queries native to the format). Where these matter — banking (FpML, ISO 20022), healthcare (HL7 v3), government data exchange, SOAP services — XML is still the right answer. Where they don’t matter, XML costs verbosity, parser surface area (XXE attacks, billion-laughs DoS), and developer enthusiasm.
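To make the namespace point concrete, here is a small sketch using Python's `xml.etree.ElementTree`. Two vocabularies both define an `id` element, and the namespace prefixes keep them apart. The namespace URIs and the element values are invented for this example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies define an <id> element; namespaces prevent a collision.
# Both namespace URIs are made up for this illustration.
DOC = """<order xmlns:pay="https://example.com/ns/payment"
       xmlns:ship="https://example.com/ns/shipping">
  <pay:id>pay_771</pay:id>
  <ship:id>shp_052</ship:id>
</order>"""

# Prefixes used in queries are local to this mapping; only the URIs matter.
NS = {
    "pay": "https://example.com/ns/payment",
    "ship": "https://example.com/ns/shipping",
}

root = ET.fromstring(DOC)
payment_id = root.findtext("pay:id", namespaces=NS)
shipping_id = root.findtext("ship:id", namespaces=NS)
print(payment_id, shipping_id)
```

JSON has no equivalent mechanism; the usual workaround is to nest each vocabulary under its own key or to prefix field names by convention.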
A SOAP envelope is XML by definition. If your partner says “we expose a SOAP API,” you are shipping XML.
YAML: configuration’s lingua franca, and its footguns
YAML 1.2 specifies the format reasonably tightly. YAML 1.1, which PyYAML and many other widely deployed parsers still implement, has a famous set of implicit type coercions that cause real production bugs:
- The Norway problem. `country: NO` parses to false in YAML 1.1 (because `NO` is a boolean). Norway’s country code is `NO`. Yes, this has shipped to production.
- `version: 1.0` parses to the float `1.0`, not the string `"1.0"`. `version: 1.10` may parse as `1.1`.
- `time: 22:30` parses to a sexagesimal integer (22*60 + 30 = 1350).
- `on`, `off`, `yes`, `no`, `y`, `n` all parse as booleans under the 1.1 type definitions (some parsers skip the single-letter forms).
The fix is to always quote string-typed values, or to use a parser that follows YAML 1.2 or a restricted schema (ruamel.yaml defaults to YAML 1.2; js-yaml can be loaded with JSON_SCHEMA). Note that PyYAML’s safe_load only blocks arbitrary object construction; it still applies YAML 1.1 implicit typing, so `NO` still becomes false. Whitespace-significance is a second class of footgun: a tab character where YAML expected spaces is a parse error with a not-always-helpful message.
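The coercions are easier to reason about once the resolution rules are written down. The sketch below is a toy model of YAML 1.1 implicit typing for a single scalar, not a YAML parser and not any library's API: the function names and the simplified patterns are assumptions for illustration, and a real parser recognises more forms (hex, octal, timestamps, null) and is stricter about letter case.

```python
import re

# Simplified patterns modelled on the YAML 1.1 type rules (illustrative only).
BOOL_TRUE = {"y", "yes", "on", "true"}
BOOL_FALSE = {"n", "no", "off", "false"}
SEXAGESIMAL = re.compile(r"^[1-9][0-9]*(:[0-5]?[0-9])+$")
FLOAT = re.compile(r"^[-+]?[0-9]+\.[0-9]*$")
INT = re.compile(r"^[-+]?[0-9]+$")


def resolve_plain_scalar(text):
    """Toy version of YAML 1.1 implicit typing for an unquoted scalar."""
    if text.lower() in BOOL_TRUE:
        return True
    if text.lower() in BOOL_FALSE:
        return False
    if SEXAGESIMAL.match(text):
        value = 0
        for part in text.split(":"):
            value = value * 60 + int(part)  # base-60 digits
        return value
    if FLOAT.match(text):
        return float(text)
    if INT.match(text):
        return int(text)
    return text


def resolve(text):
    """Quoted scalars stay strings; plain ones go through implicit typing."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        return text[1:-1]
    return resolve_plain_scalar(text)


for scalar in ["NO", '"NO"', "1.10", '"1.10"', "22:30", "ord_a3f9c2"]:
    print(repr(scalar), "->", repr(resolve(scalar)))
```

The quoted branch is the whole fix: once a value is quoted, none of the implicit rules apply, which is why quoting string-typed values works regardless of parser version.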
Variants
Three families worth knowing about: JSON dialects, XML-based formats, and TOML.
| Variant | Lineage | When it fits |
|---|---|---|
| JSON5 | Superset of JSON allowing comments, trailing commas, unquoted keys. | Hand-authored JSON config. Not for wire payloads. |
| JSONC | JSON with comments. VS Code’s settings.json. | Editor config. |
| NDJSON / JSON Lines | One JSON object per line, no enclosing array. | Streaming, log shipping. Splunk, Elasticsearch bulk API. |
| SOAP | XML wrapped in a SOAP envelope, schema via WSDL. | Enterprise, legacy partner integrations. |
| XHTML | XML-strict subset of HTML. | Largely dead; HTML5 superseded it. |
| TOML | Hand-authored config alternative to YAML; less ambiguous. | Rust’s Cargo, Python’s pyproject. |
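Of the variants in the table, NDJSON is the one most likely to show up on the wire. A minimal sketch of writing and reading it with Python's standard library follows; the function names `write_ndjson` and `read_ndjson` are hypothetical helpers, and an in-memory `StringIO` stands in for a file or socket.

```python
import io
import json


def write_ndjson(records, stream):
    """Write one compact JSON object per line."""
    for record in records:
        # json.dumps escapes newlines inside strings, so each record
        # is guaranteed to occupy exactly one line.
        stream.write(json.dumps(record, separators=(",", ":")) + "\n")


def read_ndjson(stream):
    """Yield records one at a time without loading the whole stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


buffer = io.StringIO()
write_ndjson([{"sku": "SKU-X42", "qty": 1}, {"sku": "SKU-Y19", "qty": 2}], buffer)
buffer.seek(0)
records = list(read_ndjson(buffer))
print(records)
```

Because each line is a complete document, a consumer can process the first record before the producer has finished writing the last one.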
Trade-offs
What you get from each:
- JSON — universal tooling, fast parsing, terse, supported by every language and every HTTP client. Cost: no native dates or binary, number-precision surprises.
- XML — rich schema (XSD), namespaces, mature query and transformation languages (XPath, XSLT). Cost: verbose (often 1.5-2x JSON size for element-heavy documents), parser-attack surface (XXE), developer hostility, mostly absent from greenfield design.
- YAML — hand-friendly indentation, comments, anchors and aliases for DRY config, multi-document streams. Cost: implicit-typing footguns, parser-divergence between YAML 1.1 and 1.2, whitespace sensitivity, ambiguous error messages.
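The cross-cutting cost of all three: textual payloads are typically several times larger and slower to parse than equivalent Protobuf. Figures of roughly 3-10x in size and 5-20x in parse time are commonly quoted, and they vary with payload shape and library. At small scale, invisible; at large scale, the reason internal services move off textual formats.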
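Where the size gap comes from can be sketched without a Protobuf toolchain. The example below is not Protobuf: it uses Python's `struct` module with a hand-picked fixed layout as a stand-in for any schema-driven binary encoding, and the record and field widths are assumptions chosen for illustration.

```python
import json
import struct

record = {"order_id": "ord_a3f9c2", "value_minor": 4999, "qty": 2}

# Textual: field names and punctuation travel with every message.
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Binary stand-in: 10-byte ID, unsigned 32-bit amount, unsigned 8-bit qty.
# The field names live in the shared layout, not in the bytes.
LAYOUT = struct.Struct("<10sIB")
as_binary = LAYOUT.pack(
    record["order_id"].encode("ascii"),
    record["value_minor"],
    record["qty"],
)

print(len(as_json), len(as_binary))
print(LAYOUT.unpack(as_binary))
```

Most of the textual bytes are keys, quotes, and separators; the binary form drops them because both sides already agree on the layout, which is also why it cannot be read without that agreement.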
Common pitfalls
- Trusting the parser with money. `9007199254740993` round-trips through Python intact; through JavaScript it becomes `9007199254740992`. Ship 64-bit IDs and currency-minor values as JSON strings if a JS client will touch them.
- Forgetting JSON has no comments. Putting `// ...` in a JSON payload is a parse error in every conformant parser. Use JSON5 or YAML if you need comments, at the cost of no longer shipping standard JSON.
- Letting YAML 1.1 implicit casts reach production. `country: NO`, `version: 1.0`, `enabled: on` are bugs in waiting. Quote strings, or use a YAML 1.2 parser with the core schema (which still reads `1.0` as a float, so quote version strings regardless).
- Parsing XML without disabling external entities. XXE (XML External Entity) was its own category in the OWASP Top 10 2017 and is folded into Security Misconfiguration in the 2021 list. Many XML parsers, particularly older Java and libxml2-based stacks, have shipped with external-entity resolution enabled by default for backwards compatibility; check your parser's defaults and turn it off explicitly.
- Streaming JSON arrays. `[{...}, {...}, ...]` forces standard one-shot parsers such as `JSON.parse` and `json.loads` to read the whole array before returning anything. Use NDJSON (one object per line) for streaming workloads.
- Mixing `null` and missing keys in JSON without deciding a convention. `{"x": null}` and `{}` are different; the contract should say which one your API uses for “unset”.
- Schema drift between language bindings. Go’s `encoding/json` matches field names case-insensitively by default when decoding; Python’s `json` is case-sensitive. Document the exact field names and decide whether unknown fields are ignored or rejected.
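The `null` versus missing-key distinction is easy to lose in client code, because a plain dictionary lookup with a default collapses both cases into one. A minimal sketch of keeping them apart follows; the helper `describe_field` and the three labels it returns are hypothetical, chosen only to make the cases visible.

```python
import json

_MISSING = object()  # sentinel: distinguishes "absent" from an explicit null


def describe_field(raw, key):
    """Report whether key is absent, explicitly null, or set to a value."""
    value = json.loads(raw).get(key, _MISSING)
    if value is _MISSING:
        return "missing"
    if value is None:
        return "null"
    return "set"


print(describe_field('{"x": null}', "x"))
print(describe_field('{}', "x"))
print(describe_field('{"x": 0}', "x"))
```

Whichever convention the contract picks, a falsy value such as `0` or `""` should still count as set, which is why the checks use identity rather than truthiness.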
Related building blocks
- Data Representation and Efficient Communication — the textual-vs-binary trade-off these three formats sit inside.
- Binary Data Formats — Protobuf, MessagePack, Avro — the other side of the trade-off.
- HTTP — The Foundational Protocol for APIs — `Content-Type: application/json` and friends.
- REST — The Architectural Style — REST and JSON grew up together; the relationship is not accidental.
- RESTful API Design in Practice — practical patterns for JSON-shaped REST APIs.