Text-to-Text Generation Systems — Gen AI

Use cases#

Text-to-text systems are the most common LLM application by a wide margin. Input is text; output is text; the model rewrites, condenses, translates, or restructures. Everything from a TL;DR of a Slack thread to translating a legal contract to extracting fields from a PDF falls in this category.

The shapes that recur:

Summarization. Long input, short faithful output. Meeting transcripts, articles, support tickets, log dumps. Variants: extractive (pick spans), abstractive (rewrite), bullet-form, structured (TL;DR + key points + action items).
Translation. Source language to target language, ideally preserving register, idiom, and formatting. Modern LLMs are competitive with dedicated NMT systems on high-resource pairs; on low-resource pairs results are mixed and depend on how much of the language the model saw in training, so measure before switching.
Rewriting and tone transfer. Same content, different voice — formalize an email, simplify a legal clause, translate jargon to plain language, adjust reading level, gender-neutralize.
Structured extraction. Free-form input to typed output — extract {name, email, intent, urgency} from a support email; pull line items from a receipt; turn a press release into a database row.
Code-to-code and text-to-code. Refactor, document, port between languages, generate from spec. Borderline application — strict-syntax outputs benefit more from grammar-constrained decoding than open prose does.
Classification and routing. Sentiment, topic, language detection, spam — historically dedicated models, now often a thin LLM prompt against a frontier model for the long tail.

Text-to-text shines when the value is in semantic transformation and the volume is moderate. It struggles when latency budgets are sub-100 ms, when the output must be perfectly faithful (legal contracts, medical records), or when a deterministic regex-and-template solution would suffice.

System overview#

A production text-to-text system is a thin pipeline by agent standards but has real moving parts:

[Input text]
    │
    ▼
[Preprocessing]
    - normalize whitespace, encoding
    - language / format detection
    - length check
    │
    ▼
[Routing]
    - which model: frontier vs. fast vs. open-weights
    - which prompt template
    - batched vs. streaming
    │
    ▼
[Long-input handling]
    - fits in context: pass through
    - exceeds context: chunk + map-reduce
    - very long: hierarchical summarization
    │
    ▼
[LLM call]
    - retries, timeouts, fallback model
    - prompt cache hit on static prefix
    │
    ▼
[Postprocessing]
    - parse / validate structure
    - faithfulness check (optional)
    - format conversion (markdown, JSON, plaintext)
    │
    ▼
[Response]

For volumes past a few thousand calls per day, batching, caching, and routing dominate the engineering work — not the prompt itself.

Key components#

Model selection#

The single biggest cost lever. A frontier model can cost anywhere from roughly 5 to 50 times more per token than a fast model from the same provider, depending on which tiers you compare. The pragmatic taxonomy:

Frontier models (Claude Opus / Sonnet 4.x, GPT-4.x, Gemini 2.x Pro) — top quality, top cost, best on reasoning-heavy summarization and nuanced rewriting.
Fast / mid models (Claude Haiku, GPT-4.x mini, Gemini Flash) — 5 to 10 times cheaper, 2 to 5 times faster, near-frontier quality on routine summarization, translation, and extraction.
Open-weights (Llama 3.x, Qwen 2.x, Mistral, DeepSeek) — self-hosted or hosted on third-party inference. Best when data residency matters, when volume justifies the operational cost, or when a small fine-tuned model beats a large general one.

Many production systems route per-request: easy cases go to the fast model, hard cases (long input, low confidence, named entities) escalate to the frontier model. A classifier or a heuristic decides; the escalation rate is typically 5 to 20%.
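The heuristic variant of per-request routing can be sketched as follows. This is a minimal illustration, not a recommended policy: the thresholds, the capitalization-based entity proxy, and the names `choose_model` and `blended_cost` are all assumptions introduced here.

```python
from typing import Optional

# Hypothetical thresholds for illustration; tune them against your own eval set.
MAX_FAST_CHARS = 20_000
MIN_CONFIDENCE = 0.7
ENTITY_RATIO = 0.15


def looks_entity_heavy(text: str) -> bool:
    """Crude named-entity proxy: share of capitalized words, skipping the first word."""
    words = text.split()
    if len(words) < 2:
        return False
    capitalized = sum(1 for w in words[1:] if w[0].isupper())
    return capitalized / len(words) > ENTITY_RATIO


def choose_model(text: str, confidence: Optional[float] = None) -> str:
    """Return "fast" or "frontier" based on length, confidence, and entity density."""
    if len(text) > MAX_FAST_CHARS:
        return "frontier"
    if confidence is not None and confidence < MIN_CONFIDENCE:
        return "frontier"
    if looks_entity_heavy(text):
        return "frontier"
    return "fast"


def blended_cost(fast_cost: float, frontier_cost: float, escalation_rate: float) -> float:
    """Expected cost per request when a fraction of traffic escalates."""
    return (1 - escalation_rate) * fast_cost + escalation_rate * frontier_cost
```

With a 10% escalation rate and a frontier model 20 times the price of the fast one, `blended_cost(1.0, 20.0, 0.1)` comes to about 2.9 units per request, versus 20 if everything went to the frontier model.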

Long-input strategies#

When input exceeds the model’s context window, three strategies dominate:

Truncation. Drop the tail. Cheap, lossy, often acceptable for one-pass summarization of moderately long input.
Map-reduce. Split into chunks, summarize each chunk, then summarize the summaries. Two-pass; quality is good for “what’s the gist” but loses cross-chunk references.
Hierarchical summarization. Recursive map-reduce — summaries of summaries of summaries. For book-length input. Quality varies with chunking strategy and overlap; expensive in tokens but parallelizable.

Modern long-context models (1M+ tokens) make many of these unnecessary for moderate input. But long context is not free: prefill cost grows at least linearly with input length (billing is per token, and attention compute grows faster than that), and recall degrades for content in the middle of very long prompts (“lost in the middle”). For inputs over 100K tokens, hybrid retrieval-then-summarize often beats raw long-context.
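One way to choose among pass-through, map-reduce, and hierarchical summarization is to check whether the per-chunk summaries would themselves fit in a single call. The sketch below assumes a rough 4-characters-per-token estimate and hypothetical parameter names; a real system would use the provider's tokenizer and reserve room for the prompt and output.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough count assuming about 4 characters per token (a rule of thumb, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def pick_strategy(n_tokens: int, context_window: int,
                  chunk_tokens: int, summary_tokens: int) -> str:
    """Pick a long-input strategy from token counts alone."""
    if n_tokens <= context_window:
        return "pass_through"
    n_chunks = math.ceil(n_tokens / chunk_tokens)
    # If all per-chunk summaries fit in one call, two passes are enough.
    if n_chunks * summary_tokens <= context_window:
        return "map_reduce"
    # Otherwise the summaries must themselves be summarized recursively.
    return "hierarchical"
```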

Prompt templates#

Each task type has its own template, version-controlled and tested. A summarization template typically specifies:

The role and target audience (technical, executive, customer).
The output shape (paragraph, bullets, headed sections).
The length budget (in words, sentences, or bullets).
The fidelity rule (“do not introduce facts not in the source”).
A few-shot example or two if the format is unusual.

The translation template adds the target language, the desired register (formal / informal), and the handling of named entities, units, and idioms.

The extraction template includes the JSON schema as an example (not just prose), error handling for missing fields, and rules for ambiguous values.
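A summarization template covering the elements listed above might look like the sketch below. The wording of the template, the name `SUMMARY_TEMPLATE_V1`, and the default values are illustrative assumptions; the point is that the template is a versioned artifact with named slots, and that the variable source text comes last.

```python
from string import Template

# Versioned template; the static instructions come first, the variable source last.
SUMMARY_TEMPLATE_V1 = Template(
    "You are summarizing for a $audience reader.\n"
    "Output format: $shape.\n"
    "Length budget: $budget.\n"
    "Do not introduce facts that are not in the source.\n"
    "\n"
    "Source:\n"
    "$source"
)


def render_summary_prompt(source: str, audience: str = "technical",
                          shape: str = "bullet list",
                          budget: str = "at most 6 bullets") -> str:
    """Fill the template; substitute() raises KeyError if a slot is left unfilled."""
    return SUMMARY_TEMPLATE_V1.substitute(
        audience=audience, shape=shape, budget=budget, source=source
    )
```

Because substituted values are not re-scanned for placeholders, a source containing a literal `$` (prices, shell snippets) passes through unchanged.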

Streaming and time-to-first-token#

User-facing text generation is judged by perceived latency, not throughput. Two metrics matter:

Time-to-first-token (TTFT). From request to first byte streamed. Dominated by prefill cost (proportional to input length).
Inter-token latency (ITL). Time between subsequent tokens. Dominated by decode speed (per-token forward pass).

Streaming hides decode latency: the user starts reading at TTFT and the rest renders progressively. A reader takes in roughly 4 tokens per second, so any ITL below about 250 ms per token keeps the output ahead of the reader and feels instant. TTFT under 500 ms feels snappy; over 2 seconds feels broken.

For batch jobs, streaming is irrelevant — throughput per dollar is what matters.
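Both metrics can be measured from any token iterator by timestamping arrivals. The sketch below is provider-agnostic: `measure_stream` is a hypothetical helper, and the injectable `clock` parameter exists only so the function can be tested without real delays.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.monotonic) -> Tuple[float, float]:
    """Consume a token stream and return (TTFT, mean ITL) in seconds."""
    start = clock()  # taken before the first token is requested
    arrivals = []
    for _ in tokens:
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no tokens")
    ttft = arrivals[0] - start
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl
```

In practice `tokens` would be the streaming response from whichever client library is in use; passing a lazy generator matters, since the request should start inside the timed region.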

Caching#

Three places caching helps:

Prompt prefix cache. Provider-native. Static system prompt + few-shot block gets processed once and reused across requests. Cached prefix tokens are billed at a discount that varies by provider, on the order of 5 to 10 times cheaper than uncached prefill. Works only if the prefix is verbatim identical and the variable content is at the end.
Full-result cache. For identical input + prompt + model, return the cached output. Works for high-repeat workloads (the same FAQ asked over and over).
Embedding-keyed cache. For semantically similar input, return a cached output. Risky: small input changes can require different outputs. Used cautiously for autocomplete and suggestion features.
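The full-result cache is the simplest of the three to sketch. The version below is an in-memory illustration under assumed names (`cache_key`, `ResultCache`); a production system would more likely use a shared store with expiry. The key covers model, prompt version, and input, so changing any of them is a cache miss.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt_version: str, text: str) -> str:
    """Stable key over everything that determines the output."""
    payload = json.dumps(
        {"model": model, "prompt": prompt_version, "input": text}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory full-result cache; compute() runs only on a miss."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_compute(self, model: str, prompt_version: str, text: str,
                       compute: Callable[[str], str]) -> str:
        key = cache_key(model, prompt_version, text)
        if key not in self._store:
            self._store[key] = compute(text)
        return self._store[key]
```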

Implementation patterns#

Map-reduce summarization with overlap#

For a long document split into chunks C_1, C_2, ..., C_n, produce per-chunk summaries S_i, then a final summary over [S_1, ..., S_n]. Overlap chunks by 10 to 20% so cross-boundary information isn’t lost. The final summary prompt should be given the same audience and length constraints as the per-chunk ones plus a “merge and deduplicate” instruction.
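The chunk-and-merge flow can be sketched as below. Chunking is by whitespace-separated words for simplicity (a real system would chunk by tokens or by document structure), and `summarize` is a hypothetical callable standing in for the LLM call; the default of 120 words of overlap on 800-word chunks is 15%, inside the range given above.

```python
from typing import Callable, List


def chunk_with_overlap(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split words into chunks where each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end
    return chunks


def map_reduce_summarize(text: str, summarize: Callable[[str, str], str],
                         chunk_size: int = 800, overlap: int = 120) -> str:
    """summarize(text, instruction) is an assumed wrapper around the model call."""
    words = text.split()
    if not words:
        return ""
    chunks = chunk_with_overlap(words, chunk_size, overlap)
    partials = [summarize(" ".join(c), "Summarize this section.") for c in chunks]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials),
                     "Merge and deduplicate these section summaries.")
```

The map step has no dependencies between chunks, so the list comprehension can be replaced with concurrent calls when latency matters.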

Structured extraction with grammar-constrained decoding#

For high-volume extraction (millions of documents per day) where the schema is fixed, grammar-constrained decoding is the right tool; see Prompt Engineering for the spectrum from “ask for JSON” to constrained decoding. The win is that you eliminate the retry loop for malformed output: every completed output is well-formed by construction. Field values can still be wrong, so semantic checks remain necessary.
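For contrast, this is roughly what the unconstrained baseline looks like: parse, validate against the schema, and retry on failure. It is the loop that constrained decoding removes, shown here only to make the cost concrete. The field set mirrors the support-email example from earlier in the document; `call_model` is a hypothetical callable returning the raw model output.

```python
import json
from typing import Callable, Optional

# Illustrative schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "email": str, "intent": str, "urgency": str}


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with all required fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data


def extract_with_retries(email_text: str, call_model: Callable[[str], str],
                         max_attempts: int = 3) -> dict:
    """Each failed attempt costs a full extra model call."""
    for _ in range(max_attempts):
        parsed = parse_extraction(call_model(email_text))
        if parsed is not None:
            return parsed
    raise ValueError("no schema-valid output after retries")
```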

Two-stage routing#

A small fast model classifies the input (“easy summarization” vs. “needs reasoning” vs. “out of scope”), and the second stage uses a task-specific prompt with the appropriate model. The router is cheap; the routing decisions cut the frontier-model spend dramatically.
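The second stage reduces to a lookup from the router's label to a model and template. In this sketch the label strings follow the three categories named above, while the route table, the template names, and the `classify` callable (the small-model call) are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical route table: router label -> (model tier, prompt template name).
ROUTES = {
    "easy_summarization": ("fast", "summary_v1"),
    "needs_reasoning": ("frontier", "summary_reasoning_v1"),
}


def route(text: str, classify: Callable[[str], str]) -> Optional[Tuple[str, str]]:
    """Return (model, template) for the request, or None if it is out of scope."""
    label = classify(text)
    if label == "out_of_scope":
        return None
    # Unknown labels fall back to the frontier route rather than failing.
    return ROUTES.get(label, ROUTES["needs_reasoning"])
```

Falling back to the expensive route on an unrecognized label is a design choice: it trades some cost for not dropping requests when the router misbehaves.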

Translation with terminology glossaries#

For domain translation (medical, legal, technical), pass a glossary of source-target term pairs in the system prompt and instruct the model to use them. Terminology consistency improves significantly over zero-shot translation and approaches what human translators get from termbases.
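A glossary prompt and a simple adherence check might look like this. Both function names are hypothetical, and the check is deliberately naive: it uses case-insensitive substring matching, which will miss inflected forms in morphologically rich target languages.

```python
from typing import Dict, List


def build_glossary_prompt(glossary: Dict[str, str], target_language: str) -> str:
    """Render the glossary into a system prompt; sorted for a stable, cacheable prefix."""
    lines = [
        f"Translate the user's text into {target_language}.",
        "Use these term translations exactly:",
    ]
    lines += [f"- {src} => {tgt}" for src, tgt in sorted(glossary.items())]
    return "\n".join(lines)


def glossary_violations(source: str, translation: str,
                        glossary: Dict[str, str]) -> List[str]:
    """Source terms present in the input whose required target term is absent from the output."""
    src_lower = source.lower()
    out_lower = translation.lower()
    return [
        src for src, tgt in glossary.items()
        if src.lower() in src_lower and tgt.lower() not in out_lower
    ]
```

Since the glossary acts as an instruction rather than a guarantee, running a check like `glossary_violations` on a sample of outputs gives a measurable adherence rate.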

Faithfulness verification#

After generating a summary, run a verifier prompt — “for each claim in this summary, find the supporting sentence in the source; flag claims without support.” Expensive (doubles cost), but it catches a substantial fraction of hallucinations on summarization tasks. Often run only on a sample for monitoring, not on every request.
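The verifier loop and the sampling decision can be sketched separately from the model call. Here `is_supported` is a hypothetical callable wrapping the verifier prompt, claims are approximated as sentences (a simplification), and sampling is done by hashing the request ID so the same request is always in or out of the sample.

```python
import hashlib
import re
from typing import Callable, List


def split_claims(summary: str) -> List[str]:
    """Approximate claims as sentences split on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]


def unsupported_claims(summary: str, source: str,
                       is_supported: Callable[[str, str], bool]) -> List[str]:
    """Return the claims the verifier could not ground in the source."""
    return [c for c in split_claims(summary) if not is_supported(c, source)]


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select about `rate` of requests for verification."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```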

Trade-offs#

Hosted frontier model — best quality, simplest to ship, per-token cost. Latency depends on the provider. Data leaves your perimeter. Capability scales with the model.

Self-hosted open-weights — capex on inference infrastructure, lower marginal cost at volume, full control over data residency. Quality gap is closing fast on routine tasks; still real on reasoning-heavy ones.

Other axes:

One general prompt vs. many specialized prompts. A single “summarize this” prompt is easy to maintain; per-task prompts (meetings, articles, tickets) produce better output per task type. The maintenance cost grows with the number of templates; most teams converge on 3 to 8 templates per product.
Synchronous vs. batched inference. Interactive endpoints answer one request at a time. Batched endpoints process queues of inputs with much better throughput per GPU. Use batch APIs for any non-interactive workflow; the cost savings are typically 30 to 50%, in exchange for results arriving asynchronously.
Output as plaintext vs. structured. Plaintext is easy for humans to consume and hard for code. JSON is the inverse. Pick per consumer; don’t pretend a single output works for both.

Quality and evaluation#

Each task type needs its own evaluation. The shared pieces:

Held-out evaluation set. A representative sample of real inputs with curated reference outputs. Sized for statistical reliability (a few hundred items minimum).
Task-appropriate metrics.
- Summarization: faithfulness (every claim supported), coverage (key points retained), conciseness (length vs. baseline). Reference-based metrics (ROUGE) are weak signals; LLM-as-judge graders are now standard.
- Translation: BLEU and chrF for high-resource pairs, COMET (a learned metric) for semantic adequacy, human evaluation for low-resource pairs.
- Extraction: field-level precision and recall against gold annotations. F1 is the headline number.
- Rewriting: rubric-graded (preserves meaning, hits target tone, no added facts).
Production telemetry. Pass rates, retry rates, refusal rates, latency, cost per request. Edge-case clusters get fed back into the eval set.

The pattern that ages well: a frozen eval set + LLM-as-judge + a weekly review of judge disagreements. The judge prompt has its own version control; it can drift just like the production prompt.
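For extraction, field-level precision, recall, and F1 can be computed directly from predicted and gold records. The sketch below uses exact-match comparison and micro-averaging across all fields, which are assumptions; some fields (dates, names) may deserve normalization before comparison.

```python
from typing import Dict, List, Tuple


def field_prf(predictions: List[Dict], gold: List[Dict]) -> Tuple[float, float, float]:
    """Micro-averaged field-level precision, recall, and F1 with exact match."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must align one-to-one")
    tp = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        for field in set(pred) | set(ref):
            p, g = pred.get(field), ref.get(field)
            if p is not None and p == g:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # predicted a value that is wrong or not in gold
                if g is not None:
                    fn += 1  # gold value missed or predicted incorrectly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A wrong value counts as both a false positive and a false negative, which is the usual convention for slot-filling metrics.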

Common pitfalls#

Choosing the model before measuring the task. The right model depends on the task. Run the eval set on three candidates (fast / mid / frontier) before committing.
Ignoring TTFT in user-facing apps. Long prompts kill TTFT. If a feature is interactive, prefill cost is the design constraint, not output length.
Truncating at the wrong end. For “answer the user’s question about this document”, the relevant content is often at the end of the document, not the start. Truncating the tail by default silently regresses quality.
No faithfulness check. Models add plausible-sounding facts to summaries when the source is sparse. Without a verifier, you ship hallucinations.
Paying for output tokens but not measuring them. Output length is a cost lever the model controls. Set explicit length budgets in prompts (“at most 6 bullets, under 80 words total”) and verify them; otherwise spend grows unpredictably with prompt drift.
Assuming a single language. A prompt that defaults to English produces broken output on Japanese input. Pin the response language to the input language or to an explicit user setting.
Markdown vs. plaintext confusion. Models love markdown; downstream consumers (Slack, email, CRM fields) often display it as raw asterisks. Convert the output to the consumer’s format on the producing side before sending it, rather than expecting the consumer to render it.
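Verifying a length budget, as the output-token pitfall above recommends, takes only a few lines. This sketch checks the example budget of at most 6 bullets and under 80 words; the function name and the set of recognized bullet markers are assumptions.

```python
from typing import List

BULLET_MARKERS = ("- ", "* ", "• ")


def check_length_budget(output: str, max_bullets: int = 6,
                        max_words: int = 80) -> List[str]:
    """Return a list of budget violations; an empty list means the output is within budget."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    bullets = [line for line in lines if line.startswith(BULLET_MARKERS)]
    # Strip leading bullet markers so they are not counted as words.
    words = sum(len(line.lstrip("-*• ").split()) for line in lines)
    problems = []
    if len(bullets) > max_bullets:
        problems.append(f"{len(bullets)} bullets exceeds the limit of {max_bullets}")
    if words >= max_words:  # "under 80 words" means strictly fewer than 80
        problems.append(f"{words} words is not under {max_words}")
    return problems
```

Logging the violation rate per prompt version makes drift visible before it shows up on the bill.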

The one technique that changed translation quality overnight

For domain translation (medical, legal, technical, marketing), adding a per-domain glossary of 20 to 200 source-target term pairs to the system prompt is often the highest-leverage change you can make. The model treats the glossary as a strong instruction rather than a guaranteed constraint, but in practice terminology becomes far more consistent across large document sets. Without it, “patient outcome” might be rendered three different ways in the same chapter. The glossary lives in source control next to the prompt; product owners maintain it; translations stabilize. The technique is so effective that some teams stop tuning the prompt entirely after adding it and just maintain the glossary.