Text-to-Speech Generation Systems
Neural TTS, voice cloning, prosody, the streaming-audio pipeline. What real-time voice products are actually doing.
Use cases
A text-to-speech (TTS) system turns text into audio that sounds like a person speaking. The 2010s versions (concatenative, then neural vocoders with predefined voices) were good enough for navigation and screen readers. The 2020s versions — ElevenLabs, OpenAI’s voice models, Google’s Chirp 3, Microsoft’s VALL-E line — produce voices that are hard to distinguish from human recordings, can be cloned from seconds of reference audio, and stream with latency low enough for real-time dialogue.
The shapes that recur:
- Voice agents. Phone systems, drive-throughs, customer support bots, voice assistants. Real-time bidirectional audio; latency budget is harsh (sub-500 ms end-to-end).
- Content narration. Audiobooks, podcasts, video voiceover, e-learning. Offline; quality and consistency matter more than latency.
- Accessibility. Screen readers, reading aids, communication tools for non-speaking users. Trades cutting-edge quality for stability, language coverage, and offline support.
- Localization. Dubbing existing video into other languages while preserving the speaker’s voice. Combines voice cloning with cross-lingual synthesis.
- Game characters and dynamic dialogue. Runtime synthesis for player-specific content. Mid-latency budget; voice consistency across thousands of lines is the hard problem.
- Custom branded voices. A product or company “voice” used across all generated audio. Cloned once; reused everywhere.
TTS is a poor fit when the audio must contain musical content (different stack — see Audio and Music Generation), when the production needs frame-accurate timing to existing video, or when the deployment context cannot tolerate any synthesized-voice artifacts.
System overview
A modern neural TTS system has three core stages (text frontend, acoustic model, and vocoder), wrapped by conditioning and postprocessing steps, though the boundaries increasingly blur into end-to-end systems:
[Input text + voice ID + style hints]
        │
        ▼
[Text frontend]
  - normalization (numbers, dates, abbreviations)
  - tokenization
  - phonemization (text -> phonemes, optional)
  - language / script detection
        │
        ▼
[Conditioning]
  - voice / speaker embedding (from a reference clip or a stored profile)
  - style / emotion vector
  - prosody hints (SSML, punctuation cues)
        │
        ▼
[Acoustic model]
  - autoregressive token model OR
  - non-autoregressive (FastSpeech-style) OR
  - diffusion (NaturalSpeech, Tortoise variants)
  output: mel-spectrogram or discrete audio tokens
        │
        ▼
[Vocoder / decoder]
  - mel -> waveform (HiFi-GAN, BigVGAN) OR
  - tokens -> waveform (SoundStream, EnCodec)
  output: 24 kHz or 48 kHz PCM
        │
        ▼
[Postprocessing]
  - level normalization, EQ
  - chunk-boundary smoothing for streaming
  - watermark
        │
        ▼
[Audio out: streamed or downloaded]

The total parameter count is much smaller than a frontier LLM (often under a billion parameters), so single-GPU latency is dominated by autoregressive decode and vocoder pass, not by raw model size.
Key components
Text frontend
The unglamorous half. Bad normalization is the single most common quality bug:
- Numbers, dates, currency, time. “$3.50 on 1/2/25” must become spoken text. Locale-aware: 1/2/25 is January 2nd in the US and February 1st in most of the world.
- Abbreviations. “Dr.”, “St.”, “Inc.”, “etc.” need context-dependent expansion: “Dr.” is “Doctor” in “Dr. Smith” but “Drive” in “12 Oak Dr.”, and “St.” can be “Saint” or “Street”.
- Acronyms vs. initialisms. “NASA” is read as a word, “FBI” as letters. Many models include classifier heads for this.
- Punctuation as prosody. Periods, commas, ellipses, em-dashes all map to pauses of different lengths. Question marks lift intonation; exclamation points alter energy.
- Markup languages. SSML (Speech Synthesis Markup Language) lets prompt authors mark up emphasis, pauses, pitch, and pronunciation. Most production engines accept SSML; many also accept their own inline style tags (“[whispering]”, “[excited]”).
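The currency and date rules above can be sketched in a few lines. This is a minimal illustration, not a production normalizer: it assumes amounts below $1,000, two-digit years in the 2000s, and only two locale behaviours (month-first for `en-US`, day-first otherwise). All function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ORDINALS = ["", "first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def day_to_ordinal(day: int) -> str:
    """Spell out a day of the month (1..31) as an ordinal."""
    if day < 20:
        return ORDINALS[day]
    if day in (20, 30):
        return "twentieth" if day == 20 else "thirtieth"
    return ("twenty-" if day < 30 else "thirty-") + ORDINALS[day % 10]


def expand_currency(match: re.Match) -> str:
    dollars, cents = int(match.group(1)), int(match.group(2) or 0)
    parts = [f"{number_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"]
    if cents:
        parts.append(f"{number_to_words(cents)} cent{'' if cents == 1 else 's'}")
    return " and ".join(parts)


def expand_date(match: re.Match, locale: str) -> str:
    a, b, yy = (int(g) for g in match.groups())
    # Assumption: en-US is month/day, every other locale is day/month.
    month, day = (a, b) if locale == "en-US" else (b, a)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return match.group(0)  # not a plausible date; leave it untouched
    if yy >= 10:
        year = "twenty " + number_to_words(yy)
    else:
        year = "two thousand" + (" " + number_to_words(yy) if yy else "")
    return f"{MONTHS[month - 1]} {day_to_ordinal(day)}, {year}"


def normalize(text: str, locale: str = "en-US") -> str:
    text = re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?", expand_currency, text)
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b",
                  lambda m: expand_date(m, locale), text)


print(normalize("$3.50 on 1/2/25"))
print(normalize("$3.50 on 1/2/25", locale="en-GB"))
```

The same input yields "January second" for `en-US` and "February first" for `en-GB`, which is why the locale has to be an explicit input to the frontend rather than something guessed from the text.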
Acoustic model architectures
Three families dominate, each with trade-offs:
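- Autoregressive (AR) token models. Generate audio one token at a time, conditioned on text. Examples include VALL-E and Tortoise (which pairs an AR model with a diffusion decoder); the current OpenAI and ElevenLabs lines are generally believed to belong to this family, although their architectures are not public. Highest quality, especially on expressiveness; slowest because each token waits for the previous one.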
- Non-autoregressive (NAR) models. Predict a full mel-spectrogram in parallel (FastSpeech, FastSpeech 2). Fast; less expressive variation; needs explicit duration prediction.
- Diffusion-based. Iterative refinement in latent or mel space (NaturalSpeech 2/3, StyleTTS 2). Strong quality; latency depends on step count.
The 2024+ frontier is mostly AR token models on top of discrete audio codecs. They generate text-like sequences of audio tokens, and the codec’s decoder (or a separate vocoder) converts those tokens back to a waveform.
Discrete audio codecs
A neural codec compresses raw audio into a small sequence of discrete tokens, often with multiple codebooks per timestep:
- EnCodec, SoundStream, DAC. Compress 24 kHz audio to roughly 50 to 75 frames per second, with each frame represented by one token from each of several codebooks (8 is a common setting), so about 400 to 600 tokens per second in total. The acoustic model predicts these tokens; the codec decoder produces the waveform.
- Why this matters: predicting a few hundred tokens per second is much cheaper than predicting 24,000 raw samples per second. It also lets language-model architectures handle audio as if it were text.
This is the architectural shift that makes modern voice cloning practical — the audio model is essentially a small LLM operating on audio tokens.
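The arithmetic behind that claim is short enough to write down. The sketch below uses the frame rates and codebook count quoted above; the codebook size of 1,024 entries is an assumption (a common choice, not something stated in this article), and the function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 24_000  # raw PCM samples per second at 24 kHz


def tokens_per_second(frame_rate_hz: int, num_codebooks: int) -> int:
    """Each frame carries one token per codebook."""
    return frame_rate_hz * num_codebooks


def bitrate_kbps(frame_rate_hz: int, num_codebooks: int, codebook_size: int) -> float:
    """Bits per token is log2(codebook size); result is in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second(frame_rate_hz, num_codebooks) * bits_per_token / 1000


for frame_rate in (50, 75):
    tokens = tokens_per_second(frame_rate, num_codebooks=8)
    reduction = SAMPLE_RATE_HZ / tokens
    print(f"{frame_rate} frames/s x 8 codebooks = {tokens} tokens/s "
          f"({reduction:.0f}x fewer prediction steps than raw samples)")

print(f"{bitrate_kbps(75, 8, 1024):.1f} kbps at 75 frames/s, 8 codebooks, 1024 entries")
```

Even at the high end, the model predicts 600 tokens per second instead of 24,000 samples, a 40x reduction in sequence length before any parallel-codebook tricks are applied.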
Vocoder
The vocoder maps the acoustic representation to a final waveform. Even with token-based models, a vocoder (or codec decoder) is the last stage:
- HiFi-GAN, BigVGAN. GAN-based; very fast; the standard for mel-to-wave conversion since 2021.
- Neural codec decoders. When the acoustic model emits codec tokens, the codec’s own decoder is the vocoder. No separate stage needed.
Vocoder latency on a modern GPU is under 50 ms for a few seconds of audio.
Voice conditioning
How the model knows which voice to use:
- Speaker ID embedding. A learned vector per known voice. Used for product-shipped voice catalogs.
- Reference-based cloning. Pass a 3 to 30 second clip; the model encodes it and conditions generation. Works zero-shot from a single short sample on modern systems.
- Fine-tuned voice. Train an adapter on 5 to 30 minutes of a speaker’s recordings. Higher fidelity than zero-shot cloning; needs the speaker’s audio.
- Multi-speaker latent space. Some models let you interpolate between voices or specify them with text descriptions (“a calm female voice in her thirties”).
Streaming and chunking
For real-time use, the model can’t wait for the full text. Two patterns:
- Token-level streaming. The acoustic model emits audio tokens as text tokens arrive. The vocoder runs in chunks. End-to-end latency is dominated by chunk size and overlap.
- Sentence chunking. Split text on sentence boundaries; generate each chunk to completion; stream chunks back-to-back. Simpler; introduces audible boundaries on pause/no-pause transitions if not crossfaded.
Sub-300 ms first-byte audio latency is the target for natural voice agents. Modern engines (ElevenLabs Flash, OpenAI Realtime) hit that.
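The sentence-chunking pattern, including the crossfade that hides chunk boundaries, might look roughly like this. It is a sketch under simplifying assumptions: audio is a plain list of float samples, the splitter is a naive regex (it would wrongly split after “Dr.”), and the fade is linear. A real pipeline would use array types and a proper sentence segmenter.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two chunks, blending the tail of `a` into the head of `b`.

    Over the overlap region the weight on `a` falls linearly toward 0
    while the weight on `b` rises toward 1.
    """
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    tail_start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight on chunk b
        blended.append((1 - w) * a[tail_start + i] + w * b[i])
    return a[:tail_start] + blended + b[overlap:]


sentences = split_sentences("Hello there. How are you? Fine!")
print(sentences)

joined = crossfade([1.0] * 4, [0.0] * 4, overlap=2)
print(joined)
```

Note that a crossfade shortens the output by `overlap` samples per boundary, so the overlap should stay small (a few milliseconds of audio) to avoid clipping word onsets.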
Implementation patterns
SSML for production control
Production TTS pipelines wrap user text in SSML for fine control:
- <break time="500ms"/> for explicit pauses.
- <prosody rate="slow" pitch="-2st"> for emphasis or comedic timing.
- <phoneme alphabet="ipa" ph="..."> for proper-noun pronunciation that the model won’t get right alone.
- <say-as interpret-as="characters"> for spelled-out IDs.
A custom-pronunciation dictionary (lexicon) sits in front of the model and patches in <phoneme> tags for product names, brand terms, and known mispronunciations.
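A lexicon pass of this kind can be sketched as a single regex substitution that wraps known terms in `<phoneme>` tags and XML-escapes everything else. The product name and its IPA string below are made up for illustration, and `apply_lexicon` / `to_ssml` are hypothetical helper names; which SSML elements an engine honours varies by vendor.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: term -> IPA pronunciation.
LEXICON = {"Xyloq": "ˈzaɪlɒk"}


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Wrap lexicon terms in <phoneme> tags; escape all other text for XML."""
    if not lexicon:
        return escape(text)
    # Longest terms first so a longer entry wins over one of its prefixes.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    out, pos = [], 0
    for m in pattern.finditer(text):
        term = m.group(1)
        out.append(escape(text[pos:m.start()]))
        out.append(f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[term])}>'
                   f"{escape(term)}</phoneme>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


def to_ssml(text: str, lexicon: dict[str, str], pause_ms: int = 0) -> str:
    """Build a <speak> document, optionally ending with an explicit pause."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return f"<speak>{apply_lexicon(text, lexicon)}{pause}</speak>"


print(to_ssml("Ask Xyloq & co.", LEXICON, pause_ms=500))
```

Doing the substitution in one pass over the raw text, and escaping as it goes, avoids two easy bugs: matching lexicon terms inside tags inserted by an earlier pass, and emitting unescaped `&` or `<` from user text.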
Voice cloning consent and provenance
Voice cloning has direct misuse risk. Production systems gate cloning behind:
- Consent capture. A spoken consent phrase (“I authorize cloning my voice for…”) recorded in the same session as the reference clip.
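- Voice fingerprinting. A stored fingerprint of each reference clip, typically a speaker embedding rather than a file hash so that a different recording of the same voice still matches; alerts if someone tries to clone a flagged or public-figure voice.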
- Output watermarking. Inaudible watermarks (for example, SynthID for audio) embedded in the generated waveform; detectable by paired detectors.
The pattern is mature enough that responsible API providers refuse cloning without consent capture, and regulators are starting to require it.
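A cloning gate that combines the consent and fingerprint checks could be sketched as below. It assumes the fingerprint is a speaker embedding produced by some speaker-encoder model (not shown), and the 0.75 similarity threshold is an arbitrary placeholder that would need tuning against real data. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def may_clone(reference_embedding: list[float],
              consent_verified: bool,
              flagged_embeddings: dict[str, list[float]],
              threshold: float = 0.75) -> tuple[bool, str]:
    """Return (allowed, reason). Consent is checked before anything else."""
    if not consent_verified:
        return False, "no consent recorded"
    for name, embedding in flagged_embeddings.items():
        if cosine_similarity(reference_embedding, embedding) >= threshold:
            return False, f"matches flagged voice: {name}"
    return True, "ok"


# Toy 2-dimensional embeddings; real ones have hundreds of dimensions.
flagged = {"public-figure-001": [1.0, 0.1]}
print(may_clone([1.0, 0.0], consent_verified=True, flagged_embeddings=flagged))
print(may_clone([0.0, 1.0], consent_verified=True, flagged_embeddings=flagged))
print(may_clone([0.0, 1.0], consent_verified=False, flagged_embeddings=flagged))
```

Returning a reason alongside the decision makes refusals auditable, which matters once regulators or rights holders ask why a specific request was blocked or allowed.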
Caching and pre-rendering
For deterministic prompts (system messages, branded greetings, error responses), generate once and cache the waveform. A voice agent that says “I didn’t catch that, could you repeat?” thousands of times per day should never re-synthesize it.
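A minimal in-memory version of this cache is sketched below. The `synthesize` callable stands in for whatever TTS client the system uses (its signature here is an assumption), and the key includes the model version so that a silent engine update cannot serve stale audio for a voice that now sounds different.

```python
import hashlib
from typing import Callable


class TTSCache:
    """Cache synthesized audio keyed by (model version, voice, text)."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]):
        self._synthesize = synthesize
        self._store: dict[str, bytes] = {}
        self.misses = 0

    @staticmethod
    def _key(text: str, voice_id: str, model_version: str) -> str:
        # A unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
        raw = "\x1f".join((model_version, voice_id, text))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, text: str, voice_id: str, model_version: str) -> bytes:
        key = self._key(text, voice_id, model_version)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice_id, model_version)
        return self._store[key]


def fake_synthesize(text: str, voice_id: str, model_version: str) -> bytes:
    """Stand-in for a real TTS call; returns placeholder bytes."""
    return f"{model_version}/{voice_id}/{text}".encode("utf-8")


cache = TTSCache(fake_synthesize)
prompt = "I didn't catch that, could you repeat?"
for _ in range(1000):
    cache.get(prompt, "brand-voice", "2025-01")
print(cache.misses)
```

In production the dictionary would typically be replaced by an object store or CDN, but the keying rule is the part worth keeping: text, voice, and model version together.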
Dynamic style transfer
Modern engines accept style hints (“read this excitedly”, “calm tone”, “whispering”) that condition expression without changing the voice. Useful for narration where energy shifts mid-paragraph; risky if overdone because style transitions on sentence boundaries can sound jarring.
Cross-lingual synthesis with preserved voice
A speaker’s voice is encoded once; the model generates in another language with the same timbre. Used heavily in dubbing pipelines and accessibility tools. Quality depends on training data — voices stay convincing across major Indo-European languages; tonal languages (Mandarin, Vietnamese) are harder to clone cleanly into.
Trade-offs
The main axes:
- Latency vs. quality. Fast models (sub-200 ms first byte) trade some prosody quality for speed. Slow models (1 second+) produce more expressive output. Use fast for dialogue, slow for narration.
- Voice catalog vs. cloning. A fixed catalog is operationally simple — quality is auditioned in advance, voices stay consistent. Cloning is flexible but has consent / abuse surface and quality variance per reference clip.
- Streaming vs. file output. Streaming is necessary for dialogue, complicates caching and post-processing. File output is simpler, can be normalized and watermarked in batch.
- Phonemized vs. raw-text input. Phonemized inputs (g2p preprocessing) give better pronunciation control but require maintained lexicons per language. Raw-text models pull lexicon work into the acoustic model — less control, more general.
Quality and evaluation
Speech quality has been studied for decades; the metric vocabulary is well-developed:
- Mean Opinion Score (MOS). Human listeners rate samples 1 to 5. The gold standard; expensive; the only metric that captures “does this sound real?”.
- Comparative MOS (CMOS). Listeners hear two samples (A vs. B) and rate B relative to A, typically on a -3 to +3 scale. More sensitive than absolute MOS for ranking close-quality models.
- Automated proxies.
  - UTMOS, MOSNet. Models trained on human MOS ratings; useful for regression detection but unreliable for absolute quality.
  - Word Error Rate (WER) of an ASR transcription of the synthesized audio. Catches intelligibility regressions cheaply.
  - Speaker similarity (SECS). For cloning evaluation: cosine similarity between speaker embeddings of reference and synthesized audio.
- Production telemetry. Re-listen rate, skip rate, complaint rate, latency distribution. For voice agents: ASR re-prompt rate (the bot asked the user to repeat) often correlates with synthesis quality.
- Pronunciation accuracy. A frozen list of brand terms, proper nouns, and product names with reference pronunciations; ASR-based check on synthesized output.
For regulated deployments (accessibility, automotive, healthcare), intelligibility in noise is also measured — synthesize audio, mix with background noise, run ASR, measure WER. Real-world intelligibility differs significantly from quiet-room MOS.
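The WER half of that loop is a word-level edit distance divided by the reference length. The sketch below takes the ASR transcript as a plain string, since the ASR and noise-mixing steps depend on the tooling in use; the text normalization (lowercasing and whitespace splitting) is deliberately crude.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


script = "turn left on main street"
clean_transcript = "turn left on main street"
noisy_transcript = "turn left main street"  # ASR dropped one word

print(word_error_rate(script, clean_transcript))
print(word_error_rate(script, noisy_transcript))
```

Running the same script through the clean and the noise-mixed condition and comparing the two WER values gives a cheap regression signal that can run on every model or voice update.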
Common pitfalls
- Skipping text normalization. Numbers, dates, and abbreviations need explicit rules. Without them, “$3.50” becomes “dollar three dot five zero” or worse.
- No custom pronunciation dictionary. Brand and product names will be mispronounced. A 50-entry lexicon per language is high-leverage.
- First-chunk latency ignored. Streaming masks decode latency but not prefill. If first byte takes 800 ms, the conversation feels broken regardless of how fast the rest streams.
- Voice catalog drift. Engines update voices silently; a voice that sounded one way in February sounds different in May. Pin versions if you ship branded experiences.
- Cloning without consent. Operational and legal risk. Build the consent flow before the feature.
- Missing watermark on generated output. Audio provenance regulation is tightening; absent watermarks force costly retrofits.
- Treating all languages as English. Non-English performance varies wildly per engine and per voice. Audit the languages you actually serve.
- One model for narration and dialogue. Long-form narration and short-form dialogue have different prosody needs. Many production stacks use different models per use case.
- No silence trimming. Models often emit a small silence at start and end of each chunk. Without trimming, concatenated chunks produce audible gaps.
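For the silence-trimming pitfall, a simple amplitude-threshold trim is often enough as a first pass. The sketch below works on a list of float samples in the range -1.0 to 1.0; the threshold and the `keep` padding are placeholder values, and a production version would more likely use a short-window energy measure than single-sample amplitude.

```python
def trim_silence(samples: list[float], threshold: float = 0.01,
                 keep: int = 0) -> list[float]:
    """Drop leading and trailing samples whose magnitude is below `threshold`.

    `keep` retains that many quiet samples on each side so onsets are not clipped.
    Returns an empty list if the whole chunk is below the threshold.
    """
    start = 0
    while start < len(samples) and abs(samples[start]) < threshold:
        start += 1
    end = len(samples)
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    if start == end:
        return []
    start = max(0, start - keep)
    end = min(len(samples), end + keep)
    return samples[start:end]


chunk = [0.0, 0.001, 0.5, -0.4, 0.0, 0.0]
print(trim_silence(chunk))
print(trim_silence(chunk, keep=1))
```

Trimming per chunk and then inserting a controlled pause between chunks gives consistent gaps, instead of whatever silence the model happened to emit.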
Related applications
- Audio and Music Generation
- Multimodal Models
- Prompt Engineering
- Text-to-Text Systems
- Autonomous AI Agents
Why discrete audio codecs changed everything
Pre-2022 TTS was a stack of bespoke neural components (phoneme model, duration model, mel predictor, vocoder), each trained separately with task-specific losses. The 2022-2024 shift to discrete audio codecs (SoundStream, EnCodec, DAC) collapsed most of that into “predict the next audio token, autoregressively, given text”. A voice cloning model is now structurally a small language model with text in and audio tokens out. This is why architectural improvements to LLMs (better attention, longer context, mixture-of-experts) tend to carry over to TTS quickly: they are the same family of models. It is also why zero-shot voice cloning from 3 seconds of reference audio works at all: the model treats the reference voice as a conditioning prefix, much as an LLM treats a system prompt.