Text-to-Speech Generation Systems

Neural TTS, voice cloning, prosody, the streaming-audio pipeline. What real-time voice products are actually doing.

Use cases#

A text-to-speech (TTS) system turns text into audio that sounds like a person speaking. The 2010s versions (concatenative, then neural vocoders with predefined voices) were good enough for navigation and screen readers. The 2020s versions — ElevenLabs, OpenAI’s voice models, Google’s Chirp 3, Microsoft’s VALL-E line — produce voices that are hard to distinguish from human recordings, can be cloned from seconds of reference audio, and stream with latency low enough for real-time dialogue.

The shapes that recur:

  • Voice agents. Phone systems, drive-throughs, customer support bots, voice assistants. Real-time bidirectional audio; latency budget is harsh (sub-500 ms end-to-end).
  • Content narration. Audiobooks, podcasts, video voiceover, e-learning. Offline; quality and consistency matter more than latency.
  • Accessibility. Screen readers, reading aids, communication tools for non-speaking users. Trades cutting-edge quality for stability, language coverage, and offline support.
  • Localization. Dubbing existing video into other languages while preserving the speaker’s voice. Combines voice cloning with cross-lingual synthesis.
  • Game characters and dynamic dialogue. Runtime synthesis for player-specific content. Mid-latency budget; voice consistency across thousands of lines is the hard problem.
  • Custom branded voices. A product or company “voice” used across all generated audio. Cloned once; reused everywhere.

TTS is a poor fit when the audio must contain musical content (different stack — see Audio and Music Generation), when the production needs frame-accurate timing to existing video, or when the deployment context cannot tolerate any synthesized-voice artifacts.

System overview#

A modern neural TTS system has three stages — text frontend, acoustic model, and vocoder — though increasingly the boundaries blur into end-to-end systems:

[Input text + voice ID + style hints]
[Text frontend]
  - normalization (numbers, dates, abbreviations)
  - tokenization
  - phonemization (text -> phonemes, optional)
  - language / script detection
[Conditioning]
  - voice / speaker embedding (from a reference clip or a stored profile)
  - style / emotion vector
  - prosody hints (SSML, punctuation cues)
[Acoustic model]
  - autoregressive token model OR
  - non-autoregressive (FastSpeech-style) OR
  - diffusion (NaturalSpeech, Tortoise variants)
  output: mel-spectrogram or discrete audio tokens
[Vocoder / decoder]
  - mel -> waveform (HiFi-GAN, BigVGAN) OR
  - tokens -> waveform (SoundStream, EnCodec)
  output: 24 kHz or 48 kHz PCM
[Postprocessing]
  - level normalization, EQ
  - chunk-boundary smoothing for streaming
  - watermark
[Audio out: streamed or downloaded]

The total parameter count is much smaller than a frontier LLM (often under a billion parameters), so single-GPU latency is dominated by the number of sequential autoregressive decode steps and the vocoder pass, not by raw model size.

Key components#

Text frontend#

The unglamorous half. Bad normalization is the single most common quality bug:

  • Numbers, dates, currency, time. “$3.50 on 1/2/25” must become spoken text. Locale-aware: 1/2/25 is January 2nd in the US and February 1st in most of the world.
  • Abbreviations. “Dr.”, “St.”, “Inc.”, “etc.” need context-dependent expansion (“Dr. Smith” is “Doctor Smith”, but “Maple Dr.” is “Maple Drive”).
  • Acronyms vs. initialisms. “NASA” is read as a word, “FBI” as letters. Frontends usually resolve this with a lexicon lookup, sometimes backed by a learned classifier.
  • Punctuation as prosody. Periods, commas, ellipses, em-dashes all map to pauses of different lengths. Question marks lift intonation; exclamation points alter energy.
  • Markup languages. SSML (Speech Synthesis Markup Language) lets authors mark up emphasis, pauses, pitch, and pronunciation. Most production engines accept SSML; many also accept their own inline style tags (“[whispering]”, “[excited]”).
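
The normalization rules above can be sketched in a few lines. This is a minimal illustration, not a production frontend: it only handles dollar amounts under $100 and numeric dates with two-digit years, it assumes those years fall in the 2000s, and the function names (`normalize`, `cardinal`, `ordinal`, `speak_year`) are made up for this example.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR_ORDINALS = {1: "first", 2: "second", 3: "third", 5: "fifth",
                      8: "eighth", 9: "ninth", 12: "twelfth"}


def cardinal(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def ordinal(n: int) -> str:
    """Spell out 1..31 as an ordinal (enough for days of the month)."""
    if n in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[n]
    if n < 20:
        return ONES[n] + "th"
    tens, ones = divmod(n, 10)
    if ones == 0:
        return TENS[tens][:-1] + "ieth"  # twenty -> twentieth
    return TENS[tens] + "-" + ordinal(ones)


def speak_year(yy: int) -> str:
    """Assumption: two-digit years belong to the 2000s."""
    if yy == 0:
        return "two thousand"
    if yy < 10:
        return "two thousand " + ONES[yy]
    return "twenty " + cardinal(yy)


def normalize(text: str, locale: str = "en-US") -> str:
    """Expand small dollar amounts and numeric dates into spoken words."""
    def money(m):
        dollars, cents = int(m.group(1)), int(m.group(2))
        d_unit = "dollar" if dollars == 1 else "dollars"
        c_unit = "cent" if cents == 1 else "cents"
        return f"{cardinal(dollars)} {d_unit} and {cardinal(cents)} {c_unit}"

    def date(m):
        a, b, yy = int(m.group(1)), int(m.group(2)), int(m.group(3))
        # en-US reads month/day; every other locale here is treated as day/month.
        month, day = (a, b) if locale == "en-US" else (b, a)
        if not (1 <= month <= 12 and 1 <= day <= 31):
            return m.group(0)  # leave anything implausible untouched
        return f"{MONTHS[month - 1]} {ordinal(day)}, {speak_year(yy)}"

    text = re.sub(r"\$(\d{1,2})\.(\d{2})\b", money, text)
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b", date, text)


print(normalize("$3.50 on 1/2/25", "en-US"))
print(normalize("$3.50 on 1/2/25", "en-GB"))
```

The two calls print the same amount but different dates (January second versus February first), which is the locale ambiguity the first bullet describes. A real frontend would cover far more patterns and usually relies on a maintained normalization library rather than hand-written regular expressions.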

Acoustic model architectures#

Three families dominate, each with trade-offs:

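  • Autoregressive (AR) token models. Generate audio one token at a time, conditioned on text. Examples include VALL-E and Tortoise (which pairs an AR model with a diffusion decoder); recent OpenAI and ElevenLabs models are generally assumed to follow this pattern, though their architectures are not published in detail. Highest quality, especially on expressiveness; slowest because each token waits for the previous one.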
  • Non-autoregressive (NAR) models. Predict a full mel-spectrogram in parallel (FastSpeech, FastSpeech 2). Fast; less expressive variation; needs explicit duration prediction.
  • Diffusion-based. Iterative refinement in latent or mel space (NaturalSpeech 2/3, StyleTTS 2). Strong quality; latency depends on step count.

The 2024+ frontier is mostly AR token models on top of discrete audio codecs. They generate text-like sequences of audio tokens; a separate vocoder converts those tokens back to waveform.

Discrete audio codecs#

A neural codec compresses raw audio into a small sequence of discrete tokens, often with multiple codebooks per timestep:

  • EnCodec, SoundStream, DAC. Compress 24 kHz audio to roughly 50 to 75 frames per second, with several codebook tokens per frame (8 is a common configuration). The acoustic model predicts these tokens; the codec decoder produces the waveform.
  • Why this matters: predicting a few hundred tokens per second is much cheaper than predicting tens of thousands of raw samples per second (24,000 at 24 kHz). It also lets language-model architectures handle audio as if it were text.

This is the architectural shift that makes modern voice cloning practical — the audio model is essentially a small LLM operating on audio tokens.
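
To make the compression argument concrete, the arithmetic can be worked through directly. The sketch below counts one token per codebook per frame and uses a configuration matching EnCodec's 24 kHz model at 6 kbps (75 frames per second, 8 codebooks of 1,024 entries each); the helper name `codec_rates` is invented for this example.

```python
import math


def codec_rates(sample_rate_hz: int, frame_rate_hz: int,
                n_codebooks: int, codebook_size: int) -> dict:
    """Compare the raw sample rate with the token rate of a neural codec."""
    tokens_per_second = frame_rate_hz * n_codebooks
    bits_per_token = math.log2(codebook_size)
    return {
        "samples_per_second": sample_rate_hz,
        "tokens_per_second": tokens_per_second,
        "bitrate_bps": tokens_per_second * bits_per_token,
        # How many raw samples each predicted token stands in for.
        "samples_per_token": sample_rate_hz / tokens_per_second,
    }


rates = codec_rates(sample_rate_hz=24_000, frame_rate_hz=75,
                    n_codebooks=8, codebook_size=1024)
print(rates)
# {'samples_per_second': 24000, 'tokens_per_second': 600,
#  'bitrate_bps': 6000.0, 'samples_per_token': 40.0}
```

Under these assumptions the acoustic model predicts 600 tokens for each second of audio instead of 24,000 samples, and along the time axis the sequence has only 75 steps per second, which is what makes language-model-style decoding tractable.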

Vocoder#

The vocoder maps the acoustic representation to a final waveform. Even with token-based models, a vocoder (or codec decoder) is the last stage:

  • HiFi-GAN, BigVGAN. GAN-based; very fast; the standard for mel-to-wave conversion since 2021.
  • Neural codec decoders. When the acoustic model emits codec tokens, the codec’s own decoder is the vocoder. No separate stage needed.

Vocoder latency on a modern GPU is typically under 50 ms for a few seconds of audio.

Voice conditioning#

How the model knows which voice to use:

  • Speaker ID embedding. A learned vector per known voice. Used for product-shipped voice catalogs.
  • Reference-based cloning. Pass a 3 to 30 second clip; the model encodes it and conditions generation. Works zero-shot from a single short sample on modern systems.
  • Fine-tuned voice. Train an adapter on 5 to 30 minutes of a speaker’s recordings. Higher fidelity than zero-shot cloning; needs the speaker’s audio.
  • Multi-speaker latent space. Some models let you interpolate between voices or specify them with text descriptions (“a calm female voice in her thirties”).

Streaming and chunking#

For real-time use, the model can’t wait for the full text. Two patterns:

  • Token-level streaming. The acoustic model emits audio tokens as text tokens arrive. The vocoder runs in chunks. End-to-end latency is dominated by chunk size and overlap.
  • Sentence chunking. Split text on sentence boundaries; generate each chunk to completion; stream chunks back-to-back. Simpler; introduces audible boundaries on pause/no-pause transitions if not crossfaded.

Sub-300 ms first-byte audio latency is the target for natural voice agents. Low-latency engines such as ElevenLabs Flash and the OpenAI Realtime API are marketed as meeting it.
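
The sentence-chunking pattern, together with the crossfade that hides chunk boundaries, might look roughly like this. It is a simplified sketch: audio is represented as plain lists of float samples, the splitter is naive (it would also split after “Dr.”), and both function names are hypothetical.

```python
import re
from typing import Iterator, List


def sentence_chunks(text: str) -> Iterator[str]:
    """Yield sentences, splitting after '.', '!' or '?' followed by whitespace."""
    for part in re.split(r"(?<=[.!?])\s+", text.strip()):
        if part:
            yield part


def crossfade(a: List[float], b: List[float], overlap: int) -> List[float]:
    """Join two buffers, linearly blending the end of `a` into the start of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b` rises across the overlap
        mixed.append(tail[i] * (1 - w) + b[i] * w)
    return head + mixed + b[overlap:]


print(list(sentence_chunks("Hello there. How are you? Fine!")))
# ['Hello there.', 'How are you?', 'Fine!']
```

In a streaming server, each yielded sentence would be sent to the synthesizer as soon as it is available, and the resulting buffers joined with `crossfade` before playback. The overlap length is a tuning parameter; a few milliseconds of samples is a common starting point.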

Implementation patterns#

SSML for production control#

Production TTS pipelines wrap user text in SSML for fine control:

  • <break time="500ms"/> for explicit pauses.
  • <prosody rate="slow" pitch="-2st"> for emphasis or comedic timing.
  • <phoneme alphabet="ipa" ph="..."> for proper-noun pronunciation that the model won’t get right alone.
  • <say-as interpret-as="characters"> for spelled-out IDs.

A custom-pronunciation dictionary (lexicon) sits in front of the model and patches in <phoneme> tags for product names, brand terms, and known mispronunciations.
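
A lexicon pass of this kind can be sketched as a single regular-expression substitution that wraps known terms in `<phoneme>` tags. The product name “Zyntra” and its IPA string are invented for illustration, and `to_ssml` is a hypothetical helper; the SSML elements themselves are the ones listed above.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: term -> IPA pronunciation (made-up brand name).
LEXICON = {"Zyntra": "ˈzɪntrə"}


def to_ssml(text: str, lexicon: Dict[str, str]) -> str:
    """Escape user text, then wrap lexicon terms in <phoneme> tags."""
    escaped = escape(text)  # protects &, < and > in user-supplied text
    by_escaped_term = {escape(term): ipa for term, ipa in lexicon.items()}
    if by_escaped_term:
        # Longest terms first so a short term never shadows a longer one.
        terms = sorted(by_escaped_term, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")

        def tag(m):
            ipa = quoteattr(by_escaped_term[m.group(1)])
            return f'<phoneme alphabet="ipa" ph={ipa}>{m.group(1)}</phoneme>'

        # One pass over the text, so inserted tags are never re-matched.
        escaped = pattern.sub(tag, escaped)
    return f"<speak>{escaped}</speak>"


print(to_ssml("Ask Zyntra about R&D.", LEXICON))
```

Escaping before substitution matters: user text frequently contains `&` or `<`, and unescaped input would either break the SSML document or allow a caller to inject their own tags. Matching here is case-sensitive; whether that is desirable depends on the lexicon.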

Consent and abuse safeguards#

Voice cloning has direct misuse risk. Production systems gate cloning behind:

  • Consent capture. A spoken consent phrase (“I authorize cloning my voice for…”) recorded in the same session as the reference clip.
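  • Voice fingerprinting. A stored voiceprint (speaker embedding) of each reference clip, since an exact file hash would not match a new recording of the same voice; alerts if someone tries to clone a flagged or public-figure voice.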
  • Output watermarking. Inaudible watermarks (for example AudioSeal or SynthID) embedded in the generated waveform; detectable by paired detectors.

The pattern is mature enough that responsible API providers refuse cloning without consent capture, and regulators are starting to require it.
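
One way to picture the gate is as two checks performed before any cloning request is accepted. The sketch below assumes that each reference clip has already been turned into a speaker embedding by some speaker encoder (not shown), that flagged voices are stored as embeddings too, and that a cosine-similarity threshold of 0.75 is a reasonable cut-off. All three are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_flagged_voice(reference: Sequence[float],
                        flagged: Dict[str, Sequence[float]],
                        threshold: float = 0.75) -> Optional[str]:
    """Return the name of the closest flagged voice above the threshold, if any."""
    best_name, best_score = None, threshold
    for name, embedding in flagged.items():
        score = cosine_similarity(reference, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def may_clone(reference: Sequence[float], consent_recorded: bool,
              flagged: Dict[str, Sequence[float]]) -> bool:
    """Allow cloning only with recorded consent and no flagged-voice match."""
    return consent_recorded and match_flagged_voice(reference, flagged) is None


# Toy 3-dimensional embeddings; real speaker embeddings have hundreds of dimensions.
flagged_voices = {"public_figure_a": [1.0, 0.0, 0.0]}
print(may_clone([0.0, 1.0, 0.0], True, flagged_voices))   # True
print(may_clone([0.9, 0.1, 0.0], True, flagged_voices))   # False
```

The threshold trades false alarms against missed matches and would need to be tuned on real data for whichever encoder is used.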

Caching and pre-rendering#

For deterministic prompts (system messages, branded greetings, error responses), generate once and cache the waveform. A voice agent that says “I didn’t catch that, could you repeat?” thousands of times per day should never re-synthesize it.
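
A minimal in-memory version of that cache is sketched below. The `synthesize` callable stands in for whatever TTS client is in use and is purely hypothetical, as is the `TTSCache` class; the engine version is folded into the cache key as an assumption, so that a voice update does not silently serve stale audio.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """Cache synthesized audio keyed by engine version, voice and text."""

    def __init__(self, synthesize: Callable[[str, str], bytes],
                 engine_version: str) -> None:
        self._synthesize = synthesize  # hypothetical: (text, voice_id) -> audio bytes
        self._engine_version = engine_version
        self._store: Dict[str, bytes] = {}
        self.misses = 0

    def _key(self, text: str, voice_id: str) -> str:
        # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
        payload = "\x1f".join([self._engine_version, voice_id, text])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, text: str, voice_id: str) -> bytes:
        key = self._key(text, voice_id)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice_id)
        return self._store[key]


def fake_synthesize(text: str, voice_id: str) -> bytes:
    """Stand-in for a real engine call, used only to exercise the cache."""
    return f"{voice_id}:{text}".encode("utf-8")


cache = TTSCache(fake_synthesize, engine_version="v1")
first = cache.get("I didn't catch that, could you repeat?", "brand-voice")
second = cache.get("I didn't catch that, could you repeat?", "brand-voice")
print(cache.misses)  # 1
```

In production the dictionary would typically be replaced by object storage or a CDN, but the key design carries over unchanged.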

Dynamic style transfer#

Modern engines accept style hints (“read this excitedly”, “calm tone”, “whispering”) that condition expression without changing the voice. Useful for narration where energy shifts mid-paragraph; risky if overdone because style transitions on sentence boundaries can sound jarring.

Cross-lingual synthesis with preserved voice#

A speaker’s voice is encoded once; the model generates in another language with the same timbre. Used heavily in dubbing pipelines and accessibility tools. Quality depends on training data: voices stay convincing across major Indo-European languages, while cloning into tonal languages (Mandarin, Vietnamese) is harder to do cleanly.

Trade-offs#

  • Hosted API (ElevenLabs, OpenAI, Azure, Google, Amazon Polly). Best-quality voices, multilingual catalogs, voice-cloning policies built in. Per-character pricing. Latency depends on the provider’s region.
  • Self-hosted open-weights (Coqui XTTS, Tortoise, Bark, OpenVoice, F5-TTS, Kokoro). Full control, on-prem, no per-character cost at volume. Quality gap is narrowing but real for top-tier voices and expressive styles.

Other axes:

  • Latency vs. quality. Fast models (sub-200 ms first byte) trade some prosody quality for speed. Slow models (1 second+) produce more expressive output. Use fast for dialogue, slow for narration.
  • Voice catalog vs. cloning. A fixed catalog is operationally simple — quality is auditioned in advance, voices stay consistent. Cloning is flexible but has consent / abuse surface and quality variance per reference clip.
  • Streaming vs. file output. Streaming is necessary for dialogue, complicates caching and post-processing. File output is simpler, can be normalized and watermarked in batch.
  • Phonemized vs. raw-text input. Phonemized inputs (g2p preprocessing) give better pronunciation control but require maintained lexicons per language. Raw-text models pull lexicon work into the acoustic model — less control, more general.

Quality and evaluation#

Speech quality has been studied for decades; the metric vocabulary is well-developed:

  1. Mean Opinion Score (MOS). Human listeners rate samples 1 to 5. The gold standard; expensive; the only metric that captures “does this sound real?”.
  2. Comparative MOS (CMOS). Listeners hear two samples (A vs. B) and rate how much better or worse one is than the other, typically on a -3 to +3 scale. More sensitive than absolute MOS for ranking close-quality models.
  3. Automated proxies.
    • UTMOS, MOSNet. Models trained on human MOS ratings; useful for regression detection but unreliable for absolute quality.
    • Word Error Rate (WER) of an ASR transcription of the synthesized audio. Catches intelligibility regressions cheaply.
    • Speaker similarity (SECS). For cloning evaluation — cosine similarity between speaker embeddings of reference and synthesized audio.
  4. Production telemetry. Re-listen rate, skip rate, complaint rate, latency distribution. For voice agents: ASR re-prompt rate (the bot asked the user to repeat) often correlates with synthesis quality.
  5. Pronunciation accuracy. A frozen list of brand terms, proper nouns, and product names with reference pronunciations; ASR-based check on synthesized output.

For regulated deployments (accessibility, automotive, healthcare), intelligibility in noise is also measured — synthesize audio, mix with background noise, run ASR, measure WER. Real-world intelligibility differs significantly from quiet-room MOS.
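
The WER checks mentioned above reduce to a word-level edit distance between the text that was sent to the synthesizer and the ASR transcript of the resulting audio. The sketch below shows only that scoring step; the synthesis, noise mixing, and ASR stages are outside its scope, and the lowercase-and-split tokenization is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
print(word_error_rate("the quick brown fox", "the brown fox"))        # 0.25
```

Because WER is normalized by the reference length, it can exceed 1.0 when the transcript contains many inserted words. For regression tracking, the same fixed sentence list should be scored on every release so that changes reflect the synthesizer rather than the test set.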

Common pitfalls#

  • Skipping text normalization. Numbers, dates, and abbreviations need explicit rules. Without them, “$3.50” becomes “dollar three dot five zero” or worse.
  • No custom pronunciation dictionary. Brand and product names will be mispronounced. A 50-entry lexicon per language is high-leverage.
  • First-chunk latency ignored. Streaming masks decode latency but not prefill. If first byte takes 800 ms, the conversation feels broken regardless of how fast the rest streams.
  • Voice catalog drift. Engines update voices silently; a voice that sounded one way in February sounds different in May. Pin versions if you ship branded experiences.
  • Cloning without consent. Operational and legal risk. Build the consent flow before the feature.
  • Missing watermark on generated output. Audio provenance regulation is tightening; absent watermarks force costly retrofits.
  • Treating all languages as English. Non-English performance varies wildly per engine and per voice. Audit the languages you actually serve.
  • One model for narration and dialogue. Long-form narration and short-form dialogue have different prosody needs. Many production stacks use different models per use case.
  • No silence trimming. Models often emit a small silence at start and end of each chunk. Without trimming, concatenated chunks produce audible gaps.
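
The silence-trimming pitfall in the list above is cheap to address. The following sketch trims leading and trailing low-amplitude samples from a chunk before concatenation; it assumes samples are floats in the range -1.0 to 1.0, and the default threshold of 0.01 is an arbitrary illustrative value rather than a recommendation.

```python
from typing import List


def trim_silence(samples: List[float], threshold: float = 0.01,
                 keep: int = 0) -> List[float]:
    """Drop leading and trailing samples quieter than `threshold`.

    `keep` preserves that many quiet samples on each side as padding.
    Quiet samples in the middle of the chunk are left alone.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole chunk is silence
    start = max(loud[0] - keep, 0)
    end = min(loud[-1] + 1 + keep, len(samples))
    return samples[start:end]


chunk = [0.0, 0.001, 0.5, -0.4, 0.0, 0.3, 0.002, 0.0]
print(trim_silence(chunk))          # [0.5, -0.4, 0.0, 0.3]
print(trim_silence(chunk, keep=1))  # [0.001, 0.5, -0.4, 0.0, 0.3, 0.002]
```

Keeping a small amount of padding usually sounds better than a hard cut, since natural speech has short pauses between sentences; the right value depends on the voice and the engine.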

Why discrete audio codecs changed everything

Pre-2022 TTS was a stack of bespoke neural components (phoneme model, duration model, mel predictor, vocoder), each trained separately with task-specific losses. The 2022-2024 shift to discrete audio codecs (SoundStream, EnCodec, DAC) collapsed most of that into “predict the next audio token, autoregressively, given text”. A voice cloning model is now structurally a small language model with text in and audio tokens out. This is why architectural improvements to LLMs (better attention, longer context, mixture-of-experts) tend to carry over to TTS quickly: they are the same family of models. It is also why zero-shot voice cloning from 3 seconds of reference audio works at all: the model treats the reference voice as a conditioning prefix, in the same way an LLM treats a system prompt.