Text-to-Speech Generation Systems — Gen AI

Use cases

A text-to-speech (TTS) system turns text into audio that sounds like a person speaking. The 2010s versions (concatenative, then neural vocoders with predefined voices) were good enough for navigation and screen readers. The 2020s versions — ElevenLabs, OpenAI’s voice models, Google’s Chirp 3, Microsoft’s VALL-E line — produce voices that are hard to distinguish from human recordings, can be cloned from seconds of reference audio, and stream with latency low enough for real-time dialogue.

The shapes that recur:

Voice agents. Phone systems, drive-throughs, customer support bots, voice assistants. Real-time bidirectional audio; latency budget is harsh (sub-500 ms end-to-end).
Content narration. Audiobooks, podcasts, video voiceover, e-learning. Offline; quality and consistency matter more than latency.
Accessibility. Screen readers, reading aids, communication tools for non-speaking users. Trades cutting-edge quality for stability, language coverage, and offline support.
Localization. Dubbing existing video into other languages while preserving the speaker’s voice. Combines voice cloning with cross-lingual synthesis.
Game characters and dynamic dialogue. Runtime synthesis for player-specific content. Mid-latency budget; voice consistency across thousands of lines is the hard problem.
Custom branded voices. A product or company “voice” used across all generated audio. Cloned once; reused everywhere.

TTS is a poor fit when the audio must contain musical content (different stack — see Audio and Music Generation), when the production needs frame-accurate timing to existing video, or when the deployment context cannot tolerate any synthesized-voice artifacts.

System overview

A modern neural TTS system has three stages — text frontend, acoustic model, and vocoder — though increasingly the boundaries blur into end-to-end systems:

[Input text + voice ID + style hints]
    │
    ▼
[Text frontend]
    - normalization (numbers, dates, abbreviations)
    - tokenization
    - phonemization (text -> phonemes, optional)
    - language / script detection
    │
    ▼
[Conditioning]
    - voice / speaker embedding (from a reference clip or a stored profile)
    - style / emotion vector
    - prosody hints (SSML, punctuation cues)
    │
    ▼
[Acoustic model]
    - autoregressive token model OR
    - non-autoregressive (FastSpeech-style) OR
    - diffusion (NaturalSpeech, Tortoise variants)
    output: mel-spectrogram or discrete audio tokens
    │
    ▼
[Vocoder / decoder]
    - mel -> waveform (HiFi-GAN, BigVGAN)  OR
    - tokens -> waveform (SoundStream, EnCodec)
    output: 24 kHz or 48 kHz PCM
    │
    ▼
[Postprocessing]
    - level normalization, EQ
    - chunk-boundary smoothing for streaming
    - watermark
    │
    ▼
[Audio out — streamed or downloaded]

The total parameter count is much smaller than a frontier LLM's (often under a billion parameters), so single-GPU latency is dominated by the sequential autoregressive decode steps and the vocoder pass, not by raw model size.
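The stage diagram above can be sketched as a chain of swappable callables. This is a minimal illustration, not any engine's actual interface: `TTSPipeline`, its field names, and `peak_normalize` are hypothetical, and the stages passed in at the bottom are toy stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TTSPipeline:
    """Wires the stages together; each stage is a plain callable so it can be swapped."""
    frontend: Callable[[str], List[str]]                          # text -> tokens or phonemes
    acoustic: Callable[[List[str], Sequence[float]], List[int]]   # tokens + speaker embedding -> audio tokens
    decoder: Callable[[List[int]], List[float]]                   # audio tokens -> PCM samples
    postprocess: Callable[[List[float]], List[float]]             # level normalization and similar

    def synthesize(self, text: str, speaker_embedding: Sequence[float]) -> List[float]:
        tokens = self.frontend(text)
        audio_tokens = self.acoustic(tokens, speaker_embedding)
        samples = self.decoder(audio_tokens)
        return self.postprocess(samples)


def peak_normalize(samples: List[float], peak: float = 0.9) -> List[float]:
    """Scale so the loudest sample has magnitude `peak`; silence is returned unchanged."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]


# Toy stand-ins for the real models, only to show the data flow.
pipeline = TTSPipeline(
    frontend=str.split,
    acoustic=lambda tokens, speaker: [len(t) for t in tokens],
    decoder=lambda codes: [c / 10 for c in codes],
    postprocess=peak_normalize,
)
samples = pipeline.synthesize("hi there", speaker_embedding=[0.0])
```

Keeping each stage behind a plain callable makes it straightforward to replace, say, a mel vocoder with a codec decoder without touching the rest of the chain.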

Key components

Text frontend

The unglamorous half. Bad normalization is the single most common quality bug:

Numbers, dates, currency, time. “$3.50 on 1/2/25” must become spoken text. Locale-aware: 1/2/25 is January 2nd in the US and February 1st in most of the world.
Abbreviations. “Dr.”, “St.”, “Inc.”, “etc.” need context-dependent expansion: “Dr. Smith” is “Doctor Smith”, but “Elm Dr.” is “Elm Drive”.
Acronyms vs. initialisms. “NASA” is read as a word, “FBI” as letters. Some frontends use a small classifier for this; others rely on a lookup table.
Punctuation as prosody. Periods, commas, ellipses, em-dashes all map to pauses of different lengths. Question marks lift intonation; exclamation points alter energy.
Markup languages. SSML (Speech Synthesis Markup Language) lets authors mark up emphasis, pauses, pitch, and pronunciation. Most production engines accept SSML; many also accept their own inline style tags (“[whispering]”, “[excited]”).
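As a rough illustration of the normalization rules above, the sketch below expands small dollar amounts and resolves the 1/2/25 ambiguity by locale. It is deliberately narrow: it only covers amounts under $100, assumes two-digit years mean 20yy, and treats every locale other than `en-US` as day-first. All function names are made up for this example.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
MONTHS = ("January February March April May June July August September "
          "October November December").split()
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def cardinal(n):
    """Spell out 0..99, which is all this sketch needs."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def ordinal(n):
    """Turn the last word of the cardinal into its ordinal form (21 -> twenty-first)."""
    head, sep, last = cardinal(n).rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last


def year_words(yy):
    """Assumption: a two-digit year means 20yy."""
    if yy >= 10:
        return "twenty " + cardinal(yy)
    return "two thousand" + (" " + cardinal(yy) if yy else "")


def _currency(m):
    dollars, cents = int(m.group(1)), int(m.group(2) or 0)
    if dollars > 99:
        return m.group(0)  # out of range for this sketch: leave untouched
    parts = [cardinal(dollars) + (" dollar" if dollars == 1 else " dollars")]
    if cents:
        parts.append(cardinal(cents) + (" cent" if cents == 1 else " cents"))
    return " and ".join(parts)


def _date(m, locale):
    first, second, yy = (int(g) for g in m.groups())
    # en-US reads month/day; everything else is treated as day/month here.
    month, day = (first, second) if locale == "en-US" else (second, first)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return m.group(0)  # not a plausible date: leave untouched
    return f"{MONTHS[month - 1]} {ordinal(day)}, {year_words(yy)}"


def normalize(text, locale="en-US"):
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?\b", _currency, text)
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b",
                  lambda m: _date(m, locale), text)


print(normalize("$3.50 on 1/2/25", "en-US"))
# three dollars and fifty cents on January second, twenty twenty-five
print(normalize("$3.50 on 1/2/25", "en-GB"))
# three dollars and fifty cents on February first, twenty twenty-five
```

A production frontend would need far more rules (large numbers, other currencies, times, abbreviations), which is why this layer tends to be a maintained rule set or a dedicated model rather than a handful of regexes.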

Acoustic model architectures

Three families dominate, each with trade-offs:

Autoregressive (AR) token models. Generate audio one token at a time, conditioned on text. Examples include VALL-E and Tortoise; the current OpenAI and ElevenLabs models appear to belong here too, though their architectures are not published in detail. Highest quality, especially on expressiveness; slowest because each token waits for the previous one.
Non-autoregressive (NAR) models. Predict a full mel-spectrogram in parallel (FastSpeech, FastSpeech 2). Fast; less expressive variation; needs explicit duration prediction.
Diffusion-based. Iterative refinement in latent or mel space (NaturalSpeech 2/3, StyleTTS 2). Strong quality; latency depends on step count.

The 2024+ frontier is mostly AR token models on top of discrete audio codecs. They generate text-like sequences of audio tokens, and the codec's decoder converts those tokens back to a waveform.

Discrete audio codecs

A neural codec compresses raw audio into a small sequence of discrete tokens, often with multiple codebooks per timestep:

EnCodec, SoundStream, DAC. Compress 24 kHz audio to roughly 50 to 75 frames per second, with several codebook tokens per frame (8 is a common setting), so a few hundred tokens per second in total. The acoustic model predicts these tokens; the codec decoder produces the waveform.
Why this matters: predicting a few hundred tokens per second is much cheaper than predicting 24,000 raw samples per second. It also lets language-model architectures handle audio as if it were text.

This is the architectural shift that makes modern voice cloning practical — the audio model is essentially a small LLM operating on audio tokens.
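The size of that saving is easy to check with arithmetic. The sketch below reads the figures above as 50 to 75 codec frames per second with 8 codebook tokens per frame (an interpretation of the text, not a spec for any particular codec); the function names are illustrative.

```python
def tokens_per_second(frame_rate_hz: int, codebooks: int) -> int:
    """Every codec frame carries one token per codebook."""
    return frame_rate_hz * codebooks


def samples_per_token(sample_rate_hz: int, frame_rate_hz: int, codebooks: int) -> float:
    """How many raw PCM samples each predicted token stands in for."""
    return sample_rate_hz / tokens_per_second(frame_rate_hz, codebooks)


for frame_rate in (50, 75):
    print(frame_rate,
          tokens_per_second(frame_rate, 8),
          samples_per_token(24_000, frame_rate, 8))
# 50 400 60.0
# 75 600 40.0
```

So even counting every codebook, the model predicts 40 to 60 times fewer symbols than a sample-level model would at 24 kHz, and each symbol comes from a small fixed vocabulary.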

Vocoder

The vocoder maps the acoustic representation to a final waveform. Even with token-based models, a vocoder (or codec decoder) is the last stage:

HiFi-GAN, BigVGAN. GAN-based; very fast; the standard for mel-to-wave conversion since 2021.
Neural codec decoders. When the acoustic model emits codec tokens, the codec’s own decoder is the vocoder. No separate stage needed.

Vocoder latency on a modern GPU is under 50 ms for a few seconds of audio.

Voice conditioning

How the model knows which voice to use:

Speaker ID embedding. A learned vector per known voice. Used for product-shipped voice catalogs.
Reference-based cloning. Pass a 3 to 30 second clip; the model encodes it and conditions generation. Works zero-shot from a single short sample on modern systems.
Fine-tuned voice. Train an adapter on 5 to 30 minutes of a speaker’s recordings. Higher fidelity than zero-shot cloning, at the cost of needing far more audio from the speaker plus a training run.
Multi-speaker latent space. Some models let you interpolate between voices or specify them with text descriptions (“a calm female voice in her thirties”).

Streaming and chunking

For real-time use, the model can’t wait for the full text. Two patterns:

Token-level streaming. The acoustic model emits audio tokens as text tokens arrive. The vocoder runs in chunks. End-to-end latency is dominated by chunk size and overlap.
Sentence chunking. Split text on sentence boundaries; generate each chunk to completion; stream chunks back-to-back. Simpler; introduces audible boundaries on pause/no-pause transitions if not crossfaded.

Sub-300 ms first-byte audio latency is the target for natural voice agents. Low-latency engines such as ElevenLabs Flash and the OpenAI Realtime API are built to hit that range.
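The sentence-chunking pattern, including the crossfade that hides chunk boundaries, might look roughly like this. It is a pure-Python sketch on lists of float samples; `synthesize` is a hypothetical callable standing in for the engine, and the regex splitter is naive (it would break on “Dr. Smith”, so a real system would split after text normalization).

```python
import re


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace; punctuation stays with the sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the tail of `a` into the head of `b`."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b` ramps up across the overlap
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[:len(a) - overlap] + blended + b[overlap:]


def render(text, synthesize, overlap=240):
    """Synthesize sentence by sentence and stitch the chunks together.

    `synthesize` is a hypothetical callable: sentence -> list of float samples.
    An overlap of 240 samples is 10 ms at 24 kHz (an arbitrary choice here).
    """
    audio = []
    for sentence in split_sentences(text):
        audio = crossfade(audio, synthesize(sentence), overlap)
    return audio
```

In a real streaming path the stitched audio would be emitted as soon as each chunk is ready, holding back only the last `overlap` samples until the next chunk arrives.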

Implementation patterns

SSML for production control

Production TTS pipelines wrap user text in SSML for fine control:

<break time="500ms"/> for explicit pauses.
<prosody rate="slow" pitch="-2st"> for emphasis or comedic timing.
<phoneme alphabet="ipa" ph="..."> for proper-noun pronunciation that the model won’t get right alone.
<say-as interpret-as="characters"> for spelled-out IDs.

A custom-pronunciation dictionary (lexicon) sits in front of the model and patches in <phoneme> tags for product names, brand terms, and known mispronunciations.
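A lexicon pass of this kind can be sketched in a few lines. The version below wraps known terms in `<phoneme>` tags and XML-escapes everything else; the `LEXICON` entry and its IPA string are illustrative only, matching is case-sensitive, and `to_ssml` is a made-up name rather than any engine's API.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Illustrative entry; a real lexicon would be maintained per language.
LEXICON = {"Nginx": "ˈɛndʒɪn ɛks"}


def to_ssml(text, lexicon):
    """Wrap lexicon terms in <phoneme> tags and escape the rest for XML."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Longest terms first so "Acme Cloud" wins over "Acme" when both are listed.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")
    parts, pos = [], 0
    for m in pattern.finditer(text):
        parts.append(escape(text[pos:m.start()]))
        term = m.group(0)
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[term])}>'
            f"{escape(term)}</phoneme>"
        )
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(to_ssml("Restart Nginx & retry", LEXICON))
```

Escaping matters here: user text containing `&` or `<` would otherwise produce invalid SSML and a rejected request.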

Voice cloning safeguards

Voice cloning has direct misuse risk. Production systems gate cloning behind:

Consent capture. A spoken consent phrase (“I authorize cloning my voice for…”) recorded in the same session as the reference clip.
Voice fingerprinting. A stored fingerprint of the reference clip (in practice a speaker embedding rather than a byte-level hash, since a hash only matches the identical file); alerts if someone tries to clone a flagged or public-figure voice.
Output watermarking. Inaudible watermarks (for example AudioSeal or SynthID) embedded in the generated waveform and detectable by paired detectors.

The pattern is mature enough that responsible API providers refuse cloning without consent capture, and regulators are starting to require it.
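A gate along these lines could be sketched as follows. Everything here is hypothetical (`CloneRequest`, `check_clone_request`, the reason strings), and the fingerprint is a plain SHA-256 of the clip bytes, which only catches byte-identical uploads; a production system would more plausibly compare speaker embeddings against the flagged set.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class CloneRequest:
    session_id: str
    reference_clip: bytes
    # Session in which the spoken consent phrase was recorded; None if missing.
    consent_session_id: Optional[str]


def fingerprint(clip: bytes) -> str:
    """Exact-match fingerprint; stands in for a real voiceprint comparison."""
    return hashlib.sha256(clip).hexdigest()


def check_clone_request(req: CloneRequest, flagged: Set[str]) -> Tuple[bool, str]:
    """Return (allowed, reason). Consent is checked before the fingerprint lookup."""
    if req.consent_session_id is None:
        return False, "no consent phrase recorded"
    if req.consent_session_id != req.session_id:
        return False, "consent recorded in a different session"
    if fingerprint(req.reference_clip) in flagged:
        return False, "reference clip matches a flagged voice"
    return True, "ok"
```

Returning a reason alongside the decision makes refusals auditable, which helps when a regulator or a user asks why a cloning request was denied.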

Caching and pre-rendering

For deterministic prompts (system messages, branded greetings, error responses), generate once and cache the waveform. A voice agent that says “I didn’t catch that, could you repeat?” thousands of times per day should never re-synthesize it.
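One way to sketch such a cache is to key on a hash of the text, the voice, and the model version, so that a voice or engine update naturally invalidates old entries. `AudioCache` and the `synthesize` callable are hypothetical; a real deployment would back this with object storage rather than an in-memory dict.

```python
import hashlib


class AudioCache:
    """In-memory waveform cache keyed by (model version, voice, normalized text)."""

    def __init__(self, synthesize):
        # Hypothetical callable: (text, voice_id, model_version) -> audio bytes.
        self._synthesize = synthesize
        self._store = {}

    @staticmethod
    def key(text, voice_id, model_version):
        normalized = " ".join(text.split())  # collapse whitespace differences
        payload = "\x1f".join((model_version, voice_id, normalized))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, text, voice_id, model_version):
        k = self.key(text, voice_id, model_version)
        if k not in self._store:
            self._store[k] = self._synthesize(text, voice_id, model_version)
        return self._store[k]
```

Including the model version in the key is a deliberate choice: it prevents a cached greeting rendered by an older voice from sitting next to freshly synthesized audio that sounds slightly different.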

Dynamic style transfer

Modern engines accept style hints (“read this excitedly”, “calm tone”, “whispering”) that condition expression without changing the voice. Useful for narration where energy shifts mid-paragraph; risky if overdone because style transitions on sentence boundaries can sound jarring.

Cross-lingual synthesis with preserved voice

A speaker’s voice is encoded once; the model generates in another language with the same timbre. Used heavily in dubbing pipelines and accessibility tools. Quality depends on training data: voices generally stay convincing across major Indo-European languages, while tonal languages (Mandarin, Vietnamese) are harder targets for a cloned voice.

Trade-offs

Hosted API (ElevenLabs, OpenAI, Azure, Google, Amazon Polly) — best quality voices, multilingual catalogs, voice-cloning policies built in. Per-character pricing. Latency depends on the provider’s region.

Self-hosted open-weights (Coqui XTTS, Tortoise, Bark, OpenVoice, F5-TTS, Kokoro) — full control, on-prem, no per-character cost at volume. Quality gap is narrowing but real for top-tier voices and expressive styles.

Other axes:

Latency vs. quality. Fast models (sub-200 ms first byte) trade some prosody quality for speed. Slow models (1 second+) produce more expressive output. Use fast for dialogue, slow for narration.
Voice catalog vs. cloning. A fixed catalog is operationally simple — quality is auditioned in advance, voices stay consistent. Cloning is flexible but has consent / abuse surface and quality variance per reference clip.
Streaming vs. file output. Streaming is necessary for dialogue, complicates caching and post-processing. File output is simpler, can be normalized and watermarked in batch.
Phonemized vs. raw-text input. Phonemized inputs (grapheme-to-phoneme, or G2P, preprocessing) give better pronunciation control but require maintained lexicons per language. Raw-text models pull lexicon work into the acoustic model: less control, more general.

Quality and evaluation

Speech quality has been studied for decades; the metric vocabulary is well-developed:

Mean Opinion Score (MOS). Human listeners rate samples 1 to 5. The gold standard; expensive; the only metric that captures “does this sound real?”.
Comparative MOS (CMOS). Listeners hear two samples (A vs. B) and rate how much better or worse one is than the other, typically on a -3 to +3 scale. More sensitive than absolute MOS for ranking close-quality models.
Automated proxies.
- UTMOS, MOSNet. Models trained on human MOS ratings; useful for regression detection but unreliable for absolute quality.
- Word Error Rate (WER) of an ASR transcription of the synthesized audio. Catches intelligibility regressions cheaply.
- Speaker similarity (SECS, speaker encoder cosine similarity). For cloning evaluation: cosine similarity between speaker embeddings of reference and synthesized audio.
Production telemetry. Re-listen rate, skip rate, complaint rate, latency distribution. For voice agents: ASR re-prompt rate (the bot asked the user to repeat) often correlates with synthesis quality.
Pronunciation accuracy. A frozen list of brand terms, proper nouns, and product names with reference pronunciations; ASR-based check on synthesized output.
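Two of the automated proxies above reduce to short formulas. The sketch below computes WER as word-level edit distance divided by reference length, and speaker similarity as the cosine between two embedding vectors. It assumes the ASR transcript and the speaker embeddings come from elsewhere; the function names are illustrative.

```python
import math


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length, non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(wer("three dollars and fifty cents", "three dollars and fifteen cents"))  # 0.2
```

For regression testing, tracking the change in WER between two model versions on a fixed text set is usually more informative than the absolute value, since the ASR model contributes its own errors.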

For regulated deployments (accessibility, automotive, healthcare), intelligibility in noise is also measured — synthesize audio, mix with background noise, run ASR, measure WER. Real-world intelligibility differs significantly from quiet-room MOS.
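The mixing step of that procedure can be sketched as below: scale the noise so the mix lands at a target signal-to-noise ratio, then hand the result to ASR. This is a pure-Python illustration on lists of float samples; the function names are hypothetical, and the ASR and WER steps are left out.

```python
import math


def power(samples):
    """Mean squared amplitude."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech, p_noise = power(speech), power(tiled)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must be non-silent")
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mix, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))
```

Sweeping `snr_db` over a few values (for example 20, 10, and 0 dB) and plotting WER against it gives a simple intelligibility curve that can be compared across voices or model versions.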

Common pitfalls

Skipping text normalization. Numbers, dates, and abbreviations need explicit rules. Without them, “$3.50” becomes “dollar three dot five zero” or worse.
No custom pronunciation dictionary. Brand and product names will be mispronounced. A 50-entry lexicon per language is high-leverage.
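First-chunk latency ignored. Streaming hides the time spent decoding the rest of the utterance, but not the time before the first audio chunk (text processing, prefill, and the first decode steps). If first byte takes 800 ms, the conversation feels broken regardless of how fast the rest streams.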
Voice catalog drift. Engines update voices silently; a voice that sounded one way in February sounds different in May. Pin versions if you ship branded experiences.
Cloning without consent. Operational and legal risk. Build the consent flow before the feature.
Missing watermark on generated output. Audio provenance regulation is tightening; absent watermarks force costly retrofits.
Treating all languages as English. Non-English performance varies wildly per engine and per voice. Audit the languages you actually serve.
One model for narration and dialogue. Long-form narration and short-form dialogue have different prosody needs. Many production stacks use different models per use case.
No silence trimming. Models often emit a small silence at start and end of each chunk. Without trimming, concatenated chunks produce audible gaps.
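For the silence-trimming pitfall, a minimal amplitude-threshold trimmer might look like this. The threshold of 0.01 (on samples in the range -1 to 1) and the `keep` padding are arbitrary illustrative defaults; real pipelines often use an energy window or a voice-activity detector instead of a per-sample check.

```python
def trim_silence(samples, threshold=0.01, keep=0):
    """Drop leading and trailing samples quieter than `threshold`.

    `keep` retains that many quiet samples on each side so word onsets
    and decays are not clipped.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    if start == end:
        return []  # the whole chunk was silence
    start = max(0, start - keep)
    end = min(len(samples), end + keep)
    return samples[start:end]


print(trim_silence([0.0, 0.001, 0.5, -0.4, 0.0, 0.0]))  # [0.5, -0.4]
```

Trimming each chunk and then inserting a deliberate, fixed pause between sentences gives more predictable pacing than relying on whatever silence the model happened to emit.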

Why discrete audio codecs changed everything

Pre-2022 TTS was a stack of bespoke neural components (phoneme model, duration model, mel predictor, vocoder), each trained separately with task-specific losses. The 2022-2024 shift to discrete audio codecs (SoundStream, EnCodec, DAC) collapsed most of that into “predict the next audio token, autoregressively, given text”. A voice cloning model is now structurally a small language model with text in and audio tokens out. This is why architectural improvements to LLMs (better attention, longer context, mixture-of-experts) tend to carry over to TTS: they are the same family of models. It is also why zero-shot voice cloning from 3 seconds of reference audio works at all: the model treats the reference voice as a conditioning prefix, much as an LLM treats a system prompt.