Audio and Music Generation — Gen AI

Use cases

Audio and music generation systems produce non-speech audio — full songs, sound effects, ambient textures, instrumental beds, foley. The 2023-2026 generation (MusicLM, MusicGen, AudioLDM, Stable Audio, Suno, Udio, Riffusion, ElevenLabs Audio, Meta’s AudioBox) generates minutes of coherent music from a text prompt at quality good enough for production use in ads, games, and content creation.

The shapes that recur:

Background music for video. YouTubers, podcasters, indie filmmakers generating beds that don’t need a sync license. Volume is the value; per-track quality is moderate.
Game audio and procedural soundscapes. Music and ambient that adapts to game state. Some pipelines generate offline; others stream short loops at runtime.
Advertising and short-form content. 15 to 60 second tracks for ads, trailers, social. Brand-matched style; royalty-free by construction.
Sound effects libraries. “Glass breaking”, “footsteps on gravel”, “spaceship hum”. Replaces or augments stock-library searches.
Songwriting assistants and demos. Lyric + melody + arrangement generation as a starting point for human producers. The output is rarely the final product; it’s a draft.
Audio for accessibility. Generated music with explicit attributes (calm, energetic, focus) for therapy and well-being apps.
Music-conditioned content. Audio passed into a video model for lyric video / music-video generation. See Text-to-Video Systems.

Audio generation is a poor fit when the output must be a specific copyrighted style without licensing (the legal landscape is unresolved), when human-grade musical performance is the bar (modern systems still produce identifiable “AI music” artifacts at close listening), or when long-form structural coherence beyond a few minutes is required.

System overview

A modern audio generation system shares architectural DNA with TTS and image generation. The dominant pipeline is token-based, operating on discrete audio tokens produced by a neural codec:

[Prompt: text + optional conditioning]
    │  - lyrics
    │  - melody / chord reference audio
    │  - genre / tempo / key tags
    │  - duration target
    ▼
[Prompt preprocessing]
    - parse style and structure tags
    - lyric phonemization (for sung vocals)
    - safety filter (artist style, copyright matches)
    │
    ▼
[Text encoder]
    text -> conditioning vector (T5, MusicLM-style joint embeddings, or CLAP)
    │
    ▼
[Acoustic token model]
    autoregressive transformer OR diffusion
    output: codec tokens at 50-75 Hz across multiple codebooks
    operating in stages (coarse-to-fine) or in parallel
    │
    ▼
[Neural codec decoder]
    tokens -> waveform (SoundStream / EnCodec / DAC / Mel-VAE)
    typically 24 kHz or 44.1 kHz output
    │
    ▼
[Postprocessing]
    - loudness normalization (LUFS target)
    - stereo widening, mastering chain
    - fade in / out
    - watermark (SynthID Audio or similar)
    │
    ▼
[Audio out: WAV / FLAC / MP3]

Compared to TTS, music has longer time horizons (minutes, not seconds), richer harmonic content, and an extra layer of structural complexity (sections, phrases, motifs). That extra structure pushes the model toward hierarchical generation strategies that TTS doesn’t need.

Key components

Representation: waveform vs. spectrogram vs. tokens

Three families of audio representation, each with a corresponding generative approach:

Raw waveform. Operate directly on samples at 16/24/44.1/48 kHz. WaveNet pioneered this in 2016. Quality is high; compute cost is brutal: autoregressive generation at 44.1 kHz means 44,100 sequential prediction steps per second of audio.
Spectrogram / mel. Operate on time-frequency representations (typically mel-spectrograms). Riffusion famously turned this into “treat music as an image and run image diffusion”. Quality is good for short clips; long-range structure is harder.
Discrete audio codec tokens. Compress audio with a neural codec (SoundStream, EnCodec, DAC) into a sequence of discrete tokens at 50 to 75 Hz across 4 to 9 codebooks. Generate the tokens with a language model. This is the dominant approach for production music systems since 2023.

Token-based generation is the consequential idea — it turns “predict the next millisecond of audio” into “predict the next discrete token”, which language-model architectures handle natively.
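The size of that shift is easy to check with arithmetic. The sketch below compares the number of autoregressive steps for sample-level generation against flattened codec tokens; the 50 Hz frame rate and 4 codebooks are illustrative values taken from the low end of the ranges quoted above, and the function names are hypothetical.

```python
def waveform_steps(seconds: float, sample_rate: int = 44_100) -> int:
    """Autoregressive steps when predicting one audio sample at a time."""
    return int(seconds * sample_rate)


def token_steps(seconds: float, frame_rate: int = 50, codebooks: int = 4) -> int:
    """Total codec tokens when all codebooks are flattened into one sequence."""
    return int(seconds * frame_rate * codebooks)


clip_seconds = 30
raw = waveform_steps(clip_seconds)
tok = token_steps(clip_seconds)
print(f"{clip_seconds} s clip: {raw} samples vs {tok} tokens ({raw / tok:.1f}x shorter)")
```

Under these assumptions a 30-second clip drops from over a million prediction steps to a few thousand, which is what brings music within reach of ordinary transformer context lengths.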

Coarse-to-fine token generation

Codec tokens come in stages — the first codebook captures coarse content (melody, rhythm, dominant timbre); later codebooks add fine detail (texture, ambience, exact spectral content). Production models exploit this:

Coarse stage. An autoregressive model generates the first 1 or 2 codebook streams. This determines the music’s structure and main content.
Fine stage. A separate (often non-autoregressive) model fills in the remaining codebooks conditioned on the coarse tokens. This adds detail without lengthening the autoregressive chain.

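MusicLM follows a staged pattern of this kind, inheriting AudioLM’s hierarchy of semantic, coarse, and fine tokens. MusicGen reaches the same goal differently: a single transformer predicts all codebooks, offset from one another by a delay pattern, so the number of decoding steps stays close to the frame count instead of frames times codebooks. Either way, the win is that the slow autoregressive part operates on a much shorter sequence than the full token rate would imply.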
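The coarse-versus-fine split comes from residual quantization: each codebook encodes whatever error the previous ones left behind. A toy sketch of that idea follows, using scalar values and hand-picked codebooks as a simplifying assumption (real codecs quantize learned vectors with learned codebooks); all names are hypothetical.

```python
def nearest(value: float, codebook: list[float]) -> int:
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value: float, codebooks: list[list[float]]) -> list[int]:
    """Quantize value stage by stage; each stage encodes the remaining residual."""
    residual = value
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        indices.append(idx)
        residual -= codebook[idx]
    return indices


def rvq_decode(indices: list[int], codebooks: list[list[float]], stages: int) -> float:
    """Reconstruct using only the first `stages` codebooks."""
    return sum(codebooks[s][indices[s]] for s in range(stages))


# Toy codebooks: each stage covers a finer range than the one before it.
CODEBOOKS = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
    [-0.04, -0.02, 0.0, 0.02, 0.04],
]

x = 0.665
codes = rvq_encode(x, CODEBOOKS)
errors = [abs(x - rvq_decode(codes, CODEBOOKS, s)) for s in (1, 2, 3)]
print(codes, [round(e, 3) for e in errors])
```

The first index alone gives a rough reconstruction, and each further index shrinks the error. That is why a model can commit to the first codebook early and fill in the rest later.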

Conditioning encoders

How the model knows what to generate:

Text via joint embeddings (CLAP-style). Train a model that places text descriptions and audio clips in a shared embedding space. Text conditioning is then a vector in that space. MusicLM takes this approach with MuLan embeddings; AudioLDM does the same with CLAP.
Text via T5 or other large text encoder. Standard text encoding, no audio-text alignment needed at conditioning time. Used by MusicGen and many open systems.
Melody conditioning. Pass a reference clip; encode its melody (chromagram or chroma tokens); the model generates new audio with the same melodic line in a different style. “Hum it; restyle it” workflows.
Lyrics conditioning. For models that generate sung vocals (Suno, Udio), lyrics are passed alongside the prompt and align with generated vocal tokens.
Genre / tag conditioning. Some models accept explicit tag inputs (“rock, 120 bpm, distorted guitar”). Combines with text prompt.
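To make the melody-conditioning item concrete, here is a minimal sketch of reducing a pitch track to pitch classes. It assumes a fundamental-frequency track in Hz is already available (0 meaning unvoiced), which is a simplification: a real chromagram folds spectral energy into 12 bins rather than starting from a clean pitch track. The function names are hypothetical.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Map a frequency to its pitch class (0 = C) via the nearest MIDI note."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_sequence(f0_track: list[float]) -> list[str]:
    """Pitch-class names for every voiced frame; unvoiced frames (0 Hz) are skipped."""
    return [PITCH_CLASSES[pitch_class(f)] for f in f0_track if f > 0]


# C4, D4, E4, an unvoiced frame, then C5: the octave is discarded on purpose.
melody_hz = [261.63, 293.66, 329.63, 0.0, 523.25]
print(chroma_sequence(melody_hz))
```

Discarding octave and timbre is the point: the conditioning signal carries the melodic line but leaves the model free to choose instrumentation and register.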

Structural conditioning and long-form generation

Generating coherent multi-minute music requires either very long context or hierarchical generation:

Long-context autoregression. The model generates seconds-to-minutes of audio in a single autoregressive pass. Limited by the sequence length the model can handle.
Hierarchical or hybrid generation. Generate a structural plan first (intro / verse / chorus / bridge), then realize each section, then refine. Modern long-form music models (Suno, Udio for 2-3 minute tracks) use variants of this.
Chunk-and-extend. Generate one section, condition the next section on the tail of the previous one. Drift across many chunks is the failure mode.
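The chunk-and-extend strategy can be sketched as a loop that feeds the tail of the track so far back in as context. The `generate` callable stands in for a real model and is purely hypothetical, as are the parameter names; the fake generator below just counts upward so the stitching is easy to inspect.

```python
from typing import Callable


def chunk_and_extend(
    generate: Callable[[list[int], int], list[int]],
    total_tokens: int,
    chunk_tokens: int,
    overlap_tokens: int,
) -> list[int]:
    """Build a long token sequence by repeatedly conditioning on the previous tail."""
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap must be non-negative and smaller than the chunk")
    track: list[int] = []
    while len(track) < total_tokens:
        # Context is the tail of what exists so far (empty on the first pass).
        context = track[-overlap_tokens:] if overlap_tokens else []
        # The model window holds context plus new tokens, so new tokens get the rest.
        new_count = min(chunk_tokens - len(context), total_tokens - len(track))
        track.extend(generate(context, new_count))
    return track


def fake_generate(context: list[int], n: int) -> list[int]:
    """Stand-in model: continues counting from the last context token."""
    start = context[-1] + 1 if context else 0
    return [start + i for i in range(n)]


print(chunk_and_extend(fake_generate, total_tokens=10, chunk_tokens=4, overlap_tokens=2))
```

The model only ever sees `overlap_tokens` of history, which is exactly why drift accumulates: anything established earlier than the overlap window is invisible to later chunks.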

Vocals: speech-like generation inside music

Generating sung vocals is harder than instrumental music: lyrics must phonetically align with the melody, vocal timbre must stay consistent, and pitch / vibrato must be musical. The 2024 wave of music models with vocals (Suno, Udio) combines a music-conditioned acoustic model with TTS-like lyric phonemization, producing tokens that include both vocal and instrumental content jointly.

Audio codec choice

The codec is the unsung hero. A good codec compresses 44.1 kHz stereo into ~75 tokens per second per codebook with minimal perceptual loss. The codec’s decoder is essentially the vocoder for the system.

SoundStream / Lyra. Google’s line; trained on diverse audio.
EnCodec. Meta’s line; widely used in open-source music generation.
DAC (Descript Audio Codec). High-quality general audio; the de facto choice for many 2024-2026 systems.
Specialized music codecs. Tuned on music corpora; better fidelity on harmonic content; less general.

Codec quality is the ceiling. A perfect generative model with a mediocre codec produces mediocre audio.
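How aggressive the compression is follows directly from frame rate, codebook count, and codebook size. The sketch below works through one configuration; the 75 Hz rate is the figure quoted above, while 8 codebooks and 1024 entries per codebook are illustrative assumptions, and both function names are hypothetical.

```python
import math


def codec_bitrate_kbps(frame_rate_hz: float, codebooks: int, codebook_size: int) -> float:
    """Bitrate of a token stream: frames/s x codebooks x bits per token."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * codebooks * bits_per_token / 1000


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int = 16, channels: int = 2) -> float:
    """Bitrate of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels / 1000


codec = codec_bitrate_kbps(frame_rate_hz=75, codebooks=8, codebook_size=1024)
pcm = pcm_bitrate_kbps(sample_rate_hz=44_100)
print(f"codec {codec:.1f} kbps vs PCM {pcm:.1f} kbps ({pcm / codec:.0f}x compression)")
```

A compression ratio in the hundreds leaves very little room for error, which is why reconstruction quality should be audited on real music before any generative model is trained on top.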

Implementation patterns

Lyric-aligned vocal generation

For songs with vocals: phonemize the lyrics, generate a melody track via instrumental tokens, then generate vocal tokens conditioned on both the lyrics and the melody. Alignment is enforced via attention masks or duration models inherited from TTS. The result is sung audio whose syllables track the underlying melody.
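A duration-based alignment can be sketched as mapping each syllable onto the token frames its note occupies. This assumes one syllable per note (no melisma), a fixed tempo, and a 50 Hz token frame rate; the function name and the input format are hypothetical.

```python
def align_syllables(
    syllables: list[str],
    note_beats: list[float],
    bpm: float,
    frame_rate_hz: int = 50,
) -> list[tuple[str, int, int]]:
    """Return (syllable, start_frame, end_frame) spans, one syllable per note."""
    if len(syllables) != len(note_beats):
        raise ValueError("this sketch assumes exactly one syllable per note")
    seconds_per_beat = 60.0 / bpm
    spans = []
    elapsed_s = 0.0
    for syllable, beats in zip(syllables, note_beats):
        start_frame = round(elapsed_s * frame_rate_hz)
        elapsed_s += beats * seconds_per_beat
        end_frame = round(elapsed_s * frame_rate_hz)
        spans.append((syllable, start_frame, end_frame))
    return spans


# Two quarter notes and a half note at 120 BPM.
spans = align_syllables(["hel", "lo", "world"], [1, 1, 2], bpm=120)
for syllable, start, end in spans:
    print(f"{syllable:>6}: frames {start}-{end}")
```

Spans like these are what an attention mask or duration model would consume: each vocal frame is only allowed to attend to the phonemes of the syllable that owns it.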

Stem separation as a postprocess

Some workflows generate a full mix, then separate it into stems (vocals, drums, bass, other) using a source-separation model (Demucs, Open-Unmix, Spleeter). Producers can then remix the stems. The reverse direction — generating stems separately and mixing them — is harder because the model must keep them harmonically consistent.

Mastering chain

Generated music often comes out perceptually quiet or unmastered. A postprocess applies dynamic range compression, EQ, stereo widening, and loudness normalization (typically around -14 LUFS for streaming platforms and -23 LUFS for EBU R 128 broadcast delivery). Off-the-shelf mastering services (LANDR, BandLab, eMastered) or hand-built DSP graphs do this.
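The normalization step reduces to measuring level, computing a gain, and guarding against clipping. The sketch below uses plain RMS level as a stand-in for LUFS, which is an explicit simplification: a real LUFS meter applies the K-weighting filter and gating defined in ITU-R BS.1770. Function names are hypothetical and samples are assumed to be floats in the range -1 to 1.

```python
import math


def rms_db(samples: list[float]) -> float:
    """RMS level in dB relative to full scale (NOT true LUFS: no K-weighting or gating)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def normalize(samples: list[float], target_db: float = -14.0, peak_ceiling: float = 1.0) -> list[float]:
    """Scale samples toward target_db, backing off if the peak would exceed the ceiling."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = 10 ** ((target_db - rms_db(samples)) / 20)
    gain = min(gain, peak_ceiling / peak)  # never clip
    return [s * gain for s in samples]


# A quiet one-second 440 Hz tone, roughly -29 dB RMS before normalization.
tone = [0.05 * math.sin(2 * math.pi * 440 * n / 44_100) for n in range(44_100)]
leveled = normalize(tone)
print(f"before {rms_db(tone):.1f} dB, after {rms_db(leveled):.1f} dB")
```

When the peak guard kicks in, the track lands below target. That is the cue to apply compression or limiting first, which is why normalization sits at the end of the mastering chain.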

Style prompts and reference clips

The most reliable way to get a specific style is a reference clip rather than a text description. “Sounds like this 5-second sample” beats “lo-fi hip-hop with jazz piano and rain ambience” for most production goals. Reference-based generation can also simplify legal posture, provided the user owns or has licensed the reference: the conditioning signal is then the user’s own input, not a request to imitate a protected work.

Genre and tempo constraints

Hard constraints on tempo (BPM), key, and meter help downstream syncing to picture or to existing tracks. Some models accept these natively; others require multiple generations and a tempo-quantization postprocess.
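The arithmetic behind tempo constraints is simple enough to check before generating anything. The sketch below assumes a constant tempo and a fixed meter; the function names are hypothetical.

```python
def bars_for_duration(duration_s: float, bpm: float, beats_per_bar: int = 4) -> float:
    """How many bars fit in duration_s at the given tempo and meter."""
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    return duration_s / seconds_per_bar


def bpm_to_fit(duration_s: float, bars: int, beats_per_bar: int = 4) -> float:
    """Tempo at which a whole number of bars fills duration_s exactly."""
    return bars * beats_per_bar * 60.0 / duration_s


def stretch_ratio(source_bpm: float, target_bpm: float) -> float:
    """Playback-speed factor for a time-stretch: above 1 speeds the audio up."""
    return target_bpm / source_bpm


print(bars_for_duration(30, bpm=120))   # bars in a 30 s spot at 120 BPM, 4/4
print(bpm_to_fit(30, bars=16))          # tempo that makes 16 bars last 30 s
print(round(stretch_ratio(120, 128), 4))
```

For a 30-second spot, 120 BPM yields 15 bars, an awkward length for 4- or 8-bar phrasing. Requesting 128 BPM instead gives a clean 16 bars, and a small time-stretch can close the gap if the model misses the tempo slightly.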

Trade-offs

Hosted API (Suno, Udio, Stable Audio, ElevenLabs Music, Meta AudioBox) — best vocal music quality, integrated mastering, opinionated copyright policies, fast iteration. Per-track or per-minute pricing. Cannot inspect or fine-tune weights.

Self-hosted open-weights (MusicGen, AudioLDM, Stable Audio Open, MAGNeT, JukeBox-descendants) — full control, custom fine-tuning, on-prem option. Quality is competitive for instrumental music; vocals lag the hosted leaders meaningfully.

Other axes:

Token autoregressive vs. diffusion. Token AR (MusicGen, MusicLM, and reportedly Suno, whose architecture is not fully public) dominates for music with vocals. Diffusion on mels or latents (AudioLDM, Stable Audio) is strong on instrumental music and sound effects, faster per-clip, less natural for long-form vocals.
Latency vs. quality. Real-time generation for interactive apps (game audio, live performance) needs sub-second first-byte; high-quality offline generation tolerates tens of seconds.
Stereo vs. mono. Higher-quality models produce stereo natively. Older or cheaper models produce mono; stereo widening in post is a workaround with audible artifacts on critical listening.
Sample rate. 16 kHz models exist but sound muffled. 24 kHz is the practical floor; 44.1 kHz is the production target. 48 kHz is overkill for most music but standard for video deliverables.

Quality and evaluation

Music evaluation is qualitative by nature, but the toolkit has matured:

Human listening tests. Mean Opinion Score (MOS) and comparative MOS (CMOS), as in TTS. The gold standard; expensive; necessary for major decisions.
Automated metrics with limits.
- FAD (Fréchet Audio Distance). Population-level distributional similarity to reference audio. Standard for music model benchmarks; insensitive to per-clip quality.
- CLAP-Score. Cosine similarity between prompt and generated clip’s joint embedding. Cheap; useful for ranking; weak on structural quality.
- CLAP-MAP / VGGish-based metrics. Variants on the same idea.
Structural and musical metrics.
- Tempo accuracy (does the output hit the requested BPM?).
- Key conformance (does it stay in the requested key?).
- Section structure plausibility (verse-chorus-verse vs. random noise).
Copyright and similarity checks.
- Audio fingerprinting against catalogs (similar to Shazam-style identifications) on output.
- Stem-level similarity for “is this drum loop too close to an existing recording?”.
Production telemetry. Save rate, share rate, regenerate rate, in-app rating, attribution complaint rate. Skip rate at specific timestamps often correlates with quality drops at song sections.
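For reference, FAD is usually defined by fitting a Gaussian to the embeddings of a reference set and of a generated set, then taking the Fréchet distance between the two. In the formula below, the symbols are notation introduced here: \mu_r, \Sigma_r are the mean and covariance of the reference embeddings and \mu_g, \Sigma_g those of the generated embeddings. The embedding model (VGGish in the original formulation) is a choice that changes the absolute numbers.

```latex
\mathrm{FAD} \;=\; \lVert \mu_r - \mu_g \rVert_2^{2}
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Because only means and covariances enter the formula, two sets can score well while individual clips are poor, which is the per-clip insensitivity noted in the list.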

For voice-bearing models, additional dimensions: vocal naturalness (separate MOS), lyric intelligibility (ASR-based WER on the vocal stem), pitch accuracy (does it stay on key?).
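Lyric intelligibility via WER can be sketched as a word-level edit distance between the intended lyrics and an ASR transcript of the vocal stem. The transcript is assumed to come from some external ASR system, which is not shown; the function name is hypothetical and the normalization (lowercasing, whitespace split) is deliberately minimal.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference lyrics must not be empty")
    # prev[j] = edits needed to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


lyrics = "hold me close tonight"
asr_transcript = "hold me clothes tonight"  # hypothetical ASR output
print(word_error_rate(lyrics, asr_transcript))
```

Sung vocals stretch and ornament words in ways ASR models trained on speech handle poorly, so absolute WER values run high. The metric is most useful for comparing models or checkpoints against each other on the same lyrics.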

Common pitfalls

No loudness normalization on output. Generated tracks are often perceptually quiet or wildly inconsistent in level. A LUFS normalization pass is essentially free and dramatically improves user experience.
Treating music as TTS. Music has multi-track structure, harmonic content, and long-form sections that TTS models don’t need. Architectures and conditioning that work for speech often fall over on music.
Skipping the codec quality audit. The ceiling on audio quality is the codec’s reconstruction fidelity. If the codec sounds bad on real audio, the model can’t fix it.
Prompt-style artist names. “In the style of [famous artist]” is a copyright minefield. Production policies typically forbid living-artist names in prompts.
No watermark. Audio provenance regulation is rising; absent watermarks cause expensive retrofits and complicate compliance.
Chunk-and-extend without smoothing. Section boundaries audible as clicks or pitch jumps. Crossfade and key-match across chunk boundaries.
Lyric phonemization gaps in non-English. Models trained primarily on English vocals produce mis-articulated vocals in other languages. Audit per language.
Ignoring tempo and meter constraints downstream. Music delivered to video pipelines must conform to picture timing. Round trips on this waste days.
Single-take generation as the only mode. Producers expect to regenerate sections, swap stems, and stitch takes. Build the editing surface, not just one-shot generation.
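For the chunk-boundary pitfall, a crossfade over a short overlap is the usual first fix. The sketch below uses an equal-power (sine/cosine) fade on mono sample lists; the function names are hypothetical, and key-matching across the boundary is a separate problem not covered here.

```python
import math


def fade_gains(i: int, overlap: int) -> tuple[float, float]:
    """Equal-power (fade_out, fade_in) gains for position i in the overlap."""
    theta = (i + 0.5) / overlap * math.pi / 2
    return math.cos(theta), math.sin(theta)


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join chunk a to chunk b, blending the last `overlap` samples of a with the first of b."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be positive and no longer than either chunk")
    blended = []
    for i in range(overlap):
        gain_out, gain_in = fade_gains(i, overlap)
        blended.append(a[len(a) - overlap + i] * gain_out + b[i] * gain_in)
    return a[:-overlap] + blended + b[overlap:]


chunk_a = [1.0] * 100
chunk_b = [1.0] * 100
joined = crossfade(chunk_a, chunk_b, overlap=20)
print(len(joined), round(max(joined), 3))
```

Equal-power fades keep perceived loudness steady when the two chunks are uncorrelated. When the overlap regions are nearly identical, as in this example with two constant signals, the sum bumps up by as much as 3 dB, and a linear fade is the better choice.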

Why Riffusion was a useful detour even though it didn't win

Riffusion (2022) fine-tuned Stable Diffusion on spectrogram images: treat music as a 2D picture of frequency over time and generate it with an image diffusion model. The output was a magnitude spectrogram, converted back to audio with approximate phase reconstruction (Griffin-Lim) rather than a learned decoder. It worked, partially. The quality was charming but rough; long-range structure was poor; vocals were essentially impossible. But it taught the field a useful lesson: audio representations matter more than model architecture for music quality. The follow-on systems (MusicLM, MusicGen, Suno) abandoned spectrograms for token-based representations because tokens preserve fine-grained content better, but the conceptual move, “find a representation in which generation is tractable, then generate”, was Riffusion’s contribution. Modern music models stand on token-based codecs that learned the same lesson and went further: compress audio aggressively while preserving perceptual quality, then generate the compressed form with the architecture you trust.