Text-to-Video Generation Systems — Gen AI

Use cases#

Text-to-video systems turn a prompt — and increasingly an input image, an audio track, or a control signal — into a short video clip. The 2024-2026 generation (OpenAI’s Sora, Google’s Veo 2/3, Runway’s Gen-3/4, Kling, Hailuo, Pika, Luma’s Dream Machine) produces 5 to 60 second clips at 720p to 4K with motion coherent enough for short marketing and entertainment use.

The shapes that recur:

Short-form social and ad content. 5 to 15 second clips for TikTok, Reels, YouTube Shorts. Volume is high; quality bar is “feels like a real shot”; cost is the constraint.
Storyboarding and animatics. Pre-production for film, animation, advertising. The model’s output is a working draft for human review.
Product mockups and demos. Short video demonstrating a product in use, generated from stills plus a description. Lower fidelity than live-action.
Music videos and lyric videos. Generated visuals matched to existing audio. Volume varies; consistency across cuts is the engineering challenge.
Game cinematics and procedural content. Cutscenes generated on-demand for player-specific context. Latency budget is offline (minutes), not real-time.
Visual effects extensions. Inpainting and outpainting existing footage, generating B-roll, retiming, de-aging. Adjacent to traditional VFX pipelines.

Text-to-video is a poor fit when frame-accurate physics is required, when characters must remain identical across many shots without explicit conditioning, when in-clip text must be legible (still a hard problem), or when broadcast-grade compliance matters (provenance metadata is helping but not solving the audit problem).

System overview#

A modern text-to-video pipeline shares much of its architecture with text-to-image but adds a time dimension that dominates the compute cost:

[User prompt + optional conditioning]
    │  - reference image (image-to-video)
    │  - control signals (depth video, pose video, motion brushes)
    │  - audio (for music videos / lipsync)
    ▼
[Prompt preprocessing]
    - safety filter
    - prompt expansion / negative prompts
    - scene-level prompt vs. shot-level prompts
    │
    ▼
[Text and conditioning encoders]
    - T5 / CLIP for prompt
    - image encoder for keyframe (img2vid)
    - audio encoder for sound-conditioned video
    │
    ▼
[Spatiotemporal latent sampler]
    - operates on a tensor: (frames x latent_h x latent_w x channels)
    - 20-50 denoising steps
    - each step: DiT or U-Net with 3D / temporal attention
    - classifier-free guidance
    │
    ▼
[Latent video decoder]
    latent video -> pixel frames (often a 3D VAE)
    │
    ▼
[Temporal upsampling]
    e.g., 12 fps base -> 24 / 30 fps via frame interpolation
    │
    ▼
[Super-resolution]
    720p base -> 1080p / 4K via video SR model
    │
    ▼
[Postprocessing]
    - color grading, stabilization
    - safety classifier on frames
    - audio sync (if applicable)
    - watermark / C2PA provenance
    │
    ▼
[MP4 output]

The latent at each denoising step spans three axes (width × height × time) plus channels. A 5-second clip at 24 fps is 120 frames; with a typical 8× compression ratio in space and 4× in time, 512×512 frames and 16 latent channels, that is roughly 30 x 64 x 64 x 16 = ~2 million latent scalars, about 30 times more than a single image latent of the same size, and the multiple grows with clip length and resolution. Each denoising step is a forward pass over the whole spatiotemporal tensor.
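
The latent-size arithmetic can be checked in a few lines. This is a minimal sketch, assuming 512×512 frames (which gives a 64×64 latent grid at 8× spatial compression), 16 channels, and plain integer division on every axis; the function names are illustrative, and real 3D VAEs may pad or treat the first frame specially. It prints the count for both 4× and 2× temporal compression, since the total scales linearly with the number of latent frames.

```python
def latent_shape(seconds, fps, height, width,
                 spatial_factor=8, temporal_factor=4, channels=16):
    """Return (latent_frames, latent_h, latent_w, channels) for a clip.

    Assumes plain integer division on every axis (an assumption of this
    sketch, not a property of any particular VAE).
    """
    frames = seconds * fps
    return (frames // temporal_factor,
            height // spatial_factor,
            width // spatial_factor,
            channels)


def scalar_count(shape):
    """Multiply out the dimensions of a latent shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# 5 s at 24 fps, 512x512 frames (the frame size is an assumption).
shape_4x = latent_shape(5, 24, 512, 512, temporal_factor=4)
shape_2x = latent_shape(5, 24, 512, 512, temporal_factor=2)
image_latent = scalar_count((1, 64, 64, 16))

print(shape_4x, scalar_count(shape_4x))        # (30, 64, 64, 16) 1966080
print(shape_2x, scalar_count(shape_2x))        # (60, 64, 64, 16) 3932160
print(scalar_count(shape_4x) // image_latent)  # 30
```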

Key components#

Spatiotemporal latents#

Modern video models compress along three axes:

Spatial. 8× downsampling per side, same as image latents.
Temporal. 4× to 8× downsampling: a 4-second clip at 24 fps (96 frames) becomes 12 to 24 latent frames.
Channel. 8 to 32 latent channels.

The 3D VAE that does this compression is the unsung hero of modern video. It is trained on huge amounts of video and learns to compress motion as well as appearance. A poor 3D VAE makes everything downstream wobble.

Diffusion backbone: 3D U-Net vs. DiT#

Two architectural families:

3D U-Net with separable spatial and temporal attention. Earlier video models (AnimateDiff, ModelScope, early Stable Video Diffusion). Compute-cheaper per parameter; quality plateaued.
Diffusion Transformer (DiT) on flattened spatiotemporal patches. Sora popularized this; Veo, Kling, and most 2024+ frontier models use it. Scales much better with parameters; needs more memory and compute per step.

DiT models treat video as a long sequence of spatiotemporal patches and apply full attention across them. This is why context length (the number of patches in the sequence) is the binding compute constraint: quadratic attention over 50,000+ tokens per clip is brutal.
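
To make the quadratic cost concrete, the sketch below counts tokens and attention-matrix entries for a clip. The 1×2×2 patch size and the latent dimensions are assumptions chosen for illustration, not the configuration of any named model.

```python
def token_count(latent_frames, latent_h, latent_w,
                patch_t=1, patch_h=2, patch_w=2):
    """Number of spatiotemporal patches (tokens) in one clip."""
    return ((latent_frames // patch_t)
            * (latent_h // patch_h)
            * (latent_w // patch_w))


def attention_pairs(tokens):
    """Entries in one full self-attention score matrix (per head, per layer)."""
    return tokens * tokens


# 30 and 60 latent frames on a 64x64 latent grid (illustrative sizes).
short_clip = token_count(30, 64, 64)
long_clip = token_count(60, 64, 64)

print(short_clip, attention_pairs(short_clip))  # 30720 943718400
print(long_clip, attention_pairs(long_clip))    # 61440 3774873600
print(attention_pairs(long_clip) / attention_pairs(short_clip))  # 4.0
```

Doubling the clip length doubles the token count but quadruples the attention work, which is the scaling the paragraph above describes.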

Classifier-free guidance and motion guidance#

CFG works the same as in image diffusion. Some models add a separate “motion guidance” scale that controls how energetically things move. Low motion guidance gives static, dreamy clips; high motion guidance gives aggressive camera moves and busy scenes.
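
The standard classifier-free guidance combination can be sketched as follows. Plain Python lists stand in for latent tensors here, and `cfg_combine` is an illustrative name. Motion guidance is model-specific and is not shown.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and move toward (and past) the conditional one.

    uncond, cond: noise predictions as equal-length lists of floats
    (stand-ins for full latent tensors).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

print(cfg_combine(uncond, cond, 1.0))  # [1.0, 1.0, 0.0]  scale 1 = conditional only
print(cfg_combine(uncond, cond, 7.5))  # [7.5, 1.0, 6.5]
```

In a sampler this runs at every denoising step, which is why guidance typically requires two forward passes (or a doubled batch) per step.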

Conditioning signals#

Beyond text, modern systems accept:

Image-to-video. A start frame, optionally an end frame. Cheap to author; gives strong control over scene composition; constrains the model’s creativity.
Reference video (vid2vid). A driving video that supplies motion or style. Used heavily for stylization and dance generation.
Pose video. Per-frame skeleton control. The output respects the pose sequence — used for character animation.
Camera trajectory. A specified 3D camera path. Some models accept this natively; others approximate it via prompt.
Audio conditioning. Audio-driven generation produces visuals that beat-match or lip-sync to an input track.
Mask / inpainting. Edit a region across all frames consistently. Hard problem because temporal consistency at mask boundaries breaks if not modeled directly.

Long-clip strategies#

Generating long video directly is expensive — quadratic attention in time. Strategies:

Autoregressive chunking. Generate the first chunk; condition the next chunk on the tail of the previous. Style and identity drift over many chunks; modern systems mitigate with explicit context tokens.
Hierarchical generation. A low-frame-rate “keyframe” pass establishes long-range structure; an interpolation pass fills in intermediate frames. Used by some 2024-2026 systems for clips longer than 30 seconds.
Plan-then-generate. An LLM first writes shot descriptions for a long sequence; each shot is generated independently with shared character / scene tokens for consistency. The current approach for “minute-long with multiple scenes” workflows.
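
As a rough illustration of autoregressive chunking, the sketch below only plans the frame ranges: each chunk after the first overlaps the tail of the previous one, and those overlapping frames would serve as conditioning. The function name, chunk length, and overlap are assumptions; the generation call itself is left out.

```python
def plan_chunks(total_frames, chunk_len, overlap):
    """Return (start, end) frame ranges, end exclusive.

    Each chunk after the first re-uses the last `overlap` frames of the
    previous chunk as conditioning, so it adds at most
    chunk_len - overlap new frames.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start = end - overlap
    return chunks


print(plan_chunks(total_frames=240, chunk_len=96, overlap=16))
# [(0, 96), (80, 176), (160, 240)]
```

A larger overlap gives the next chunk more context to hold style and identity steady, at the cost of generating fewer new frames per chunk.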

Frame coherence#

The hardest sub-problem. Without explicit constraints, video models drift: a character’s shirt color changes, an object teleports, water flows uphill. Modern systems address coherence through:

3D / spatiotemporal attention that lets every patch attend to every other patch across space and time.
Strong motion priors learned from huge video corpora.
Reference frame injection at sample time, anchoring identity.
Optical-flow-aware losses during training that penalize implausible motion.

Implementation patterns#

Image-to-video as the default workflow#

In production, “text-to-video” is rarely pure text. Most pipelines start with a text-to-image generation for the keyframe (or a user-supplied image), then image-to-video for motion. This gives users much more control — they can iterate on composition cheaply with images, then commit to a video generation only when the frame is right.

Prompt expansion with motion specifics#

The user’s prompt is often static (“a forest at dawn”). A small LLM expands it with motion language (“a forest at dawn, mist drifting through the trees, light slowly brightening from the east, leaves swaying gently in a breeze”). Quality improves dramatically; consistency depends on the expansion model.

Temporal upsampling instead of high-fps generation#

Generate at 12 fps; interpolate to 24 fps or 30 fps with a dedicated video frame interpolation model (FILM, RIFE, GIMM-VFI, AMT). Generating at 12 fps instead of 24 fps halves the number of frames the diffusion model has to produce without halving perceived quality.
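
The sketch below shows where each output frame falls between source frames when going from a low base frame rate to a higher one. It only computes the schedule; a learned interpolator such as the ones named above would synthesize the frame at each fractional position. The function name is illustrative, and integer frame rates are assumed.

```python
def interpolation_schedule(num_src_frames, src_fps, out_fps):
    """For each output frame return (left_index, right_index, t), where
    t in [0, 1) is the position between the two source frames.

    Uses integer arithmetic, so src_fps and out_fps must be integers.
    """
    last = num_src_frames - 1
    num_out = last * out_fps // src_fps + 1
    schedule = []
    for i in range(num_out):
        left = i * src_fps // out_fps            # source frame at or before i
        t = (i * src_fps % out_fps) / out_fps    # fractional offset past it
        right = min(left + 1, last)
        schedule.append((left, right, t))
    return schedule


for row in interpolation_schedule(num_src_frames=3, src_fps=12, out_fps=24):
    print(row)
# (0, 1, 0.0)
# (0, 1, 0.5)
# (1, 2, 0.0)
# (1, 2, 0.5)
# (2, 2, 0.0)
```

Entries with `t == 0.0` are original frames passed through; the rest are frames the interpolation model has to invent. At 12 fps to 30 fps the offsets are no longer all 0.5, so the interpolator must support arbitrary `t`.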

Cascade: low-res, then high-res#

Generate at 480p or 720p; upscale to 1080p or 4K with a video SR model. This is typically much cheaper than generating at high resolution directly, and the SR model can specialize in texture detail.
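
A back-of-the-envelope comparison shows why the cascade pays off. This sketch assumes that token count scales with pixel count (fixed compression and patch size) and that full attention scales with the square of the token count; both are simplifications, and the names are illustrative.

```python
RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
}


def pixel_ratio(target, base):
    """How many times more pixels per frame `target` has than `base`."""
    tw, th = RESOLUTIONS[target]
    bw, bh = RESOLUTIONS[base]
    return (tw * th) / (bw * bh)


for target in ("1080p", "4K"):
    r = pixel_ratio(target, "720p")
    print(f"{target} vs 720p: {r:.2f}x pixels, ~{r * r:.0f}x full-attention pairs")
# 1080p vs 720p: 2.25x pixels, ~5x full-attention pairs
# 4K vs 720p: 9.00x pixels, ~81x full-attention pairs
```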

Cut-aware planning for multi-shot sequences#

For sequences with multiple shots, generate each shot independently but pass a “scene bible” (character descriptions, environment, time of day) through all of them. Cuts are added in post. Modern long-form pipelines look more like film production than like image generation — director’s pass, shot-by-shot generation, edit.
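
One simple way to carry a scene bible through every shot is to prefix each shot prompt with the shared description. The schema and field names below are hypothetical; production systems may pass shared character or scene tokens or reference images rather than text alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneBible:
    """Shared context repeated in every shot prompt (hypothetical schema)."""
    characters: str
    environment: str
    time_of_day: str


def shot_prompt(bible, shot_description):
    """Prefix one shot's action with the shared scene bible."""
    return (f"{bible.characters}. {bible.environment}, {bible.time_of_day}. "
            f"Shot: {shot_description}")


bible = SceneBible(
    characters="A courier in a red rain jacket",
    environment="narrow neon-lit market street",
    time_of_day="night",
)
shots = [
    "wide shot, courier weaves through the crowd",
    "close-up, courier glances back over her shoulder",
]
prompts = [shot_prompt(bible, s) for s in shots]
for p in prompts:
    print(p)
```

Each prompt is then sent as an independent generation, and the cuts between the resulting clips are made in the edit.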

Lip-sync conditioning#

For dialogue scenes, audio is fed into the model alongside text and an image. The model generates frames whose mouth movement matches the audio’s phoneme structure. Standalone lip-sync models (Wav2Lip, MuseTalk) were the standard approach through 2024; integrated audio-conditioned video models are catching up but have not fully replaced them.

Trade-offs#

Hosted API (Sora, Veo, Runway, Kling, Hailuo, Pika, Luma) — top quality, ready-to-ship, opinionated safety. Per-second pricing in the dollars-per-minute range. Latency from queue + generation is often 30 seconds to several minutes per clip.

Self-hosted open-weights (Stable Video Diffusion, CogVideoX, HunyuanVideo, Mochi, LTX-Video, Wan) — full control, custom adapters, on-prem option. Heavy capex on GPUs; quality gap to the frontier APIs is real but closing each quarter.

Other axes:

Length vs. quality. Most models lose quality as length grows: a 5-second clip looks great, a 60-second clip drifts. Set the length budget per product feature rather than globally.
Camera control vs. creativity. Tighter camera-trajectory control narrows the model’s freedom and reduces “surprise” quality. Loose camera lets the model find a good shot but may misframe the subject.
Frame rate vs. perceived smoothness. 12 fps + interpolation looks almost as good as 24 fps native and costs half. 8 fps native + interpolation is the sweet spot for some current models.
Diffusion vs. autoregressive. Diffusion dominates current video (DiT-based). Autoregressive video models on discrete video tokens (similar to audio codecs but for video) are an active research line — VideoPoet, Loong, OpenSora hybrids. Compute trade-offs differ; production not yet there.

Quality and evaluation#

Video evaluation is hard, and no single metric settles it. Established practices:

Automated metrics with caveats.
- FVD (Fréchet Video Distance). Distributional similarity to reference videos. Population-level; insensitive to single-clip artifacts.
- CLIP-Score per frame. Cheap; doesn’t capture motion.
- VBench, EvalCrafter, T2V-CompBench. Multi-axis benchmarks scoring object class, motion coherence, scene composition, etc. Closer to useful than single metrics.
- Optical flow consistency. Per-pixel motion smoothness; catches jitter and teleport artifacts.
Human evaluation. Side-by-side preference is the gold standard. Expensive at scale; necessary for major decisions.
Specialized quality dimensions.
- Temporal consistency (does the dog stay the same dog?).
- Physical plausibility (does water flow downhill, do feet contact ground, do objects fall correctly?).
- Aesthetic appeal (camera framing, lighting, composition).
- Prompt adherence (did the model generate what was asked?).
Safety classifiers across frames. Per-frame and per-clip classifiers for NSFW, violence, public-figure, brand-impersonation, and CSAM. Per-frame catches localized issues; per-clip catches scene-level context (e.g., violence implied by editing).
Production telemetry. Regeneration rate, save rate, share rate, time-to-acceptable-clip, cost per saved output.
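
The production telemetry numbers are simple ratios over generation logs. A minimal sketch, assuming each log record carries two boolean fields, `regenerated` and `saved`, and a flat cost per generation; the record format is hypothetical.

```python
def telemetry_summary(generations, cost_per_generation):
    """Summarize a list of generation records.

    Each record is a dict with boolean fields 'regenerated' (the user
    asked for another take) and 'saved' (the user kept the clip).
    """
    total = len(generations)
    if total == 0:
        return {"regeneration_rate": 0.0, "save_rate": 0.0,
                "cost_per_saved_output": None}
    regenerated = sum(g["regenerated"] for g in generations)
    saved = sum(g["saved"] for g in generations)
    return {
        "regeneration_rate": regenerated / total,
        "save_rate": saved / total,
        # Spend on every generation is charged against the clips users kept.
        "cost_per_saved_output": (total * cost_per_generation / saved
                                  if saved else None),
    }


log = [
    {"regenerated": True, "saved": False},
    {"regenerated": True, "saved": False},
    {"regenerated": False, "saved": True},
    {"regenerated": False, "saved": True},
]
print(telemetry_summary(log, cost_per_generation=0.50))
# {'regeneration_rate': 0.5, 'save_rate': 0.5, 'cost_per_saved_output': 1.0}
```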

Public benchmarks (VBench in particular) are useful for tracking the frontier but heavily gamed; trust your own held-out evaluation more.

Common pitfalls#

Generating long clips when short clips with cuts would work. A series of 5-second shots edited together beats a 30-second single shot for almost every use case.
Skipping image-to-video and going text-to-video. Users can’t iterate cheaply on composition if every iteration is a video generation. Get the frame right first.
Ignoring frame rate downstream. Many models are advertised at 24 fps but natively generate 12 fps, reaching 24 fps only after interpolation. If the export pipeline doesn’t interpolate, video plays choppy in production.
No prompt expansion. Short prompts produce static, dreamlike clips with no clear motion. The expansion step is almost free and changes results meaningfully.
Treating all motion as good motion. High motion-guidance settings produce aggressive camera moves and busy scenes that look impressive in demos and exhaust viewers in actual ads.
No safety on intermediate frames. A clip can be safe at start and end frames and unsafe in the middle. Per-frame safety, not just first/last.
Watermarking forgotten on export. C2PA / SynthID / visible watermark policies vary by jurisdiction; bake them into the pipeline at generation, not after.
Single-shot generation when multi-shot is needed. Asking the model to generate “a chase scene with three cuts” gives one continuous shot that doesn’t cut. Generate shots separately and edit; this matches how film is actually made.
Lip-sync as an afterthought. Audio-conditioned video models still produce uncanny mouth movement on hard inputs. Plan for a lip-sync postprocess pass for dialogue.
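
The intermediate-frame safety pitfall in the list above is easy to demonstrate. In this sketch the per-frame scores are made-up classifier outputs and the 0.5 threshold is arbitrary; the function names are illustrative.

```python
def first_unsafe_frame(frame_scores, threshold=0.5):
    """Return the index of the first frame whose unsafe score reaches the
    threshold, or None if every frame passes."""
    for i, score in enumerate(frame_scores):
        if score >= threshold:
            return i
    return None


def endpoints_only_check(frame_scores, threshold=0.5):
    """The weaker check the pitfall warns about: first and last frame only."""
    return frame_scores[0] < threshold and frame_scores[-1] < threshold


scores = [0.02, 0.05, 0.91, 0.04, 0.03]  # made-up classifier outputs
print(endpoints_only_check(scores))  # True  (clip looks safe from its endpoints)
print(first_unsafe_frame(scores))    # 2     (but frame 2 is flagged)
```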

Why Sora's spatiotemporal patches mattered architecturally

Pre-Sora video models mostly bolted temporal layers onto image U-Nets: an image model with extra attention layers that look across frames. The result was video that looked like a sequence of similar images with motion sprinkled in: each frame independently plausible, the sequence implausible. Sora’s contribution was to treat video as a single 3D object and apply full attention to spatiotemporal patches across the entire clip. The model has no architectural distinction between “this is a different position in the frame” and “this is a different frame at the same position”; they’re all patches in the same sequence. The cost is enormous: attention is quadratic in the patch count, and a 5-second clip has tens of thousands of patches. The benefit is that physics, motion, and identity become learnable properties rather than properties enforced by per-frame post-processing. Most frontier video models since have adopted the same pattern in some form.