Text-to-Image Generation Systems — Gen AI

Use cases

A text-to-image system takes a natural-language prompt and emits a pixel image. Modern systems also accept conditioning signals — a reference image, a pose skeleton, a depth map, an existing image to edit — and produce something that respects both the text and the structural input.

The shapes that recur in production:

Marketing and creative assets. Product mockups, ad creatives, social media imagery, blog hero images. Volume is moderate; per-image quality matters; brand consistency is the hard part.
Game and 3D content pipelines. Concept art, texture generation, sprite generation, environment plates. Often the model’s output is a starting point for human artists, not a final asset.
Personalized imagery. Avatar generators, “you in different styles”, custom merchandise. Per-user prompts; high volume; needs a face/identity preserved across generations.
Image editing and inpainting. Remove backgrounds, change lighting, fill masked regions, restyle. Conditioning matters more than the text prompt; the prompt is often short.
Synthetic data for ML training. Generate labeled images for downstream classifiers when real labeled data is expensive or restricted.
Diagram and illustration generation. Lower volume; high specificity. Modern frontier models handle this poorly compared to photo-realistic scenes — text rendering inside images still trips even the best models.

Text-to-image is a poor fit when pixel-exact reproducibility is required, when the output must contain accurate text or numerals, when subject identity must be preserved across many generations without explicit conditioning, or when the legal status of training data forbids the use case.

System overview

A modern latent-diffusion text-to-image system is a multi-stage pipeline. The image never appears in pixel space until the very end:

[User prompt + optional conditioning]
    │
    ▼
[Prompt preprocessing]
    - safety filter
    - keyword expansion / negative prompt assembly
    - LoRA / style adapter selection
    │
    ▼
[Text encoder (CLIP / T5)]
    text -> embedding vectors
    │
    ▼
[Conditioning encoders (optional)]
    - ControlNet: pose, depth, edges -> control features
    - reference image: IP-Adapter -> identity features
    - mask: inpainting region
    │
    ▼
[Sampler loop in latent space]
    - start from noise (or noised input image for img2img)
    - N denoising steps (typically 20-50)
    - each step: U-Net or DiT predicts noise given (latent, t, text, control)
    - classifier-free guidance scales the conditioning
    │
    ▼
[VAE decoder]
    latent -> pixel image (e.g., 64x64x4 -> 512x512x3)
    │
    ▼
[Postprocessing]
    - super-resolution (optional)
    - face restoration (optional)
    - safety classifier
    - watermark / provenance metadata (C2PA)
    │
    ▼
[Image output]

The compute cost is dominated by the sampler loop. Each denoising step is one full forward pass through a large U-Net or diffusion transformer; at roughly 0.1 second per step, 30 steps come to about 3 seconds of sampling on a modern GPU, before text encoding and VAE decoding are added.
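The step arithmetic above can be sketched as a back-of-envelope estimator. This is a minimal sketch, not a benchmark: `estimate_latency_s` and its `fixed_overhead_s` parameter are hypothetical names introduced here, and the 0.1 second per step figure is the illustrative number from the text.

```python
def estimate_latency_s(steps: int, seconds_per_step: float,
                       fixed_overhead_s: float = 0.0) -> float:
    """Rough latency: one backbone forward pass per step plus fixed costs.

    fixed_overhead_s is a hypothetical lump sum for text encoding,
    VAE decoding, and postprocessing; measure it on your own stack.
    """
    if steps < 1 or seconds_per_step < 0 or fixed_overhead_s < 0:
        raise ValueError("steps must be >= 1 and times must be non-negative")
    return steps * seconds_per_step + fixed_overhead_s


hero = estimate_latency_s(steps=30, seconds_per_step=0.1)
preview = estimate_latency_s(steps=4, seconds_per_step=0.1)
print(f"30 steps: {hero:.1f} s, 4 steps: {preview:.1f} s")
```

Because the loop term scales linearly with step count, cutting steps is usually the first lever to pull when a latency budget is missed.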

Key components

Text encoder

The text encoder turns the prompt into a conditioning vector the diffusion model can attend to. Two patterns dominate:

CLIP-style encoders. The original Stable Diffusion pipeline used CLIP ViT-L. Light, fast, weak on long compositional prompts.
Large text encoders. Stable Diffusion 3 and Flux add T5-XXL alongside CLIP, and the original Imagen used T5-XXL alone. (SDXL, by contrast, pairs two CLIP-family encoders rather than T5.) T5-class encoders are much better at long prompts (“a man holding a sign that says X”) and at compositional reasoning (“a red cube to the left of a blue sphere”). The cost is more parameters and slower text encoding.

Modern systems often use both — CLIP for visual concept alignment, T5 for compositional understanding — and concatenate their embeddings.

Latent diffusion vs. pixel diffusion

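Early diffusion models worked directly in pixel space. Modern systems work in a learned latent space that is typically 8 times smaller per spatial dimension and, with 4 latent channels against 3 color channels, about 48 times smaller in total scalar count. Why this matters: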

A 1024x1024 pixel image is 3 million scalars. The latent is 128x128x4 = 65,536 scalars.
The diffusion U-Net only sees the latent. Compute scales with latent size, not pixel size.
A small VAE encoder/decoder maps between pixel and latent.
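The size comparison above can be checked with a few lines of arithmetic. This sketch assumes the common configuration of an 8x spatial downsample and 4 latent channels; other VAEs use different factors, and `latent_shape` and `scalar_count` are illustrative helper names.

```python
def latent_shape(height: int, width: int,
                 downsample: int = 8, channels: int = 4) -> tuple:
    """Shape of the latent for an image, assuming an 8x, 4-channel VAE."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (height // downsample, width // downsample, channels)


def scalar_count(shape: tuple) -> int:
    """Total number of scalars in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixels = scalar_count((1024, 1024, 3))           # RGB image
latent = scalar_count(latent_shape(1024, 1024))  # 128 x 128 x 4
ratio = pixels / latent
print(pixels, latent, ratio)
```

The ratio is what the denoising backbone saves on every step; the VAE pays its cost only once per image.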

This is the single biggest practical reason text-to-image went from “minutes on an A100” (2021) to “seconds on a consumer GPU” (2023+).

The denoising backbone: U-Net vs. DiT

U-Net — convolutional encoder-decoder with skip connections, attention layers in the deeper blocks. The dominant architecture from 2020 to 2023 (Stable Diffusion 1.5, 2.x, SDXL).
Diffusion Transformer (DiT) — pure transformer, treats the latent as a sequence of patches. Used in Stable Diffusion 3, Flux, and most 2024+ frontier models. Scales better with parameters; better compositional fidelity.
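To make "treats the latent as a sequence of patches" concrete, here is a minimal sketch of patchification on a nested-list latent. The patch size of 2 is an assumption chosen for illustration, and real models follow this step with a learned linear projection and positional embeddings, which are omitted here.

```python
def patchify(latent: list, patch: int = 2) -> list:
    """Split an H x W x C latent into a row-major sequence of flat patch tokens."""
    height, width = len(latent), len(latent[0])
    if height % patch or width % patch:
        raise ValueError("latent sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(latent[y][x])  # append all C channel values
            tokens.append(token)
    return tokens


# Toy 4 x 4 latent with 4 channels; each cell stores its own (y, x) for inspection.
toy = [[[y, x, 0, 0] for x in range(4)] for y in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

With the same patch size, a 128x128 latent becomes 64 x 64 = 4,096 tokens, which is the sequence length the transformer's attention layers operate on.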

Samplers and step count

Denoising amounts to numerically integrating an ODE or SDE from noise to image. The sampler is the discretization scheme:

DDPM: the original sampler; the most steps of the group (1,000 in the original formulation), so the slowest.
DDIM: deterministic; fewer steps (20 to 50), quality close to DDPM.
Euler / Euler-A / DPM++: fast samplers; 15 to 30 steps for production quality.
Distilled / consistency models: 1 to 8 steps. LCM, Turbo, Lightning variants. Trade some quality for huge latency wins. Used for real-time UI and high-volume pipelines.

Step count is the primary cost/quality knob. A typical production setting: 30 steps for hero assets, 4 to 8 distilled steps for previews and high-volume features.
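One way to picture the step-count knob: a few-step sampler visits only a subsequence of the timesteps the model was trained on. The sketch below uses uniform strided spacing and assumes 1,000 training timesteps; both are illustrative choices, and real samplers offer several spacing schemes.

```python
def sampling_timesteps(num_inference_steps: int,
                       num_train_timesteps: int = 1000) -> list:
    """Pick evenly strided timesteps, ordered from most noisy to least noisy."""
    if not 1 <= num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in [1, num_train_timesteps]")
    stride = num_train_timesteps // num_inference_steps
    ascending = [i * stride for i in range(num_inference_steps)]
    return ascending[::-1]


preview_schedule = sampling_timesteps(4)
hero_schedule = sampling_timesteps(30)
print(preview_schedule)
print(len(hero_schedule), hero_schedule[0], hero_schedule[-1])
```

Fewer steps means larger jumps between visited timesteps, which is why standard samplers lose quality at very low step counts and distilled models are trained specifically to tolerate them.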

Classifier-free guidance (CFG)

CFG scales how strongly the model follows the prompt. At each denoising step, the model runs twice — once with the text, once without — and the difference is amplified by a scale factor (typically 4 to 12).

Low CFG (1 to 3): looser, more creative, sometimes ignores the prompt.
Mid CFG (5 to 8): the production sweet spot for most models.
High CFG (12+): rigid prompt adherence, often over-saturated or burned images.

CFG doubles the per-step cost (two forward passes). Some distilled models train CFG into the weights and don’t need the second pass at inference.
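The "run twice and amplify the difference" rule can be written as a single line of arithmetic. This is a minimal sketch on plain Python lists standing in for the two noise predictions; `cfg_combine` is an illustrative name, and production code applies the same formula to tensors.

```python
def cfg_combine(eps_uncond: list, eps_cond: list, scale: float) -> list:
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # prediction without the text
eps_cond = [1.0, 1.0]    # prediction with the text
guided = cfg_combine(eps_uncond, eps_cond, scale=7.5)
print(guided)
```

A scale of 1 returns the text-conditioned prediction unchanged, and larger scales extrapolate past it along the direction the text suggests. That extrapolation is where the over-saturated look at very high scales comes from.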

ControlNet and adapters

ControlNet introduced structural conditioning: the diffusion model accepts an extra signal (canny edges, depth map, pose skeleton, segmentation mask) and produces an image that respects both the text and the structure. The mechanism is a parallel copy of the encoder that injects features into the main backbone.

ControlNet pose / openpose — generate an image of a person in a specified pose.
ControlNet depth — generate an image with a specified scene geometry.
ControlNet canny / lineart — generate from a sketch.
IP-Adapter — image prompt: generate something in the style or identity of a reference image.
LoRA — low-rank adapters for style or subject. Trained per-style; tiny (a few MB); composable.
Inpainting models — conditioned on a mask and a partial image; generate only inside the mask.

For most production products, the pipeline is base diffusion model + 1 or 2 adapters + occasionally a ControlNet. Stacking too many adapters is a quality killer.
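To show why LoRA adapters are small and composable, here is a minimal sketch of the low-rank update on a single weight matrix, using plain lists. The function names and the `alpha / rank` scaling convention are assumptions for illustration; libraries differ in where they apply the scale.

```python
def matmul(a: list, b: list) -> list:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_merge(weight: list, lora_b: list, lora_a: list,
               alpha: float, rank: int) -> list:
    """Return weight + (alpha / rank) * (B @ A), with B (out x r) and A (r x in)."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def lora_param_count(out_features: int, in_features: int, rank: int) -> int:
    """Parameters stored by the adapter for one layer."""
    return rank * (out_features + in_features)


merged = lora_merge([[0.0, 0.0], [0.0, 0.0]], [[1], [2]], [[3, 4]], alpha=1.0, rank=1)
print(merged)
print(lora_param_count(768, 768, rank=8), "adapter params vs", 768 * 768, "full")
```

Since each adapter is just an additive delta on the base weights, several can be summed onto the same layer, which is also how stacking too many of them ends up pulling the weights in conflicting directions.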

Implementation patterns

Two-stage generation: base + refiner

A first pass establishes the composition, often at lower resolution or with fewer steps. A second pass refines the result, at higher resolution or with a different (refiner) model. SDXL popularized this: the base model handles composition during the high-noise steps, and the refiner finishes the low-noise steps to add texture detail. Total cost is lower than running the refiner from scratch at high resolution.

Super-resolution upscaling

Generate at 1024x1024, then upscale to 2048x2048 or 4096x4096 with a dedicated SR model (ESRGAN family, latent upscalers). Much cheaper than generating directly at high resolution and often produces sharper output because the SR model is trained for detail.

Negative prompts

A separate prompt describing what not to generate (“blurry, low quality, watermark, extra fingers”). Implemented through CFG: the negative prompt’s embedding takes the place of the empty (unconditional) prompt, so guidance pushes the sample away from it. Heavily used in early Stable Diffusion; less critical in modern models that have better composition out of the box, but still common in production for brand-safety constraints.

Prompt augmentation

The user’s literal prompt is often short (“a cat”). A small LLM expands it before the text encoder (“a high-resolution photograph of a calico cat sitting on a windowsill, soft afternoon light, shallow depth of field, professional photography”). Quality improves dramatically on short inputs; consistency drops if the augmentation is too aggressive.
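A minimal sketch of a guarded augmentation step follows. Everything here is hypothetical scaffolding: `template_expand` is a stand-in for the small LLM call, and the word-count thresholds and the "keep the original prompt text" check are illustrative heuristics for limiting over-aggressive rewrites, not tuned values.

```python
from typing import Callable


def augment_prompt(prompt: str, expand: Callable[[str], str],
                   min_words: int = 6, max_words: int = 60) -> str:
    """Expand short prompts only, and keep the result anchored to the original."""
    if len(prompt.split()) >= min_words:
        return prompt  # already descriptive; leave the user's wording alone
    expanded = expand(prompt)
    if prompt.lower() not in expanded.lower():
        return prompt  # expansion dropped the user's subject; fall back
    return " ".join(expanded.split()[:max_words])  # cap the length


def template_expand(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical); a real system would call a model."""
    return f"a high-resolution photograph of {prompt}, soft natural light"


short = augment_prompt("a cat", template_expand)
long_prompt = "a calico cat sitting on a windowsill at dusk"
unchanged = augment_prompt(long_prompt, template_expand)
print(short)
print(unchanged)
```

Leaving long prompts untouched is one simple way to protect consistency: users who already wrote a detailed prompt get exactly what they asked for.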

Identity preservation across generations

Many products need the same character across multiple images (a user’s avatar in different scenes). Two approaches:

Per-user LoRA. Fine-tune a small adapter on 5 to 20 reference images of the subject. Trained in a few minutes; identity is preserved well; per-user storage cost.
IP-Adapter / face-embedding conditioning. Encode the reference image once at inference. No training; identity preservation is weaker than LoRA but operationally much simpler.

Trade-offs

Hosted API (DALL·E, Midjourney, Imagen, Firefly, Flux Pro) — top quality, fastest path to ship, per-image cost. Limited control over weights, adapters, and fine-tuning. Provider safety policies bind you.

Self-hosted open-weights (Stable Diffusion, Flux dev, SDXL) — full control, on-prem option, custom LoRA / ControlNet stacks. Capex on GPUs, ops burden, and the quality gap is real on the hardest prompts.

Other axes:

Latency vs. quality. 30-step samples take 3 to 6 seconds on a single H100; 4-step distilled samples take under a second. The choice is per-feature: hero assets get 30 steps; preview tiles get 4.
Single image vs. batch. Diffusion sampling on a GPU is much more efficient per-image when batched (4 to 8 images at once). Real-time UX wants single-image streaming previews; batch APIs serve creative workflows.
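Resolution vs. cost. Pixel and latent counts grow with the square of the side length, so 2048x2048 costs at least 4 times what 1024x1024 does, and more where self-attention (quadratic in token count) dominates. Generate small, upscale later.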
Photorealism vs. stylized. Photoreal models (Midjourney v6, Imagen 3, Flux Pro) and stylized models (Niji, anime-tuned SDXL variants) are often different checkpoints. Routing per prompt intent beats one general model.
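The resolution trade-off in the list above comes down to simple scaling arithmetic. The sketch below separates the two regimes; `relative_cost` is an illustrative helper, and the real multiplier for a given model falls somewhere between the two figures depending on how much of its compute is attention.

```python
def relative_cost(side: int, base_side: int = 1024) -> dict:
    """Cost multipliers for a square image relative to base_side.

    'per_token' covers work that scales with latent size (convolutions, MLPs);
    'attention' covers self-attention, which scales with tokens squared.
    """
    if side < 1 or base_side < 1:
        raise ValueError("sides must be positive")
    token_ratio = (side / base_side) ** 2
    return {"per_token": token_ratio, "attention": token_ratio ** 2}


cost_2048 = relative_cost(2048)
print(cost_2048)
```

This is the arithmetic behind "generate small, upscale later": doubling the side length costs at least 4 times as much in the sampler, while a dedicated upscaler runs once.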

Quality and evaluation

Image generation evaluation is harder than text because human aesthetic judgment is fuzzy and references rarely exist. The pieces that work:

Automated metrics with limits.
- FID (Fréchet Inception Distance) — distance between generated and reference image distributions. Population-level; doesn’t grade individual prompts.
- CLIP-Score — cosine similarity between the prompt’s CLIP embedding and the image’s CLIP embedding. Cheap, useful for ranking, weak on compositional prompts.
- DSG (Davidsonian Scene Graph) / TIFA — decompose the prompt into questions (“is there a cat?”, “is the cat sitting?”), use a VLM to answer them on the generated image. The best automated proxy for prompt adherence.
Human evaluation. Side-by-side preference (two images, pick the better one) is the gold standard. Sized for statistical significance: a few hundred pairs per model variant. Expensive; necessary for major changes.
Safety classifiers. Per-category (NSFW, violence, hate, public-figure, brand-impersonation). Precision-recall curves per class; production thresholds set per use case.
Production telemetry. Regeneration rate (users hit “try again”), save rate, share rate, refusal rate, latency p50 / p95. Regeneration rate is often the most actionable single metric.
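Three of the measurements above reduce to short formulas, sketched here under stated assumptions. The embeddings and VLM answers are taken as given inputs (no model is called), published CLIP-Score variants add their own scaling and clipping on top of the raw cosine, and the preference margin uses a normal approximation with z = 1.96 for a rough 95% interval.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Raw cosine between a prompt embedding and an image embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def qa_adherence(answers: list) -> float:
    """TIFA/DSG-style score: fraction of prompt questions the VLM answered 'yes'."""
    if not answers:
        raise ValueError("need at least one question")
    return sum(1 for a in answers if a) / len(answers)


def win_rate_margin(wins: int, total: int, z: float = 1.96) -> tuple:
    """Win rate and normal-approximation margin for side-by-side preference."""
    if total < 1 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total >= 1")
    rate = wins / total
    return rate, z * math.sqrt(rate * (1 - rate) / total)


adherence = qa_adherence([True, True, False, True])
rate, margin = win_rate_margin(wins=165, total=300)
print(adherence, round(rate, 3), round(margin, 3))
```

With 300 pairs and a 55% win rate, the interval still reaches below 50%, which illustrates why a few hundred pairs is a floor rather than a comfortable sample size for small quality differences.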

For brand-critical assets, the eval also includes “consistency suites”: fixed prompts run weekly across model versions to detect quality regressions.

Common pitfalls

Text rendering. Models are still poor at rendering legible text inside images. Don’t build a feature whose value depends on accurate in-image text without a typography-aware model or a postprocessing step that overlays real text.
Hands and feet. Improving but still failure-prone. Use ControlNet pose when hands matter; budget for face/hand restoration passes.
CFG too high. Easy to crank up looking for “more prompt-following” and end up with burned, oversaturated images. Stay in 5 to 8 for most pipelines.
Long prompts that contradict. Throwing every desired quality at the prompt (“photorealistic, anime, oil painting, watercolor, 8k”) produces incoherent output. Pick a style.
Skipping the safety classifier on edit features. Inpainting and img2img can generate content the user’s prompt didn’t ask for. Run safety on output, not just input.
No watermarking or provenance metadata. C2PA Content Credentials and visible watermarks are increasingly expected. Add them at generation time, not as an afterthought.
Letting users upload reference images without legal review. Identity preservation features that work on uploaded photos have real consent and policy implications. Build the consent flow before the feature.
Ignoring batch APIs. Per-image hosted cost differs by 30 to 50% between sync and batch endpoints. For non-interactive workflows, use batch.

Why 1024x1024 became the standard

Early Stable Diffusion shipped at 512x512 because that’s what the training compute budget allowed. SDXL trained natively at 1024x1024 and the quality jump was enormous — not because of resolution per se, but because the model had four times the spatial real estate to lay out composition. Most modern frontier image models train at 1024x1024 native, generate there, and upscale to higher resolutions with a separate model. Generating directly at 2048x2048 with a 1024-trained model is a common mistake — the model has never seen scenes at that scale, so composition breaks, faces multiply, and limbs grow. The pattern that works: generate at the model’s training resolution, upscale separately.
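The "generate at training resolution, upscale separately" pattern can be sketched as a small planning function. The assumptions are loud ones: a 1024 native side, a super-resolution model that doubles the side per pass, and a final downscale when the last pass overshoots the target. `plan_upscale` is a hypothetical helper, not part of any library.

```python
def plan_upscale(target_side: int, native_side: int = 1024,
                 sr_factor: int = 2) -> tuple:
    """Return (generation_side, sr_passes, side_after_sr) for a square target.

    Always generates at the native side; if side_after_sr exceeds the target,
    the caller is expected to downscale with an ordinary resize.
    """
    if target_side < 1 or native_side < 1 or sr_factor < 2:
        raise ValueError("sides must be positive and sr_factor at least 2")
    side, passes = native_side, 0
    while side < target_side:
        side *= sr_factor  # one pass of the assumed 2x super-resolution model
        passes += 1
    return native_side, passes, side


print(plan_upscale(4096))
print(plan_upscale(3000))
print(plan_upscale(512))
```

The key property is that the diffusion model never runs above its training resolution; everything beyond that is handed to a model trained for upscaling.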