Multimodal Models — Gen AI · Engineering Playbook

Summary

A multimodal model accepts (and often produces) more than one type of data (text plus images, audio, or video) through a shared backbone. The engineering trick is consistent across modalities: convert every input into a sequence of tokens or embeddings that live in a single representation space, then run the same transformer that already worked on text. CLIP (2021) made this idea practical by aligning image and text embeddings; Flamingo (2022) showed you could interleave them in a single generative model; GPT-4o, Gemini, and Claude 3 productionised it.

There are two main architectural patterns: late fusion (separate encoders per modality, embeddings combined near the top — CLIP, dual-encoder retrievers) and early fusion (modalities tokenised and fed into a single transformer from the start — Flamingo, GPT-4o, Gemini). Late fusion wins on retrieval and classification; early fusion wins on generation and reasoning across modalities.

Why it matters

In 2023, image-and-text was a frontier capability. By 2026, it’s table stakes: every flagship model is multimodal at the input level, and a growing number are multimodal at the output level too. Product teams expect to send an image and ask a question about it, send an audio clip and get a transcript with reasoning, send a screenshot and get UI automation steps. Engineering teams that don’t understand the multimodal pipeline ship products that get out-competed.

The pattern also matters because of where it’s not used yet. Most enterprise data is multimodal (PDFs with diagrams, spreadsheets with charts, support tickets with screenshots), but most enterprise AI pipelines still treat everything as text. The teams that build proper multimodal pipelines — vision + text retrieval, video understanding, audio + text reasoning — have a measurable quality advantage on real workloads.

How it works

The shared-token-space trick

The conceptual move that unlocks multimodality is treating every modality as a sequence of tokens or embedding vectors:

Text: tokenised via BPE / SentencePiece, lifted to embeddings.
Images: split into patches (typically 16x16 or 14x14 pixels), each patch encoded by a Vision Transformer (ViT) into a vector. A 224x224 image becomes 196 patch embeddings at 16x16 patches, or 256 at 14x14.
Audio: converted to a mel-spectrogram and patchified, or tokenised by a discrete audio codec (EnCodec, SoundStream).
Video: spatio-temporal patches (tubelets) or a sequence of image-patch sequences with temporal embeddings.
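The image arithmetic above can be checked with a minimal sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches; `patch_grid` and `patchify` are illustrative names, not part of any library, and a real ViT would follow this step with a learned linear projection and position embeddings.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols) of non-overlapping square patches covering the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def patchify(image, patch_size):
    """Split a 2-D image (list of rows) into flattened patches, in row-major order."""
    rows, cols = patch_grid(len(image), len(image[0]), patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [
                image[r * patch_size + dy][c * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


image = [[0] * 224 for _ in range(224)]  # placeholder single-channel 224x224 image
print(len(patchify(image, 16)))  # 196 patches, each 16*16 = 256 pixels
print(len(patchify(image, 14)))  # 256 patches, each 14*14 = 196 pixels
```

The sequence length the transformer sees is driven by the patch count, so halving the patch side roughly quadruples the number of visual tokens.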

Once everything is a sequence of embeddings, you can interleave them in a single transformer’s context. The model learns cross-modal attention as a side effect of the pretraining objective — text tokens attend to image patches, audio tokens attend to text, and so on.

CLIP and the alignment objective

OpenAI’s CLIP (Contrastive Language-Image Pretraining, 2021) was the first scalable demonstration that text and image embeddings could live in the same space. Two encoders — a ViT for images, a text transformer for captions — were trained on ~400M image-text pairs from the web. The loss was contrastive (InfoNCE): for each image in a batch, pull its caption’s embedding close and push the other captions’ embeddings away (and vice versa).
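A minimal sketch of the symmetric contrastive objective described above, in plain Python so the arithmetic is visible. It assumes each row of `image_embs` is paired with the same row of `text_embs`; the function names and the `temperature=0.07` default are illustrative assumptions, and a real implementation would use a tensor library and learn the temperature.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image-to-text and text-to-image, averaged."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)  # True: correctly paired batches score a lower loss
```

Every other caption in the batch acts as a negative for a given image, which is why large batches matter for this objective.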

The result was a similarity engine: given an image, you can find the closest caption; given text, you can find the closest image. CLIP itself doesn’t generate, it only embeds, but its embedding space underpinned the next generation of text-to-image models (Stable Diffusion uses CLIP’s text encoder for conditioning).

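SigLIP (2023) refined CLIP by replacing the batch-wide softmax with a pairwise sigmoid loss, which removes the need to normalise over the whole batch and holds up better at smaller batch sizes. Modern multimodal models often use SigLIP or a SigLIP-style encoder as their vision backbone.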
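For contrast with the softmax-based objective, here is a hedged sketch of a pairwise sigmoid loss in the SigLIP style. Each image-text pair is scored independently as a binary match or non-match, so no normalisation across the batch is needed. The `scale` and `bias` defaults are illustrative assumptions (in practice both are learned), and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -softplus(-x).
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of binary log-losses over all image-text pairs, averaged per image."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            label = 1.0 if i == j else -1.0  # +1 for the true pair, -1 otherwise
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


matched = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)  # True
```

Because each term depends only on one pair, the loss can be computed in chunks across devices without gathering a full similarity matrix first.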

Vision-language generation: Flamingo, LLaVA, GPT-4o

Generation models need more than embeddings — they need to write text conditioned on images. Two patterns:

Adapter / projector pattern (LLaVA-style): take a pretrained vision encoder (CLIP, SigLIP) and a pretrained language model. Train a small MLP “projector” that maps vision embeddings into the language model’s embedding space. Fine-tune on image-instruction pairs. Cheap, modular, and the default open-source approach in 2026: LLaVA, BakLLaVA, MiniGPT-4, and Qwen-VL all follow variants of this pattern.
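The projector idea can be sketched in a few lines. This is a toy, assuming a two-layer MLP with GELU and randomly initialised weights; the `Projector` class, the dimensions, and the stand-in embeddings are all hypothetical, and in a real system the weights are trained while the patch embeddings come from the vision encoder.

```python
import math
import random


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Apply y = W x + b, where weight has shape [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def random_matrix(rng, out_dim, in_dim):
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


class Projector:
    """Two-layer MLP mapping vision_dim -> lm_dim -> lm_dim."""

    def __init__(self, vision_dim, lm_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = random_matrix(rng, lm_dim, vision_dim)
        self.b1 = [0.0] * lm_dim
        self.w2 = random_matrix(rng, lm_dim, lm_dim)
        self.b2 = [0.0] * lm_dim

    def __call__(self, patch_embedding):
        hidden = [gelu(h) for h in linear(patch_embedding, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


VISION_DIM, LM_DIM = 8, 16  # toy sizes chosen for illustration only
projector = Projector(VISION_DIM, LM_DIM)

patch_embeddings = [[0.1] * VISION_DIM for _ in range(4)]  # stand-in for vision encoder output
text_embeddings = [[0.0] * LM_DIM for _ in range(3)]       # stand-in for LM token embeddings

visual_tokens = [projector(p) for p in patch_embeddings]
lm_input = visual_tokens + text_embeddings  # image tokens first, then the text prompt
print(len(lm_input), len(lm_input[0]))      # 7 16
```

The language model never sees pixels: it only sees extra embedding vectors of its own width, which is why the projector is the only new component that has to be trained from scratch.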

Native multimodal pretraining (Flamingo, Gemini, GPT-4o): train on interleaved text and image (and audio, and video) data so that cross-modal connections are learned during pretraining rather than bolted on afterwards. Flamingo did this by inserting gated cross-attention layers between a frozen vision encoder and a frozen language model; Gemini and GPT-4o are described by their developers as trained natively across modalities. More expensive to train but produces tighter cross-modal reasoning: the model can do things like “describe this chart and continue the analysis from where the chart ends” in a way that adapter-glued models struggle with.

Audio and speech: the third modality

Speech and audio entered the multimodal stack later than vision. The same patterns apply:

Speech recognition + text LM: the conservative approach. Whisper or a similar ASR model transcribes audio to text, and the LM reasons over the transcript. Loses prosody, tone, non-speech audio.
Native audio-in models (GPT-4o, Gemini 1.5): audio is tokenised, for example by a neural codec such as EnCodec or SoundStream, and fed directly into the transformer. The model can use tone, prosody, and background noise as well as the words, so it is much better at understanding emotion and acoustic context.
Native audio-out models (GPT-4o speech, Gemini Live): the model generates audio tokens that a codec decoder or vocoder turns into a waveform. Latency drops dramatically because there’s no separate TTS stage.

Cross-modal retrieval

A practical use case that pays for itself fast: search across modalities. With aligned embeddings (CLIP-style), you can:

Search image collections by text query.
Find similar images to a given image.
Retrieve documents by an image of the document (visual document search).
Build hybrid RAG that retrieves both text and images for the same query.

The retrieval encoder is usually different from the generation model — CLIP / SigLIP / E5-V / Cohere-Multimodal-Embed for retrieval, the frontier multimodal LLM for generation. The split mirrors text RAG: small fast encoder for retrieval, large smart model for answering.
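The retrieval half of that split can be sketched as a brute-force cosine search over precomputed image embeddings. The three-dimensional vectors below are made-up stand-ins for encoder output, and `top_k` is a hypothetical helper; a production system would swap the linear scan for an approximate nearest-neighbour index.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query_embedding, index, k=3):
    """Return the k (item_id, cosine_similarity) pairs closest to the query.

    index is a list of (item_id, unit-norm embedding) built offline.
    """
    q = l2_normalize(query_embedding)
    scored = [
        (item_id, sum(a * b for a, b in zip(q, emb)))
        for item_id, emb in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Offline: embed the catalogue once with the image encoder (values are invented).
catalogue = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "bike.jpg": [0.0, 0.2, 0.9],
    "kitten.jpg": [0.8, 0.3, 0.1],
}
index = [(name, l2_normalize(vec)) for name, vec in catalogue.items()]

# Online: embed the text query with the matching text encoder (value is invented).
text_query_embedding = [1.0, 0.2, 0.0]
for name, score in top_k(text_query_embedding, index, k=2):
    print(name, round(score, 3))  # cat.jpg ranks first, then kitten.jpg
```

The same `top_k` call serves image-to-image search: pass an image embedding as the query instead of a text embedding, since both live in the same space.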

Variants and trade-offs#

Late fusion (CLIP, dual-encoder) — separate encoder per modality, fused near the top. Fast (encoders can run independently), efficient for retrieval (precompute one side, query the other), strong on classification and similarity. Weaker on generation and fine-grained reasoning across modalities.

Early fusion (Flamingo, Gemini, GPT-4o) — modalities tokenised and interleaved in a single transformer. Strong on generation and cross-modal reasoning (“describe what’s wrong with this code in the screenshot”). Heavier compute, less efficient for retrieval, harder to train.

Other practical axes:

Modality count. Two-modality (vision-language) is well understood and broadly deployed. Three-plus (vision + language + audio + video) is still differentiating — GPT-4o, Gemini, and recent Claude releases support it; most open-weights models are vision-language only.
Generation reach. Most multimodal models accept many modalities as input but only emit text. True any-to-any models (text in, image out; image in, audio out) are rarer: they require an output decoder for each modality, which adds substantial model and training complexity per output type.
Resolution and detail. High-resolution image input (>1024x1024 effective pixels) is a meaningful capability gap between models. Document-understanding workloads in particular need high resolution to read fine text reliably.
Video understanding is still hard. Frame-by-frame sampling loses temporal coherence; native video tokenisation is compute-expensive; long videos blow context budgets quickly. Most “video understanding” in production is really sampled-frame understanding plus an audio transcript.
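A back-of-the-envelope sketch shows why long videos blow context budgets. It assumes uniform frame sampling and a fixed number of visual tokens per frame; both helper names and the example numbers (1 sampled frame per second, 256 tokens per frame) are assumptions for illustration, not figures from any particular model.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick frame indices spread uniformly, one from the middle of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def visual_token_budget(duration_s, sampled_fps, tokens_per_frame):
    """Visual tokens consumed when sampling sampled_fps frames per second."""
    return int(duration_s * sampled_fps) * tokens_per_frame


print(sample_frame_indices(100, 4))        # [12, 37, 62, 87]
print(visual_token_budget(600, 1, 256))    # 153600 tokens for a 10-minute clip
```

Even at one frame per second, a ten-minute clip costs six figures of tokens before any audio transcript or prompt is added, which is why sampled-frame pipelines usually cap the frame count rather than the frame rate.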

Why training on web-scraped image-text pairs works at all

CLIP trained on ~400M image-text pairs scraped from the web. Most of these pairs are noisy: the alt-text doesn’t really describe the image, the caption is generic, the image is duplicated, the language varies. By all rights it shouldn’t work. It does work because the noise mostly cancels in expectation: across hundreds of millions of pairs, the signal (that an image of a cat is more often near the word “cat” than near “bicycle”) wins. The lesson reappears at every scale in modern AI: large amounts of weakly-labelled data, trained with a contrastive or self-supervised objective, can produce stronger representations than smaller amounts of carefully-labelled data. The economic implication is enormous: training data for vision-language alignment is effectively unlimited, provided you can scale and clean it.

When this is asked in interviews

This is a senior question on AI-product, multimodal-engineering, and AI-platform loops. It also appears in research-leaning loops because the alignment-and-fusion question is still active research.

What the interviewer is checking:

Can you explain the shared-token-space pattern, with every modality reduced to a sequence of embeddings?
Do you understand late vs early fusion and when to pick each?
Can you design a real multimodal product pipeline, picking the right encoder, the right generator, and the right eval?

Common follow-ups:

“How would you build image search over a 10M-image catalogue?” Use a CLIP / SigLIP encoder, a vector store (Faiss, ScaNN, pgvector) with an HNSW index, and hybrid ranking with text metadata. Don’t reach for the frontier multimodal LLM at retrieval time: it is orders of magnitude more expensive per query than an embedding lookup.
“What’s the difference between OCR and a multimodal LLM reading text in an image?” OCR is deterministic and fast but loses layout context and can’t reason about content. A multimodal LLM understands meaning but is slower and may hallucinate text that isn’t there. Production stacks often combine the two: OCR for ground-truth text, LLM for reasoning.
“Why are video models so much harder than image models?” Temporal coherence, compute scaling with frame count, the lack of large clean video-text datasets, and the long-context shape that explodes attention cost.
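For the 10M-image search follow-up, a quick sizing sketch helps ground the answer. It assumes 512-dimensional embeddings and counts raw vector storage only; graph links in an HNSW index and any metadata add overhead on top. `flat_index_bytes` is a hypothetical helper, not a library call.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the embedding matrix, excluding index structures."""
    return num_vectors * dim * bytes_per_value


NUM_IMAGES = 10_000_000
DIM = 512  # assumed embedding width; check the encoder you actually deploy

float32_bytes = flat_index_bytes(NUM_IMAGES, DIM, bytes_per_value=4)
float16_bytes = flat_index_bytes(NUM_IMAGES, DIM, bytes_per_value=2)

print(float32_bytes / 1e9)  # 20.48 GB at float32
print(float16_bytes / 1e9)  # 10.24 GB at float16
```

At this scale the vectors fit in the memory of a single large machine, so the interesting design questions are index recall, update strategy, and how to blend vector scores with text metadata rather than raw capacity.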