Vision Models (CNN → ViT)

Origin and intuition

For 50+ years, computer vision was a hand-engineered pipeline: SIFT, HOG, and SURF features; bag-of-visual-words; SVMs on top. Each component reflected a hypothesis about which visual structure mattered: edges, gradients, corners, spatial pyramids. The pipelines worked on narrow benchmarks but were brittle in the wild.

Convolutional neural networks (CNNs), proposed by LeCun for handwritten digit recognition in the late 1980s and scaled by Krizhevsky, Sutskever, and Hinton’s AlexNet on ImageNet in 2012, replaced the entire feature-engineering stack with learned features. The key inductive biases were (a) local connectivity — early layers only see local pixel neighborhoods, mirroring the receptive field of the visual cortex; (b) weight sharing — the same convolutional filter slides across the image, encoding translation equivariance; (c) hierarchical composition — stacks of convolutions build up features from edges to textures to object parts to whole objects.
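Local connectivity and weight sharing can be checked numerically. The sketch below is a minimal pure-Python illustration, not an excerpt from any library: the helper names `conv2d_valid` and `shift_right` and the toy image are made up for this example. It slides one shared 3×3 filter over an image and confirms that shifting the input shifts the output by the same amount, away from the border.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D image (stride 1, no padding).

    Like deep-learning frameworks, this is technically cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            # Local connectivity: each output value sees only a kh x kw window.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out


def shift_right(grid, k):
    """Shift every row k columns to the right, filling with zeros."""
    return [[0] * k + row[:len(row) - k] for row in grid]


# Toy 5x7 image with a bright vertical bar (hypothetical data).
image = [[0, 0, 1, 1, 0, 0, 0] for _ in range(5)]
sobel_x = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # horizontal-gradient filter

k = 2
conv_then_shift = shift_right(conv2d_valid(image, sobel_x), k)
shift_then_conv = conv2d_valid(shift_right(image, k), sobel_x)

# Translation equivariance: both orders agree wherever the shifted-in
# zero border cannot interfere (columns k and beyond).
equivariant = all(
    conv_then_shift[i][j] == shift_then_conv[i][j]
    for i in range(len(conv_then_shift))
    for j in range(k, len(conv_then_shift[0]))
)
print(equivariant)
```

The same nine weights are reused at every position, which is why the filter responds to the bar wherever it appears.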

AlexNet won ImageNet 2012 by roughly 10 percentage points of top-5 error (15.3% versus 26.2% for the runner-up). By 2015, ResNet (He et al.) had pushed top-5 error below the commonly cited human-level estimate on ImageNet classification, using residual connections to train networks 50-152 layers deep. CNNs dominated vision for the next decade: VGG, GoogLeNet/Inception, ResNet, ResNeXt, EfficientNet. The architectural progression was about depth, parameter efficiency, and training tricks; the core convolutional inductive bias was taken for granted.

Then in 2020, Dosovitskiy et al. at Google published “An Image is Worth 16×16 Words”, the Vision Transformer (ViT) paper. The proposal: split the image into a grid of patches of 16×16 pixels each, flatten each patch to a vector, and treat the resulting sequence the same way a transformer treats text tokens. No convolutions at all. With enough pretraining data (the 300M-image JFT-300M dataset), this plain transformer matched or exceeded the best CNNs on ImageNet. The strong inductive biases of CNNs (local connectivity, translation equivariance) turned out to be helpful at small data scales but unnecessary, and eventually limiting, at large scales.

By 2022-2023, ViT and its descendants had become the default backbone for image classification, object detection (DETR, DINO), segmentation (SAM), and the vision side of multimodal models (CLIP, Flamingo, GPT-4V, Gemini). CNNs remain competitive in efficiency-constrained settings (mobile, edge), but the frontier shifted decisively to transformers. The 2024-2026 trend is hybrid — convolutional stems feeding transformer trunks, or hierarchical transformers (Swin) that recover some of the locality bias the original ViT discarded.

Inputs and outputs

A vision model consumes an image and produces some structured output. The interface depends on the task:

Classification. Input: image. Output: probability distribution over K classes (1000 for ImageNet, 21K for ImageNet-21K, custom for fine-tuned models).
Object detection. Input: image. Output: a set of (bounding box, class, confidence) tuples.
Segmentation. Input: image. Output: per-pixel class label (semantic) or per-pixel instance ID (instance).
Embedding / representation. Input: image. Output: a fixed-size feature vector, used downstream for retrieval, similarity search, or as input to a multimodal model.
Backbone for generation. Same shape as embedding, but consumed by a generative decoder (diffusion U-Net, autoregressive image transformer).
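To make the embedding interface concrete, here is a minimal sketch of similarity search over fixed-size feature vectors. The three-dimensional vectors, the image names, and the helper `top_k` are all hypothetical; a real system would use embeddings from a vision backbone with hundreds of dimensions and an approximate nearest-neighbor index.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def top_k(query, index, k=2):
    """Return the names of the k stored embeddings most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for name, emb in index.items():
        e = l2_normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, e)), name))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy index: name -> embedding produced by some image encoder.
index = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
neighbors = top_k([1.0, 0.1, 0.0], index, k=2)
print(neighbors)
```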

Pixel input. A standard input is an RGB image at fixed resolution (224×224, 384×384, 512×512 for CNNs and ViTs; 1024×1024+ for detection/segmentation models). The image is normalized per-channel (mean/std from ImageNet for transfer learning).
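Per-channel normalization is a one-line formula per pixel. The sketch below assumes the mean and standard deviation values that are commonly used for ImageNet-pretrained models; the text above does not list them, so treat the constants and the helper names as assumptions and check them against the preprocessing of whichever checkpoint you load.

```python
# Commonly used ImageNet statistics for RGB values scaled to [0, 1].
# Assumption: the source text only says "mean/std from ImageNet".
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Map one (R, G, B) pixel from 0-255 integers to normalized floats."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


def normalize_image(image):
    """Normalize an H x W grid of (R, G, B) pixels."""
    return [[normalize_pixel(px) for px in row] for row in image]


# A 1x2 toy image: one black pixel, one white pixel.
normalized = normalize_image([[(0, 0, 0), (255, 255, 255)]])
print(normalized[0][0], normalized[0][1])
```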

Patch input (ViT). A 224×224 RGB image with 16×16 patches yields (224/16)² = 196 patches. Each patch holds 16×16×3 = 768 numbers and is linearly projected to d_model to form its input embedding. Prepend a [CLS] token, add positional embeddings (197 tokens in total), and feed the sequence to the transformer.
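The patch arithmetic can be verified with a small pure-Python sketch. `patch_grid` and `patchify` are hypothetical helpers written for this example; real implementations do the same reshaping with tensor operations or a strided convolution.

```python
def patch_grid(height, width, patch):
    """Number of patch rows and columns for a given image and patch size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return height // patch, width // patch


def patchify(image, patch):
    """Turn an H x W x C nested list into a row-major list of flat patches."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = []
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches


# A blank 224x224 RGB image stands in for real pixel data.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)

num_patches = len(patches)         # patches per image
values_per_patch = len(patches[0])  # numbers fed to the linear projection
sequence_length = num_patches + 1   # plus the [CLS] token
print(num_patches, values_per_patch, sequence_length)
```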

Architecture diagram

CNN backbone (ResNet-style). Stacked convolutional blocks with residual connections, downsampling at stage boundaries:

Input image (224×224×3)
   │
   ▼
 ┌──────────────────────────────────────┐
 │ 7×7 conv, stride 2 + BatchNorm + ReLU│
 │ 3×3 maxpool, stride 2                 │
 └──────────────────────────────────────┘
   │  56×56×64
   ▼
 ┌──────────────────────────────────────┐
 │ Stage 1: 3 residual blocks (64 ch)    │
 └──────────────────────────────────────┘
   │  56×56×64
   ▼
 ┌──────────────────────────────────────┐
 │ Stage 2: 4 residual blocks (128 ch)   │
 │   first block downsamples to 28×28    │
 └──────────────────────────────────────┘
   │  28×28×128
   ▼
 ┌──────────────────────────────────────┐
 │ Stage 3: 6 residual blocks (256 ch)   │
 │   first block downsamples to 14×14    │
 └──────────────────────────────────────┘
   │  14×14×256
   ▼
 ┌──────────────────────────────────────┐
 │ Stage 4: 3 residual blocks (512 ch)   │
 │   first block downsamples to 7×7      │
 └──────────────────────────────────────┘
   │  7×7×512
   ▼
 Global average pool  →  Linear classifier  →  1000 logits

 Residual block:
    x ──▶ 3×3 conv ──▶ BN ──▶ ReLU ──▶ 3×3 conv ──▶ BN ──▶ + ──▶ ReLU
    │                                                       ▲
    └───────────────────── identity ────────────────────────┘
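The spatial sizes in the CNN diagram follow from the standard convolution output formula, floor((n + 2·padding − kernel) / stride) + 1. The sketch below traces them under assumptions the diagram does not state: padding 3 for the 7×7 stem, padding 1 for the 3×3 max pool, and a stride-2 3×3 convolution with padding 1 for each downsampling block. The stage layout of 3/4/6/3 two-convolution blocks matches the usual ResNet-34 configuration; `resnet_style_shapes` is a hypothetical helper.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_style_shapes(size=224):
    """Trace (name, side length, channels) through the diagrammed backbone."""
    shapes = []
    size = conv_out(size, kernel=7, stride=2, padding=3)  # assumed padding
    shapes.append(("stem conv", size, 64))
    size = conv_out(size, kernel=3, stride=2, padding=1)  # assumed padding
    shapes.append(("maxpool", size, 64))
    stage_channels = [64, 128, 256, 512]
    for idx, channels in enumerate(stage_channels, start=1):
        if idx > 1:
            # The first block of stages 2-4 halves the resolution.
            size = conv_out(size, kernel=3, stride=2, padding=1)
        shapes.append((f"stage {idx}", size, channels))
    return shapes


shapes = resnet_style_shapes(224)
for name, side, channels in shapes:
    print(f"{name:10s} {side}x{side}x{channels}")
```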

Vision Transformer (ViT). Patchify, embed, then a plain transformer encoder stack:

Input image (224×224×3)
   │
   ▼
 ┌──────────────────────────────────────┐
 │ Split into 14×14 = 196 patches        │
 │ Each patch: 16×16×3 = 768 values      │
 └──────────────────────────────────────┘
   │
   ▼
 Linear projection (768 → d_model, e.g. 768/1024/1280)
   │
   ▼
 Prepend [CLS] token  +  Add learned positional embeddings
   │
   ▼  (197 tokens × d_model)
 ┌──────────────────────────────────────┐
 │ Transformer encoder block × L         │
 │   (LN → MSA → +res → LN → MLP → +res) │
 └──────────────────────────────────────┘
   │  (197 tokens × d_model)
   ▼
 Take [CLS] token  →  LayerNorm  →  Linear  →  K logits

 Sizes:
  ViT-B/16  : 12 layers, d=768,  12 heads,  86M params
  ViT-L/16  : 24 layers, d=1024, 16 heads, 307M params
  ViT-H/14  : 32 layers, d=1280, 16 heads, 632M params
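The parameter counts in the size table can be roughly reproduced from the layer dimensions. The sketch below assumes a standard encoder layout that the text does not spell out: an MLP hidden width of 4×d_model, biases on every linear layer, a 224×224 input, and a 1000-class head. `vit_param_count` is a hypothetical helper, and its output is an estimate that should land within a percent or two of the listed figures, not match them exactly.

```python
def vit_param_count(layers, d_model, patch=16, image=224, channels=3,
                    mlp_ratio=4, num_classes=1000):
    """Estimate ViT parameters under the assumptions stated in the lead-in."""
    tokens = (image // patch) ** 2 + 1                 # patches + [CLS]
    patch_embed = patch * patch * channels * d_model + d_model
    cls_and_pos = d_model + tokens * d_model           # [CLS] + positions
    attention = 4 * d_model * d_model + 4 * d_model    # Q, K, V, output proj
    hidden = mlp_ratio * d_model
    mlp = d_model * hidden + hidden + hidden * d_model + d_model
    layer_norms = 2 * (2 * d_model)                    # two LayerNorms per block
    per_layer = attention + mlp + layer_norms          # about 12 * d_model**2
    final_norm = 2 * d_model
    head = d_model * num_classes + num_classes
    return patch_embed + cls_and_pos + layers * per_layer + final_norm + head


vit_b16 = vit_param_count(layers=12, d_model=768, patch=16)
vit_l16 = vit_param_count(layers=24, d_model=1024, patch=16)
vit_h14 = vit_param_count(layers=32, d_model=1280, patch=14)
for name, count in [("ViT-B/16", vit_b16), ("ViT-L/16", vit_l16), ("ViT-H/14", vit_h14)]:
    print(f"{name}: {count / 1e6:.1f}M")
```

Almost all of the parameters sit in the encoder blocks, at roughly 12·d² per layer, which is why the count grows with depth times width squared.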

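Hierarchical designs (Swin, ConvNeXt). A compromise between the two: Swin restricts attention to local windows and ConvNeXt uses convolutions throughout, and both downsample resolution stage by stage like a CNN. In Swin, every stage uses windowed attention; by the last stage the feature map is small enough that a single window covers it, so attention there is effectively global. The Swin-style layout: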

Image  →  Patch embed  →  Stage 1 (windowed attn, fine res)
                      →  Patch merging (downsample 2×)
                      →  Stage 2 (windowed attn, shifted windows)
                      →  Patch merging
                      →  Stage 3 (windowed attn, more channels)
                      →  Patch merging
                      →  Stage 4 (window spans whole map, coarse res)
                      →  Head

This recovers the multi-scale feature pyramid that detection and segmentation heads benefit from, while keeping transformer flexibility.
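A short sketch shows how the feature pyramid arises. The defaults below (4×4 patch embedding, 7×7 attention windows, 96 base channels doubling at each patch-merging step) are assumptions taken from the commonly cited Swin-T configuration, not from the diagram above, and `hierarchical_stages` is a hypothetical helper.

```python
def hierarchical_stages(image=224, patch=4, window=7, base_channels=96,
                        num_stages=4):
    """Per-stage resolution, token count, channels, and window count."""
    stages = []
    side = image // patch
    channels = base_channels
    for stage in range(1, num_stages + 1):
        if stage > 1:
            side //= 2       # patch merging halves each spatial side
            channels *= 2    # and doubles the channel width
        stages.append({
            "stage": stage,
            "side": side,
            "tokens": side * side,
            "channels": channels,
            "windows": (side // window) ** 2,
        })
    return stages


stages = hierarchical_stages()
for s in stages:
    print(s)
```

Under these assumptions the last stage contains a single window, so windowed attention there covers the whole feature map.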

Training objective

The objective depends on the task and the pretraining recipe:

Supervised classification. Cross-entropy over K classes. The historical workhorse — ImageNet-1K (1.28M images, 1000 classes) for academic benchmarks; JFT-300M / JFT-3B for industrial-scale pretraining at Google.
Self-supervised contrastive (SimCLR, MoCo). Generate two augmented views of the same image, train so their embeddings are close while embeddings of other images are far. No labels needed. Useful for representation learning when labels are scarce.
Self-supervised masked image modeling (MAE, BEiT, SimMIM). Mask out a large fraction of patches (75% in MAE) and ask the model to reconstruct them, either as raw pixels (MAE, SimMIM) or as discrete visual tokens (BEiT). The vision analog of BERT’s MLM. Particularly effective for ViTs since patches are a natural masking unit.
Image-text contrastive (CLIP). Train an image encoder and a text encoder jointly such that paired image and caption embeddings are close. Pretrain on ~400M image-text pairs from the web. The resulting image encoder is one of the most useful frozen backbones in modern vision.
DINO / DINOv2 self-distillation. Student network tries to match a teacher network’s output on different augmentations. Produces remarkably strong frozen features without any labels.
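The image-text contrastive objective in the list above can be sketched as a symmetric cross-entropy over a similarity matrix. This is a minimal pure-Python illustration under stated assumptions, not the reference CLIP implementation: the function names are made up, the temperature of 0.07 is a fixed assumed value (in practice it is typically learned), and the two-dimensional embeddings are toy data.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_sum for v in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # Each image should pick out its own caption (rows) ...
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # ... and each caption its own image (columns).
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < shuffled_loss)
```

The single-modality contrastive methods (SimCLR, MoCo) use the same idea with two augmented views of an image in place of the image and caption.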

The pretraining regime determines what the model is useful for downstream. CLIP encoders are the default for retrieval and multimodal models. DINOv2 features are state-of-the-art for many dense prediction tasks. MAE-pretrained ViTs are strong starting points for detection and segmentation fine-tuning.
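For the masked-image-modeling recipes, the core data step is choosing which patches the encoder gets to see. The sketch below shows uniform random masking at a 75% ratio over the 196 patches of a 224×224 image; `random_mask` and the fixed seed are assumptions for illustration, and real implementations do this per sample on batched tensors.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, uniformly at random."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # targets to reconstruct
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked


visible, masked = random_mask(196, mask_ratio=0.75)
print(len(visible), len(masked))
```

Because the encoder runs only on the visible quarter of the patches, this style of pretraining is considerably cheaper per image than processing the full sequence.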

The vision-model landscape between 2012 and 2026:

CNN lineage (2012-2022):

AlexNet (2012). 8 layers, ReLU, dropout, GPU training. Won ImageNet 2012 by 10 percentage points.
VGG (2014). 16-19 layers, all 3×3 convolutions, very uniform structure. One of the first networks to demonstrate that depth alone helps a lot.
GoogLeNet / Inception (2014). Inception modules with parallel multi-scale convolutions. Better accuracy per FLOP.
ResNet (2015). Residual connections enabled training 50-152 layer networks. The architectural pattern (skip connections) is now everywhere — transformers, U-Nets, every modern deep network.
DenseNet (2016). Every layer feeds into every subsequent layer. Memory-hungry but parameter-efficient.
MobileNet / EfficientNet (2017-2019). Depthwise-separable convolutions and compound scaling. Strong accuracy-per-FLOP for mobile deployment.
ConvNeXt (2022). Modernized ResNet design — large kernels, layer norm, GELU, fewer activations. Showed that CNNs aren’t intrinsically worse than ViTs at the same compute; the gap was about training recipes and design choices.

Transformer era (2020-present):

ViT (2020). The original. Plain transformer over patches. Needs lots of pretraining data to match CNNs.
DeiT (2021). Data-efficient training recipe for ViT. Closed much of the data gap: a ViT-S-sized model trained on ImageNet-1K alone is competitive with ResNet-50.
Swin Transformer (2021). Hierarchical, windowed attention. Restored multi-scale feature pyramids for dense prediction.
MAE (2022). Masked autoencoder pretraining. Mask 75% of patches, reconstruct them. Strong self-supervised pretraining.
DINOv2 (2023). Frozen-feature king. Self-distilled ViTs that work out-of-the-box for dense prediction.
SAM, SAM 2 (Segment Anything, 2023-2024). ViT backbone + prompt-conditioned decoder. Zero-shot segmentation on essentially any image.
CLIP / OpenCLIP (2021-2024). Image-text contrastive at scale. The default vision encoder for multimodal LLMs.

CNN (ResNet-style): strong inductive biases (locality, translation equivariance), trains well on small data, efficient for mobile/edge deployment. Tends to saturate at moderate scale; doesn’t benefit as much from massive data and compute.

ViT: weaker inductive biases, needs more pretraining data to match CNNs, but scales much better. At large scale (> 100M images, > 300M parameters) it outperforms comparable CNNs on classification, detection, and segmentation. Default for multimodal vision encoders.

Practical considerations

Data scale determines which architecture wins. Below ~1M training images, well-tuned CNNs (or modern ConvNeXt-style designs) match or beat ViTs. Above ~100M images, ViTs consistently win. The crossover is roughly where the CNN’s locality bias stops being a head start and starts being a limitation. If you’re doing supervised fine-tuning on a few thousand images, start with a CNN backbone or a pretrained ViT — don’t train a ViT from scratch.

Resolution scaling. CNN compute scales linearly with the number of pixels (FLOPs ∝ H × W). ViT self-attention scales quadratically with the number of patches: the attention matrix costs O(N²), where N = (H/p)(W/p). Doubling the resolution along each side quadruples N and therefore multiplies the attention cost by about 16; the projection and MLP layers grow only linearly in N, so the total cost grows by somewhere between 4× and 16×. For high-resolution applications (medical imaging, satellite imagery, document understanding), hierarchical/windowed transformers (Swin) or hybrid CNN-transformer architectures avoid the quadratic cost.
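The scaling difference is easy to check by counting query-key pairs. In the sketch below, global attention compares every token with every other token, while windowed attention compares each token only with the tokens in its own window. The function names, the 16-pixel patch, and the 7×7 window are assumptions chosen for illustration; the counts ignore heads, channel width, and the linear layers.

```python
def num_tokens(height, width, patch=16):
    """Patch tokens for an image, N = (H/p)(W/p)."""
    return (height // patch) * (width // patch)


def global_attention_pairs(height, width, patch=16):
    """Entries in the full N x N attention matrix."""
    n = num_tokens(height, width, patch)
    return n * n


def windowed_attention_pairs(height, width, patch=16, window=7):
    """Entries when each token attends only within its window x window block."""
    n = num_tokens(height, width, patch)
    return n * window * window


# Doubling the side length from 224 to 448 quadruples the token count.
token_ratio = num_tokens(448, 448) // num_tokens(224, 224)
global_ratio = global_attention_pairs(448, 448) // global_attention_pairs(224, 224)
windowed_ratio = windowed_attention_pairs(448, 448) // windowed_attention_pairs(224, 224)
print(token_ratio, global_ratio, windowed_ratio)
```

Global attention grows with the square of the token count, while windowed attention grows in proportion to it, which is the motivation for windowed designs at high resolution.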

Pretraining is mandatory at frontier scale. No one trains frontier vision models from scratch. The standard is: take a pretrained backbone (CLIP, DINOv2, SAM, MAE), fine-tune on the target task. Frozen-feature transfer (no backbone updates, just train a head) often gets within 1-3% of full fine-tuning at a fraction of the compute.

Augmentation matters more than people credit. RandAugment, Mixup, CutMix, random erasing, and similar tricks routinely contribute 1-3% accuracy on ImageNet. For ViTs especially, strong augmentation is part of why they can be trained on much smaller datasets than the original ViT paper suggested.
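As one concrete example, Mixup blends two training examples and their labels with the same random weight. The sketch below is a minimal pure-Python version on flat feature lists; the helper name `mixup`, the Beta parameter alpha=0.2, and the toy inputs are assumptions for illustration, and production code applies the same blend to image tensors within a batch.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy "images" (flattened pixel values) with one-hot labels.
cat_pixels, cat_label = [1.0, 1.0, 0.0, 0.0], [1.0, 0.0]
dog_pixels, dog_label = [0.0, 0.0, 1.0, 1.0], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label,
                              rng=random.Random(42))
print(lam, mixed_x, mixed_y)
```

The blended label stays a valid probability distribution, so the usual cross-entropy loss applies unchanged.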

Patch size is a hyperparameter, not a constant. ViT-B/16 uses 16×16 patches; smaller patches (ViT-B/8) capture more detail but cost more compute. The right patch size depends on the smallest meaningful feature in your domain — text-in-images needs smaller patches; medical pathology slides need still smaller.

Real-world deployments

Vision models are pervasive in production:

Search and retrieval. Google Image Search, Bing Visual Search, Pinterest visual search, Amazon visual search — all use CNN or ViT embeddings for nearest-neighbor lookup in massive image indices.
Medical imaging. Radiology AI (chest X-ray triage, diabetic retinopathy screening, dermatology), pathology slide analysis, ophthalmology. CNNs and ViTs are both deployed; high image resolution and per-image annotation costs favor hybrid architectures.
Autonomous vehicles. Tesla Autopilot, Waymo, Cruise — multi-camera vision stacks with CNN backbones, transformer-based perception heads (BEVFormer, occupancy networks), increasingly end-to-end neural planning.
Content moderation. Meta, YouTube, TikTok, Pinterest — fine-tuned CNNs/ViTs classifying every uploaded image at scale.
Face recognition, identity verification. ArcFace and descendants — ResNet/ViT backbones trained with angular margin losses. iPhone Face ID, banking onboarding KYC, airport biometrics.
Manufacturing QA. Defect detection on production lines, agriculture (crop disease identification), retail shelf monitoring. Often fine-tuned EfficientNet or YOLO derivatives.
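Vision branches of multimodal LLMs. Open models such as LLaVA feed a CLIP-style ViT encoder into the language model, and closed models (GPT-4V, Claude 3+ vision, Gemini) are generally assumed to follow a similar pattern, although their architectures are not fully published. The vision encoder is usually frozen or only lightly fine-tuned.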
Diffusion model backbones. Stable Diffusion’s U-Net is a convolutional backbone with attention; SD3 and Sora use full transformer (DiT) backbones. The image-generation side of the stack runs on the same families of vision architectures as the perception side.

The CNN-to-ViT transition isn’t complete. Mobile deployments still favor lightweight CNNs (MobileNet, EfficientNet); on-device camera processing in phones runs convolutional pipelines. But everywhere there’s significant compute available, the trajectory is transformers — and increasingly, transformers as components of multimodal language models rather than standalone image classifiers.

Why the inductive-bias story is more nuanced than 'transformers win'

The popular narrative is “transformers beat CNNs because they’re more flexible”. The actual story is more nuanced. CNNs encode strong priors (locality, translation equivariance, hierarchical composition) that match how natural images are structured. These priors give CNNs a head start at small data scales. Transformers encode much weaker priors (permutation equivariance, broken only by the positional encoding), so they need more data to learn the same regularities.

At roughly 1-10M images and below, the CNN’s priors are correct and helpful, and CNNs win. At ~100M+ images, the priors are still mostly correct but now redundant, because the transformer has learned them from data. They are also limiting: they keep the model from learning the corner cases where the priors don’t fit (rotated objects, textures that are not translation-invariant).

So the right answer isn’t “transformers are better” but “the optimal inductive bias depends on the data scale, and we crossed the threshold where weak priors win around 2020”. The hierarchical and hybrid designs (Swin, ConvNeXt-V2, hierarchical ViTs) are a recognition that you want the priors back, just not the convolutional rigidity. ConvNeXt is “what if a CNN were redesigned with everything we learned from training ViTs”, and it’s competitive: at the same scale, the architectural family matters less than the training recipe.