The Future of Generative AI

Summary

In 2017 the field had transformers. By 2020 it had pretraining at scale. By 2023 it had instruction-following and chat. By 2025 it had reasoning models and tool-use agents. Each of those transitions felt sudden inside its 6–12 month window and obvious in retrospect.

In May 2026, five trends are visible enough that betting on them is reasonable: reasoning-trained models becoming the default, autonomous agents shifting from demos to production, multimodality merging into the model rather than being bolted on, on-device inference closing the gap with frontier APIs, and the field bumping into a compute wall that’s reshaping how progress is funded and distributed.

This is a forward-looking piece. Read it as a snapshot of which way the field is moving, not a forecast of where it lands.

What’s changing

Reasoning-trained models become the default

The pattern that started with OpenAI’s o1 in late 2024 — train the model with reinforcement learning to produce long internal reasoning traces before answering — has propagated to every frontier lab. By mid-2026 it is no longer a separate “reasoning” mode; it’s the default behaviour of frontier models, with the trace either hidden, surfaced, or token-budgeted by the caller.

The engineering implication is that latency budgets get noisier. A reasoning model on a hard problem can emit thousands of “thinking” tokens; on an easy one it emits dozens. Per-call latency variance has gone up by about an order of magnitude, and product UI has had to adapt — streaming the reasoning, showing progress indicators, or running fast paths in parallel with slow ones and merging.
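The "fast path in parallel with a slow path" pattern mentioned above can be sketched with asyncio. This is a minimal illustration, not a production design: `fast_path`, `slow_path`, and the `budget_s` parameter are hypothetical stand-ins, and the sleeps simulate model latency rather than calling any real API.

```python
import asyncio


async def fast_path(prompt: str) -> str:
    # Hypothetical stand-in for a small, low-latency model call.
    await asyncio.sleep(0.01)
    return f"fast answer to: {prompt}"


async def slow_path(prompt: str, delay: float) -> str:
    # Hypothetical stand-in for a reasoning model; `delay` simulates the
    # variable time spent emitting "thinking" tokens.
    await asyncio.sleep(delay)
    return f"reasoned answer to: {prompt}"


async def answer(prompt: str, slow_delay: float, budget_s: float) -> tuple[str, str]:
    """Start both paths; prefer the reasoned answer if it lands within budget."""
    fast_task = asyncio.create_task(fast_path(prompt))
    slow_task = asyncio.create_task(slow_path(prompt, slow_delay))
    try:
        # wait_for cancels slow_task for us if the budget is exceeded.
        result = await asyncio.wait_for(slow_task, timeout=budget_s)
        fast_task.cancel()  # no-op if the fast path already finished
        return ("slow", result)
    except asyncio.TimeoutError:
        # Budget exhausted: fall back to the fast answer, which has been
        # running concurrently the whole time.
        return ("fast", await fast_task)
```

Calling `asyncio.run(answer("q", slow_delay=1.0, budget_s=0.05))` falls back to the fast path, while a short `slow_delay` returns the reasoned answer. The design choice here is that the latency budget, not the model, decides which answer the user sees.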

Agents become deployable, not just demo-able

For most of 2023 and 2024, agent demos worked once on stage and broke twice in production. The combination of better tool-use training, longer effective context, retry/replanning loops, and structured environments (browsers and IDEs designed for agents, not just for humans) has shifted that.

What’s working in production now:

Coding agents operating against version control with isolated work trees, running tests as part of their loop, opening pull requests. The shift from “complete this function” to “complete this task” has changed how teams structure repositories.
Browsing agents that can complete multi-step web flows — book a flight, file a form, gather research — within a defined scope. Reliability outside that scope is still poor.
Workflow agents that orchestrate fixed business processes with branching: account reconciliation, refund handling, customer onboarding. These look more like robust state machines with LLM nodes than like the “open-ended autonomous agent” the term used to imply.

The broader autonomous-agent vision — give a model a goal and it figures everything out — is still mostly aspirational. The deployable shape is constrained scope plus a strong sandbox plus human review on consequential actions.
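The "state machine with LLM nodes" shape, combined with human review on consequential actions, might look roughly like the sketch below for refund handling. Everything here is an assumption made for illustration: the states, the `REVIEW_THRESHOLD` value, and `keyword_classifier`, which stands in for a model call so the example runs without one.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    AUTO_APPROVED = "auto_approved"
    NEEDS_REVIEW = "needs_review"
    REJECTED = "rejected"


# Hypothetical policy threshold: refunds above this always go to a human.
REVIEW_THRESHOLD = 100.0
EXPECTED_LABELS = {"valid", "invalid", "unclear"}


def keyword_classifier(message: str) -> str:
    """Stand-in for the LLM node. A real system would call a model here."""
    text = message.lower()
    if "damaged" in text or "never arrived" in text:
        return "valid"
    if "changed my mind" in text:
        return "invalid"
    return "unclear"


def step(
    amount: float,
    message: str,
    classify: Callable[[str], str] = keyword_classifier,
) -> State:
    """One transition for a new refund request.

    The model only produces a label; ordinary code decides the transition.
    """
    label = classify(message)
    if label not in EXPECTED_LABELS:
        # Anything outside the expected label set is treated as unclear,
        # so unexpected model output cannot pick a transition by itself.
        label = "unclear"
    if label == "invalid":
        return State.REJECTED
    if label == "valid" and amount <= REVIEW_THRESHOLD:
        return State.AUTO_APPROVED
    # Large amounts and unclear cases are routed to a human reviewer.
    return State.NEEDS_REVIEW
```

The point of the structure is that the set of reachable states is fixed in code, so the worst a misbehaving model node can do is send a request to review.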

Multimodality merges into the model

Until recently, “multimodal” meant separate vision/audio encoders feeding a language model. The 2025–2026 generation of frontier models is natively multimodal: trained end-to-end on interleaved text, images, audio, and video tokens. The architecture is unified, the tokenizer is unified, and at inference any combination of modalities can be input and output.

The practical consequences:

Document understanding stops being OCR + LLM and becomes one model. Layout, charts, and figures stay coherent with the surrounding text.
Voice interfaces stop being three-stage ASR + LLM + TTS pipelines and become one model that ingests audio and emits audio. Latency drops; emotional prosody survives the round-trip.
Video is the harder frontier. Generation quality is impressive at clip length; long-form coherence (minutes, not seconds) is still unsolved. Understanding is moving faster than generation.

On-device inference closes a meaningful gap

Three things converged: model architectures designed for mobile (Phi, Gemma, MiniCPM, Qwen-small), aggressive quantization (4-bit and even 2-bit weights at near-full quality), and dedicated NPU hardware in consumer devices. A 7B-class model now runs on a flagship phone with reasonable latency; a 30B-class model runs on a recent laptop without a discrete GPU.

This doesn’t replace frontier API calls (the capability gap to the top of the curve is still real), but it changes the architecture of products. Many features that previously required a round-trip (autocomplete, simple summarization, voice transcription, on-device search) now run locally with no token bill, no privacy concession, and no failure when the device is offline.

Most production stacks in 2026 are tiered: on-device for cheap and private operations, mid-tier hosted for routine work, frontier API only when needed.
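A tiered stack implies some routing decision per request. The sketch below is one hypothetical way to express it; the `Request` fields, the `LOCAL_TASKS` set, and the rule order are assumptions chosen for illustration, not a description of any particular product.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on_device"
    HOSTED_MID = "hosted_mid"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Request:
    task: str                   # e.g. "autocomplete", "summarize", "plan"
    needs_deep_reasoning: bool  # set by the caller or an upstream heuristic


# Hypothetical set of tasks the local model is assumed to handle well enough.
LOCAL_TASKS = {"autocomplete", "summarize", "transcribe", "search"}


def route(req: Request) -> Tier:
    """Pick the cheapest tier that is assumed adequate for the request."""
    if req.task in LOCAL_TASKS:
        # Cheap and private: no round-trip, no token bill.
        return Tier.ON_DEVICE
    if req.needs_deep_reasoning:
        # Only escalate to the frontier API when the task calls for it.
        return Tier.FRONTIER
    return Tier.HOSTED_MID
```

For example, `route(Request("autocomplete", False))` stays on the device, while `route(Request("plan", True))` escalates to the frontier tier.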

The compute wall

Training a frontier model is now a multi-hundred-million-dollar undertaking. The number of organizations that can do it at all is single-digit. Scaling laws still hold (a bigger model plus more data plus more compute does still give more capability), but each doubling of compute roughly doubles the bill, while the marginal value of the next capability tier compresses.

Several things are happening at once in response:

Heavier post-training spend. A bigger fraction of total cost is going into RLHF/RLAIF, reasoning training, and tool-use training rather than raw pretraining FLOPs. Data quality is taking budget from data quantity.
Distillation and small models. The frontier model is the teacher; the deployed model is a distilled student. The economic motive is unavoidable.
Specialized hardware. Access to custom training accelerators (TPUs, Trainium, Maia), inference accelerators (Groq, Cerebras), and each new HBM generation in the memory-bandwidth race now limits who can train and serve what.
Test-time compute as a lever. “Pay more inference compute to get a better answer” — used heavily in reasoning models — is a way to extract capability without growing the model. It moves cost from training (a one-time bill) to inference (a per-call bill), which changes who can afford which capability.
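The shift from a one-time training bill to a per-call inference bill is simple arithmetic, sketched below. All dollar figures are made-up placeholders used only to show the shape of the trade-off; none of them come from a real model or provider.

```python
def total_cost(training_cost: float, cost_per_call: float, calls: int) -> float:
    """One-time training bill plus the per-call inference bill."""
    return training_cost + cost_per_call * calls


def break_even_calls(extra_training_cost: float, extra_cost_per_call: float) -> float:
    """Call volume at which paying more per call matches paying more up front."""
    if extra_cost_per_call <= 0:
        raise ValueError("extra_cost_per_call must be positive")
    return extra_training_cost / extra_cost_per_call


# Illustrative placeholder numbers only.
# Option A: a larger model that needs little test-time compute.
A_TRAINING, A_PER_CALL = 200e6, 0.01
# Option B: a smaller model that spends more inference compute per answer.
B_TRAINING, B_PER_CALL = 100e6, 0.05

if __name__ == "__main__":
    # Extra training cost of A is 100e6; extra per-call cost of B is 0.04.
    volume = break_even_calls(100e6, 0.04)
    print(f"break-even at about {volume:.3g} calls")
    for calls in (10**9, 10**10):
        a = total_cost(A_TRAINING, A_PER_CALL, calls)
        b = total_cost(B_TRAINING, B_PER_CALL, calls)
        print(f"{calls:.0e} calls: A={a:.3g}, B={b:.3g}")
```

With these placeholder figures the break-even sits at about 2.5 billion calls: below that volume the test-time-compute option is cheaper overall, above it the larger training run is. That is the sense in which the lever changes who can afford which capability.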

Open problems

Things that are still genuinely unsolved as of mid-2026:

Reliable long-horizon agency. Agents that operate for hours or days with high reliability on open-ended tasks. Current systems break down past 10–30 minute horizons without human steering.
Hallucination calibration. Models that know what they don’t know and refuse or defer rather than confabulate. Progress is real, but the problem is not solved; production systems still need external verification.
Continual learning. Updating a deployed model with new knowledge without retraining and without losing old capability. RAG fills part of the gap; the underlying problem of weight updates is still open.
Long-context coherence. Million-token context windows exist; reasoning across them coherently does not. Most long-context benchmarks are needle-in-haystack tasks, which are easier than synthesizing across the whole context.
Evaluation at scale. Knowing whether the new model version is better than the old, on the dimensions you care about, without weeks of human evaluation. LLM-as-judge helps; it’s not enough.
Safety against capable misuse. Models that refuse harmful asks reliably, including ones embedded in prompts, retrieved documents, or tool outputs. Prompt-injection through retrieved content is the most-exploited failure surface in 2026.

Risks and mitigations

The risk surface has changed shape since 2023 — the early frame of “models say wrong or harmful things” is still real but is now the easier half of the problem. The harder half is what happens when those models take actions.

What teams are watching:

Agentic risk. Models that can move money, send messages, modify files, or call APIs need stronger guardrails than models that just emit text. Capability gating (require human review for irreversible actions), sandboxing (run in isolated environments), and structured action surfaces (only allow specific verbs) are the practical tools.
Indirect prompt injection. Untrusted text in retrieved documents, browsed pages, or tool outputs becoming instructions the model follows. The defense is structural — never let the model treat content from an untrusted channel as instructions — but is hard to enforce in long agentic loops.
Model and data provenance. Training data licensing, output watermarking, deepfake detection. Mostly unsolved at scale; legal and regulatory pressure is real and growing.
Concentration of capability. A few labs at the frontier, with a real and widening gap to the rest. The mitigation isn’t technical; it’s policy, open-weights ecosystems, and antitrust.
Energy and compute footprint. Inference at frontier-model volume has measurable grid impact. Datacenter siting, power contracts, and efficiency improvements are now first-class engineering concerns at the largest providers.
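The capability gating and structured action surface described under agentic risk can be sketched as a small check that runs before any tool call executes. The verb names and the split between reversible and irreversible actions are hypothetical; a real deployment would define its own.

```python
from dataclasses import dataclass, field

# Hypothetical action surface: the only verbs the agent may request.
ALLOWED_VERBS = {"read_file", "draft_email", "send_email", "transfer_funds"}
# Verbs treated as irreversible and therefore gated behind human approval.
IRREVERSIBLE_VERBS = {"send_email", "transfer_funds"}


@dataclass
class Action:
    verb: str
    args: dict = field(default_factory=dict)


def gate(action: Action, human_approved: bool = False) -> str:
    """Return 'execute', 'needs_review', or 'deny' for a requested action."""
    if action.verb not in ALLOWED_VERBS:
        # Structured action surface: unknown verbs are rejected outright.
        return "deny"
    if action.verb in IRREVERSIBLE_VERBS and not human_approved:
        # Capability gating: irreversible actions wait for a human.
        return "needs_review"
    return "execute"
```

Because the decision depends only on the requested verb and an approval flag set outside the model, text injected through a retrieved document cannot widen the action surface; it can at most request an action that is already allowed.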

The optimistic frame: agents take over toil, on-device inference democratizes capability, multimodality unlocks accessibility, reasoning models make hard analytical work cheaper. Each tier of capability becomes affordable to a wider base over time.

The realist frame: capability concentrates at a few labs; the deployment surface gets harder to secure as agents take consequential actions; evaluation lags capability; trust is lost faster than it’s regained. The technology gets better and harder to govern simultaneously.

Both are partially true. Which one dominates depends less on the technology than on the institutional and product choices made around it.

What to watch

A short list of things worth tracking through 2026 and into 2027:

The price-per-capability curve. Token costs for a given capability level have fallen ~10× per year for three years. If that continues, more product surfaces become economic. If it flattens, fewer do.
Reasoning-model evaluation benchmarks. New benchmarks come and saturate quickly; the meta-question is whether the field has any benchmark that doesn’t saturate within a year of release.
Open-weights frontier. How far behind the closed-frontier the strongest open-weights model is. The gap was ~6 months in early 2024, closer to 12 by 2025; the 2026 trend isn’t yet clear.
Agentic-platform consolidation. Which environments — browsers, IDEs, OSes — get first-class agent APIs and which stay scraping-based. The shape of the agent ecosystem in 2027 is mostly being decided here.
Regulation. The EU AI Act compliance deadlines, US executive actions, sectoral rules in finance and healthcare. The regulatory shape changes which deployments are viable and which costs accrue to which party.
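The price-per-capability point is compound arithmetic, and a short sketch makes the stakes of "continues" versus "flattens" concrete. The starting price of 10.0 per million tokens is a hypothetical placeholder; only the ~10× per year factor comes from the text above.

```python
def cost_after(initial_cost: float, annual_factor: float, years: int) -> float:
    """Cost of a fixed capability level after `years` of compounding decline.

    `annual_factor` is how many times cheaper it gets each year (10.0 = 10x).
    """
    if annual_factor <= 0:
        raise ValueError("annual_factor must be positive")
    return initial_cost / (annual_factor ** years)


if __name__ == "__main__":
    start = 10.0  # hypothetical price per million tokens
    # Three years at ~10x per year compounds to a 1000x drop.
    print(cost_after(start, 10.0, 3))
    # The same three years at a flatter 2x per year is only an 8x drop.
    print(cost_after(start, 2.0, 3))
```

Three years at 10× per year is a 1000× reduction; at 2× per year it is 8×. The difference between those two curves is what decides which product surfaces become economic.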

A short note on humility

Every snapshot of “where AI is going” written in the last decade has had a 30% hit rate at best. The 2017 predictions missed the transformer’s dominance. The 2020 predictions missed instruction-tuning. The 2023 predictions missed reasoning models. The honest version of this writeup is: read it as a 2026 vantage point, expect the unknown-unknown to be the most important thing twelve months from now, and treat the rest as scaffolding for thinking — not a forecast.