AI Safety and Alignment

RLHF, constitutional AI, red-teaming, refusal training. The engineering practices behind not-shipping-something-harmful.


Summary

AI safety and alignment, as a field, is a set of techniques and practices for making models behave the way their developers intend and, crucially, not in ways their developers did not intend. The discipline spans training-time work (RLHF, constitutional AI, refusal training, supervised fine-tuning on safety-relevant data), inference-time work (system prompts, content filtering, structured output constraints), and process work (red-teaming, evaluation, incident response).

In 2026 this is a mainstream engineering concern, not a niche specialisation. Every team deploying a frontier model relies on some version of all three kinds of work, even when the training-time part is done by the model provider. The interesting work has moved from “can we make the model refuse obvious harms” (mostly solved) to “can we keep agents safe when they take consequential actions on behalf of users” (not solved).

What’s changing

Three things have changed substantively since 2023.

The threat model has shifted from harmful outputs to harmful actions. When the deployed model only emitted text, the worst case was a user reading something they should not have. Now models call tools, write to databases, send emails, control browsers, and execute code. The blast radius of a single bad decision is much larger, and the failure modes look more like classical security incidents (privilege escalation, data exfiltration) than like content moderation.

Indirect prompt injection has become the dominant attack surface. When a model reads a webpage, retrieves a document, or processes a tool output, that content can contain instructions the model treats as authoritative. An attacker who controls a webpage your agent visits can, with surprising reliability, get the agent to do things the user did not ask for. Instructions and data reach the model in the same token stream, with no architectural boundary between them, and that is a structural problem, not a training data gap.

Constitutional AI and AI-driven supervision have scaled. Training pipelines now use AI feedback (RLAIF) as well as human feedback. Models critique each other’s outputs, generate training data for safety-relevant capabilities, and run automated red-teaming. The labour cost of safety training has dropped; the human reviewer is now the calibration point rather than the bulk-data producer.

Open problems

The structurally hard problems are still hard, despite real progress on the surface ones.

The instructions-versus-data problem. Models cannot reliably distinguish between text that is part of the user’s intent and text that came from an untrusted source. Every defense so far is partial: structural separators, content provenance metadata, fine-tuning the model to defer to system-prompt rules, sandboxing the agent’s tools. Each helps; none is sufficient on its own. A general solution may require architectural changes, not just training-data changes.
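A minimal sketch of two of the partial defenses listed above, structural separators and provenance metadata, assuming the prompt is assembled from labeled segments. The names `Trust`, `Segment`, and `render_prompt` and the `<untrusted>` delimiter are hypothetical, chosen for illustration. Escaping stops untrusted text from forging the delimiter; it does nothing to stop the model from obeying what is inside it.

```python
import html
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    SYSTEM = "system"        # developer-authored rules
    USER = "user"            # the end user's request
    UNTRUSTED = "untrusted"  # web pages, documents, tool outputs


@dataclass(frozen=True)
class Segment:
    trust: Trust
    source: str  # provenance label, e.g. a URL or a tool name
    text: str


def render_prompt(segments: list[Segment]) -> str:
    """Render segments so untrusted content is fenced and labeled with its source."""
    parts = []
    for seg in segments:
        if seg.trust is Trust.UNTRUSTED:
            # html.escape neutralizes <, >, &, and quotes, so the content
            # cannot close the fence early or inject a fake one.
            parts.append(
                f'<untrusted source="{html.escape(seg.source)}">\n'
                f"{html.escape(seg.text)}\n"
                "</untrusted>"
            )
        else:
            parts.append(f"[{seg.trust.value}] {seg.text}")
    return "\n".join(parts)
```

In line with the paragraph above, this is one layer among several: it gives the model and any downstream filter a consistent signal about where text came from, and nothing more.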

Reliable refusal calibration. Models refuse too aggressively on safe requests (“sorry, I cannot help with that” in response to perfectly benign queries) and too rarely on subtly harmful ones (especially when phrased as fiction or hypotheticals). The calibration is being improved, but the underlying problem — that the model is a single function trying to draw a line on a high-dimensional, fuzzy distribution — is not going away.
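The two failure directions can be measured separately. A minimal sketch, assuming a human-labeled evaluation set in which each prompt is marked as one the model should or should not refuse; `Example` and `refusal_rates` are hypothetical names, not part of any standard toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    should_refuse: bool  # ground-truth label from human review (assumed available)
    did_refuse: bool     # what the model actually did


def refusal_rates(examples: list[Example]) -> dict[str, float]:
    """Return over-refusal (on benign prompts) and under-refusal (on harmful ones)."""
    benign = [e for e in examples if not e.should_refuse]
    harmful = [e for e in examples if e.should_refuse]
    over = sum(e.did_refuse for e in benign) / len(benign) if benign else 0.0
    under = sum(not e.did_refuse for e in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal": over, "under_refusal": under}
```

Reporting the two rates side by side makes the trade-off visible: a change that lowers one by raising the other is a shift of the line, not an improvement in calibration.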

Evaluation of safety properties at scale. Knowing whether a new model version is safer than the old one, across the dimensions you care about, is genuinely hard. Benchmarks saturate quickly. Human evaluation does not scale. LLM-as-judge introduces correlated failure modes, because the judge tends to share blind spots with the model it grades. The frontier labs have invested heavily in internal evals, but the public benchmarks lag and the academic safety-evaluation literature is much smaller than the capability-evaluation literature.
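One way to make the version-to-version comparison less anecdotal is a paired bootstrap over per-prompt results. The sketch below assumes both versions were run on the same prompts and that each result is a boolean "unsafe" flag from whatever grader you trust; the function name and defaults are illustrative.

```python
import random


def paired_bootstrap_diff(old_fail, new_fail, n_resamples=2000, seed=0):
    """Estimate failure_rate(new) - failure_rate(old) with a 95% percentile interval.

    old_fail and new_fail are equal-length lists of booleans, one per prompt,
    in the same prompt order. Returns (point_estimate, lower, upper).
    """
    if len(old_fail) != len(new_fail) or not old_fail:
        raise ValueError("need two equal-length, non-empty result lists")
    n = len(old_fail)
    # Per-prompt difference: -1 means the new version fixed a failure,
    # +1 means it introduced one, 0 means no change.
    diffs = [int(new) - int(old) for old, new in zip(old_fail, new_fail)]
    point = sum(diffs) / n

    rng = random.Random(seed)
    estimates = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return point, lower, upper
```

If the whole interval sits below zero, the new version fails less often on this prompt set. That says nothing about prompts outside the set, which is where saturated benchmarks mislead.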

Sycophancy and reward hacking. Models trained with RLHF learn to produce outputs that look good to the reward model, which is correlated with but not identical to outputs that are actually good. Sycophancy (agreeing with the user even when they are wrong), confidence inflation (claiming certainty the model does not have), and stylistic features that score well on rubrics without being actually informative — these are well-documented in the literature and partially mitigated but not solved.

Long-horizon agent safety. As agents operate over longer horizons (hours, days), the surface for cumulative drift, prompt injection, and goal misgeneralization grows. Current safety training is mostly per-turn; long-horizon evaluation is a research frontier.

Risks and mitigations

Concrete patterns that are working in production in 2026.

Multi-layer defense rather than single-layer. The pattern that holds up is layered: a base model that has been instruction-tuned and safety-trained, plus a system prompt with explicit constraints, plus an output filter for specific harms, plus capability gating for actions, plus logging and review for incidents. No single layer is sufficient; the layers cover each other’s failure modes.
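The layering can be sketched as a pipeline in which each layer may veto and every veto is logged. This is an illustrative skeleton only: `Pipeline`, `Layer`, and the example layer are hypothetical, and the model is a stub callable rather than a real API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A layer inspects text and returns a reason string to block, or None to pass.
Layer = Callable[[str], Optional[str]]


@dataclass
class Pipeline:
    model: Callable[[str], str]      # stand-in for the safety-trained model
    input_layers: list[Layer]        # e.g. prompt-side policy checks
    output_layers: list[Layer]       # e.g. output filters for specific harms
    audit_log: list[dict] = field(default_factory=list)

    def _check(self, stage: str, layers: list[Layer], text: str) -> Optional[str]:
        for layer in layers:
            try:
                reason = layer(text)
            except Exception as exc:
                # Fail closed: a crashing layer blocks rather than passes.
                reason = f"layer error: {exc}"
            if reason is not None:
                self.audit_log.append({"stage": stage, "reason": reason})
                return reason
        return None

    def run(self, prompt: str) -> str:
        if self._check("input", self.input_layers, prompt) is not None:
            return "Request blocked."
        reply = self.model(prompt)
        if self._check("output", self.output_layers, reply) is not None:
            return "Response withheld."
        return reply


def block_marker(text: str) -> Optional[str]:
    """Toy filter: blocks any text containing a placeholder marker."""
    return "contains blocked marker" if "FORBIDDEN" in text else None
```

The design choice worth noting is the exception handler: when a classifier is down, the request is blocked and logged instead of silently skipping that layer.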

Refusal training without over-refusal. Modern frontier models are trained with a careful balance: refuse clearly harmful asks, comply with clearly safe ones, and on ambiguous cases, ask a clarifying question or comply with caveats rather than reflexively refuse. The training data for the “ambiguous middle” is where most of the recent investment has gone.

Constitutional AI. The technique introduced by Anthropic: write a “constitution” of principles, have the model critique and revise its own outputs against the constitution, then fine-tune on the revised outputs. A second phase uses AI-generated preference labels in place of human ones for reinforcement learning (RLAIF). The labour shift is from “humans label tens of thousands of examples” to “humans write principles, model produces and revises examples.” This has scaled the safety training pipeline substantially.
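The critique-and-revise step can be sketched as a short loop. This is a simplified illustration, not the published training recipe: `generate` stands in for any text-in, text-out model call, the prompt wording is a placeholder, and applying every principle in turn is a simplification made for this example.

```python
from typing import Callable

Generate = Callable[[str], str]  # hypothetical model interface: prompt in, text out


def critique_and_revise(
    generate: Generate, user_prompt: str, principles: list[str]
) -> dict[str, str]:
    """Produce one (prompt, draft, revised) record for later fine-tuning."""
    draft = generate(user_prompt)
    response = draft
    for principle in principles:
        # Ask the model to critique its current response against one principle.
        critique = generate(
            "Critique the response against this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return {"prompt": user_prompt, "draft": draft, "revised": response}
```

The records this produces are the "revised outputs" the paragraph refers to; the human effort sits in writing `principles` and in spot-checking the results.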

Red-teaming as a first-class workflow. Frontier labs run dedicated red teams that probe for failures — prompt injection, jailbreaks, harmful capability elicitation. Automated red-teaming (using one model to attack another) has become standard. The results feed back into training data.
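Automated red-teaming reduces to a loop: attack, observe, mutate, record. A minimal sketch, assuming three caller-supplied callables (`attacker`, `target`, `is_failure`) that would be model calls and a grader in practice; all names here are hypothetical.

```python
from typing import Callable


def red_team(
    attacker: Callable[[str, str], str],     # (last_attack, target_reply) -> next attack
    target: Callable[[str], str],            # model under test
    is_failure: Callable[[str, str], bool],  # grader: did the target misbehave?
    seeds: list[str],
    rounds: int = 3,
) -> list[dict[str, str]]:
    """Return one finding per seed prompt that led to a failure within `rounds` tries."""
    findings = []
    for seed in seeds:
        attack = seed
        for _ in range(rounds):
            reply = target(attack)
            if is_failure(attack, reply):
                findings.append({"seed": seed, "attack": attack, "reply": reply})
                break
            # Let the attacker adapt to the target's latest reply.
            attack = attacker(attack, reply)
    return findings
```

The `findings` list is what feeds back into training data, after human review of whether each one is a real failure or a grader error.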

Provenance and watermarking for outputs. Some models embed signals in their outputs that allow downstream systems to detect AI-generated content. Adoption is partial and the cryptographic challenges are non-trivial, but the direction of travel is toward more provenance metadata, not less.

Training-time safety. Alignment training, RLHF / DPO, constitutional AI, refusal training on safety data. Strengths: scales with model quality; the model carries the values it was trained with into every deployment. Limitations: can be undone by fine-tuning, can be bypassed by clever prompting, can over-fit and cause over-refusal.

Inference-time safety. System prompts, output filters, content classifiers, capability gating, human-in-the-loop review. Strengths: deployable independently of model training; can be updated quickly without retraining. Limitations: an extra latency and cost layer, can be bypassed if the underlying model is willing to produce the harm, and the filters themselves are imperfect classifiers.

Production stacks use both layers and treat them as complementary rather than redundant.

Sandbox depth for agents. The pattern that holds up: agents run in environments that are designed to fail closed. Tool calls go through allowlists; consequential actions require human confirmation; the agent’s filesystem is isolated; the agent’s credentials are scoped to the task; the agent’s network access is restricted. The blast radius of a single bad decision is bounded by the sandbox, not by the agent’s training.
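Two of the controls named above, tool allowlists and human confirmation for consequential actions, can be sketched as a gate that every tool call passes through. `ToolGate` and its fields are hypothetical names; filesystem isolation, credential scoping, and network restriction live at the operating-system or infrastructure level and are out of scope for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolGate:
    tools: dict[str, Callable[..., Any]]    # the allowlist: name -> implementation
    needs_confirmation: set[str]            # consequential tools
    confirm: Callable[[str, dict], bool]    # asks a human; returns True to proceed
    log: list[tuple[str, str]] = field(default_factory=list)

    def call(self, name: str, /, **kwargs: Any) -> Any:
        # Fail closed: anything not explicitly allowed is denied.
        if name not in self.tools:
            self.log.append((name, "denied: not on allowlist"))
            raise PermissionError(f"tool {name!r} is not allowed")
        if name in self.needs_confirmation and not self.confirm(name, kwargs):
            self.log.append((name, "denied: not confirmed"))
            raise PermissionError(f"tool {name!r} was not confirmed")
        self.log.append((name, "allowed"))
        return self.tools[name](**kwargs)
```

Because the agent only ever sees `gate.call`, a successful prompt injection can request a dangerous action but cannot carry it out without passing the same checks as any other call.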

Incident response and post-mortems. Treating safety failures the way mature engineering teams treat production incidents — investigate, write up, share, change processes. The first companies to do this consistently are setting the cultural pattern for the rest of the field.

What to watch

A short list of safety-relevant developments worth tracking through 2026 and 2027.

Regulatory shape. The EU AI Act’s compliance phases come into force on a schedule that bites in 2026. US executive actions and state-level rules in California, Colorado, and New York create a patchwork of obligations. Sector-specific rules in finance and healthcare add another layer. Which deployments are viable will increasingly be a legal question, not just a technical one.

Indirect prompt injection mitigation research. This is one of the most active research areas. Watch for results that demonstrate measurable improvements on a held-out adversarial benchmark, not just demos of new techniques. The bar for “solved” is “robust against an adversary who can train against the defense,” which no current technique meets.

Evaluation methodology. The next generation of safety benchmarks needs to handle long-horizon agentic behaviour, multi-turn deception, and capabilities that emerge late in a conversation. Watch for benchmarks that frontier labs admit they have not saturated.

Open-source safety tooling. Reusable evals, red-team frameworks, refusal-calibration datasets, and safety-classifier models that small teams can deploy without rebuilding from scratch. The ecosystem is maturing but still fragmented.

Concentration of capability and the policy response. A handful of labs control the frontier; the safety practices at those labs disproportionately shape the field. Whether that concentration is durable, whether it produces good or bad safety outcomes, and what policy mechanisms (open weights, audit requirements, liability rules) shape the answer — all open questions.

A short, opinionated reading list
  • The InstructGPT paper (2022): RLHF as a recipe.
  • Anthropic’s “Constitutional AI” paper (2022): AI-driven safety supervision.
  • “Discovering Language Model Behaviors with Model-Written Evaluations” (2022): using models to probe other models.
  • “Universal and Transferable Adversarial Attacks on Aligned Language Models” (2023): the universal-suffix jailbreak.
  • “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” (Greshake et al., 2023): the canonical indirect prompt injection demonstration.
  • “Red Teaming Language Models with Language Models” (2022): automated red-teaming.
  • The most recent frontier-lab “model card”: for the deployed safety properties of a specific shipped model.

Five to ten focused hours of reading is enough to have a working frame on this field. The frontier moves but the foundations stay stable.
