
AI Talk: Guardrails

  • Writer: Juggy Jagannathan

AI models are constantly becoming more powerful. Every few weeks, major vendors announce model updates touting improvements in capability and performance. And yet, persistent problems remain. Models hallucinate. They can be used maliciously. And they can cause real harm—especially when deployed in autonomous or semi-autonomous workflows.


In February 2026, the UK Government’s AI safety effort—supported by a secretariat within the UK AI Security Institute—published the International AI Safety Report 2026, a large international synthesis led by Professor Yoshua Bengio. The report surveys a range of risks from general-purpose AI systems.  In this blog, I focus on what is being done to address those risks: what “guardrails” actually mean in practice, who is working on them, and what the outlook is.


Gemini generated image. Idea mine—Gemini execution. Unfortunately, it misspelled several words and refused to correct them. At least five errors! See if you can spot them :)

The image is emblematic of the current state of AI: brilliant in many ways and oddly brittle in others. This week’s news underscores the stakes: reports describe a dispute in which the U.S. Department of War sought changes to Anthropic’s safeguards for Claude, and Anthropic publicly refused to remove certain protections.  Around the same time, Anthropic posted an updated Responsible Scaling Policy (v3.0) describing how it manages catastrophic-risk thresholds as models scale.


So what’s happening in the world of guardrails?


Baking in safety

A major thread of research aims to make safety part of the system architecture, not an afterthought. Earlier approaches relied heavily on manually designed input/output filters. Today’s systems are more layered: they combine policy enforcement, risk scoring, retrieval constraints, and (in some cases) predictive checks that try to anticipate downstream harm—especially in high-stakes domains like healthcare.
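To make the layering concrete, here is a minimal Python sketch of how such checks can be composed so that any layer can stop a request before the model is ever called. The rule list and risk scorer are invented stand-ins; production systems use trained classifiers and dedicated policy engines.

```python
import re

# Hypothetical layered guardrail pipeline: each layer runs in order,
# and any layer can block a request before the model is invoked.
BLOCKED_PATTERNS = [r"(?i)how to make a bomb", r"(?i)steal credentials"]

def policy_filter(prompt: str) -> bool:
    """Layer 1: hard policy rules on the raw input."""
    return not any(re.search(p, prompt) for p in BLOCKED_PATTERNS)

def risk_score(prompt: str) -> float:
    """Layer 2: toy risk scorer (real systems use trained classifiers)."""
    risky_terms = {"exploit", "weapon", "bypass"}
    hits = sum(term in prompt.lower() for term in risky_terms)
    return min(1.0, hits / 3)

def guarded_generate(prompt: str, model=lambda p: f"[answer to: {p}]") -> str:
    """Run the layers in order; only clean requests reach the model."""
    if not policy_filter(prompt):
        return "Request blocked by policy."
    if risk_score(prompt) > 0.5:
        return "Request flagged for review."
    return model(prompt)
```

The point of the structure is that safety checks sit in the call path itself, rather than being bolted on as an output filter afterward.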


Another emerging direction is moving beyond purely “reactive” moderation (checking outputs after they’re generated) toward “pre-emptive” controls: restricting actions, limiting tool access, requiring authorization for sensitive operations, and evaluating risk before execution. Researchers also explore “mechanistic interpretability” (peeking inside the model’s internal computations and reasoning traces) to detect unsafe tendencies, though it’s important to note that reading “intent” directly from a model is not straightforward, and the signals can be unreliable.
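As an illustration of pre-emptive control, a tool call can be checked against an allow-list and an authorization requirement before it executes, rather than filtering the result afterward. The tool names and policy below are purely illustrative:

```python
# Hypothetical pre-emptive gate for tool use: sensitive tools require
# explicit human sign-off, and unknown tools are denied by default.
SAFE_TOOLS = {"search", "calculator"}
SENSITIVE_TOOLS = {"send_email", "delete_file"}

def authorize_tool_call(tool: str, approved_by_human: bool = False) -> bool:
    """Decide, before execution, whether a tool call may proceed."""
    if tool in SAFE_TOOLS:
        return True
    if tool in SENSITIVE_TOOLS:
        return approved_by_human  # requires explicit authorization
    return False  # deny-by-default for anything unrecognized
```

Deny-by-default is the important design choice here: new or unexpected tools never run until someone deliberately classifies them.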


The art and science of grounding

Hallucinations have been a problem since the earliest generative systems. Since these models are optimized to predict plausible continuations, they can produce confident-sounding but false statements.


Retrieval-augmented generation (RAG) has been a practical countermeasure for several years: instead of relying solely on the model’s internal parameters, systems retrieve external sources and condition generation on that evidence. The next iteration goes further by improving retrieval quality and, in some approaches, training retriever and generator more tightly together so the system learns what evidence it needs and how to use it.


A critical sub-area is RAG for safety-critical contexts, where systems are designed to answer only from approved knowledge bases, include traceability (citations/provenance), and refuse to answer when supporting evidence isn’t available.
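A minimal sketch of that pattern, assuming a tiny approved knowledge base and word-overlap retrieval (real systems use vector search and trained rerankers; the document IDs and threshold are made up): answer only with a citation, and refuse when no evidence clears the bar.

```python
# Hypothetical safety-critical RAG: answer only from an approved KB,
# attach provenance, and refuse when retrieval finds no support.
APPROVED_KB = {
    "doc-001": "Metformin is a first-line treatment for type 2 diabetes.",
    "doc-002": "RAG conditions generation on retrieved external evidence.",
}

def retrieve(query: str, min_overlap: int = 2):
    """Toy retriever: pick the doc with the most shared words."""
    q_words = set(query.lower().split())
    best_id, best_score = None, 0
    for doc_id, text in APPROVED_KB.items():
        score = len(q_words & set(text.lower().split()))
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id if best_score >= min_overlap else None

def grounded_answer(query: str) -> str:
    """Answer with a citation, or refuse when evidence is missing."""
    doc_id = retrieve(query)
    if doc_id is None:
        return "I can't answer that from the approved sources."
    return f"{APPROVED_KB[doc_id]} [source: {doc_id}]"
```

The refusal branch is the safety feature: the system prefers saying nothing over saying something unsupported.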


Human oversight

There is a spectrum of human oversight designs. In some workflows, humans are in the loop (approving decisions before execution). In others, humans are on the loop (monitoring and intervening when something seems wrong). A practical compromise is risk-based routing: the system escalates to a human when confidence is low, uncertainty is high, or a risk score crosses a threshold.
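Risk-based routing reduces to a small decision rule. The thresholds below are invented for illustration; in practice they would be tuned against measured error rates:

```python
# Hypothetical risk-based router: escalate to a human when confidence
# is low or risk is high; otherwise proceed automatically.
def route(decision: str, confidence: float, risk: float,
          conf_floor: float = 0.8, risk_ceiling: float = 0.3) -> str:
    """Return the disposition for a proposed automated decision."""
    if confidence < conf_floor or risk > risk_ceiling:
        return f"ESCALATE to human: {decision}"
    return f"AUTO-APPROVE: {decision}"
```

This keeps human attention concentrated on the cases where it matters most, which is the "safety per unit effort" tradeoff described above.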


The key design question is not “do we use humans?” but “where do humans add the most safety per unit effort?”—especially for systems operating at scale.


Alignment through post-training

Post-training is a major battleground for guardrails research: can we improve safety by shaping model behavior after pretraining?


The classic approach is RLHF (reinforcement learning from human feedback). More recently, preference-optimization methods such as Direct Preference Optimization (DPO) have become popular because they can align models using pairs of preferred vs. dispreferred responses. A newer family of methods, Kahneman-Tversky Optimization (KTO), uses simpler feedback signals (e.g., binary good/bad labels) to shape behavior. The key point: post-training can improve refusal behavior, reduce harmful outputs, and steer models toward desired policies, but it can also introduce tradeoffs, like over-refusal or brittleness under distribution shift.
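For a concrete reference point, the core of the DPO objective fits in a few lines: the loss shrinks as the policy widens the margin between the preferred and dispreferred responses, relative to a frozen reference model. This is a schematic of the loss only (the β value and log-probabilities are placeholders), not a training recipe:

```python
import math

# Schematic DPO loss for one preference pair. Inputs are log-probs of
# the chosen/rejected responses under the policy and the reference model.
def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin); lower is better for the policy."""
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference on both responses, the margin is zero and the loss is log 2; favoring the chosen response drives the loss down.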


Handcuffing agents

Everyone is talking about agents. In practice, we often see not one agent but multiple agents operating with tools—sometimes connected to external services (for example, via Model Context Protocol (MCP) servers). This multiplies risk: tool misuse, prompt injection, data leakage, and unintended actions become more likely when systems can take steps in the world.


Guardrails research here is still catching up. Promising directions include: (1) stress-testing agent behavior to ensure safety is prioritized over goal completion; (2) secondary auditing (a “shadow” agent or policy engine evaluating tool calls and plans); and (3) modeling safe behavior across long action sequences, not just single-turn outputs.
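Direction (2) above can be sketched as a simple policy engine that reviews an agent's planned tool calls before execution. The rules and plan format here are toy examples of my own, not any vendor's API:

```python
# Hypothetical "shadow" auditor: inspect each planned tool call and
# split the plan into approved and rejected steps before anything runs.
def audit_plan(plan):
    """Return (approved_steps, rejected_steps) for a list of tool calls."""
    approved, rejected = [], []
    for step in plan:
        tool, arg = step["tool"], step["arg"]
        if tool == "shell" and ("rm " in arg or "sudo" in arg):
            rejected.append(step)   # destructive shell commands blocked
        elif tool == "http" and not arg.startswith("https://"):
            rejected.append(step)   # insecure endpoints blocked
        else:
            approved.append(step)
    return approved, rejected
```

Because the auditor sees the whole plan, it can veto an unsafe step without depending on the agent itself to flag the risk.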


A governance nightmare

Recent discussions also point to a governance gap: capabilities are advancing faster than transparency and safety disclosures. For example, MIT CSAIL’s AI Agent Index, released this month, evaluates the safety posture and disclosure practices of a set of deployed agents, and its findings are troubling, to say the least. The report coins the term “safety washing”: in essence, hand-waving about safety with high-level, meaningless assertions and no empirical evaluations to support the claims.


Alongside research, practical governance guidance—such as NIST’s work on AI risk management—aims to standardize how we assess, test, and monitor systems.


One thing is clear: even as AI capabilities improve rapidly, autonomous deployment in high-stakes settings remains risky. Red-teaming, continuous monitoring for drift, incident response, and auditable logs are essential for operational safety. In an environment where monetization incentives are strong and regulation is uneven, the trajectory without robust guardrails is unlikely to end well.


Acknowledgement: Research support for this blog came from ChatGPT, Gemini, and Claude.
