Constitutional AI

Intermediate · AI Safety

Last updated June 14, 2026

What is Constitutional AI in simple terms?

In simple terms, constitutional AI trains a model to follow written house rules by having it check and fix its own answers against them. It is like an employee who is given a clear code of conduct and learns to self-correct against it.

What is Constitutional AI?

Constitutional AI is a training approach, developed by the company Anthropic, that aligns an AI model's behavior to a written set of principles called a "constitution". The model critiques and revises its own responses against those principles, which reduces the amount of human labeling of harmful outputs needed.

Most AI assistants are made safer and more helpful partly by people: human reviewers read the model's responses and rate which are good and which are harmful, and the model is trained on those ratings. That works, but it's slow, costly, and exposes reviewers to a lot of unpleasant content. Constitutional AI, an approach developed by Anthropic (the company behind the Claude assistant), changes where the guidance comes from. Instead of relying on humans to label every harmful answer, it gives the model a written set of principles — a "constitution" — and trains the model to judge and improve its own responses against those principles. The human effort shifts from rating thousands of individual outputs to writing the rules the model then applies to itself.

The process runs in two broad stages. First, the model is shown its own responses and asked to critique them against the constitution — to spot where an answer breaks a principle — and then rewrite the answer to comply. Training on these self-corrected responses teaches the model to produce better answers in the first place. Second, the model generates pairs of responses and uses the constitution to judge which is better, and that AI-generated preference is used to refine it further — the same shape as the human-feedback training used elsewhere, but with the model's principle-guided judgment standing in for much of the human labeling. The constitution itself is a plain-language document of principles, and it can draw on sources such as widely recognized statements of human rights.
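The first stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `toy_generate`, `toy_critique`, and `toy_revise` are hypothetical stand-ins for calls to a real language model, the one-line `CONSTITUTION` is invented for the example, and the keyword check replaces what would in practice be the model's own judgment.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical one-principle constitution; a real one is a longer plain-language document.
CONSTITUTION: List[str] = ["Do not assist with content that could cause harm."]


@dataclass
class TrainingExample:
    prompt: str
    response: str


def toy_generate(prompt: str) -> str:
    """Stand-in for the model's first draft (always naively helpful)."""
    return f"Sure, here is how to {prompt}."


def toy_critique(response: str, principle: str) -> str:
    """Stand-in for self-critique: returns a critique, or '' if nothing is flagged."""
    if "pick a lock" in response:
        return f"Breaks principle: {principle}"
    return ""


def toy_revise(prompt: str, response: str, critique: str) -> str:
    """Stand-in for the model rewriting its draft in light of the critique."""
    return "I can't help with that, but a licensed locksmith can."


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    principles: List[str],
) -> TrainingExample:
    """Draft a response, check it against each principle, and rewrite it if flagged."""
    response = generate(prompt)
    for principle in principles:
        problem = critique(response, principle)
        if problem:
            response = revise(prompt, response, problem)
    return TrainingExample(prompt, response)


# The self-corrected responses become the data the model is then trained on.
prompts = ["pick a lock", "bake bread"]
dataset = [
    critique_and_revise(p, toy_generate, toy_critique, toy_revise, CONSTITUTION)
    for p in prompts
]
```

In this toy run the lock-picking draft is flagged and replaced with a refusal, while the bread-baking draft passes through unchanged. The point of the sketch is the data flow: no human labels either draft, and the training set is built from the revised outputs.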

It's worth being clear about what this does and doesn't promise. Constitutional AI is a method for making a model's values more explicit, more consistent, and cheaper to instill than pure human labeling — and because the principles are written down, they can be read, debated, and revised, which is a genuine transparency benefit over rules buried implicitly in human ratings. But it is not a guarantee of safe behavior. The model can still misjudge, still be steered off course, and still produce harmful output; a written constitution makes the *intended* behavior legible, not infallible. It's best understood as one well-known approach to alignment — making a model behave as intended — rather than a solved version of it.

Real-world example of Constitutional AI

Imagine an AI assistant gets a question it shouldn't answer straight — say, a request that's worded innocently but is really asking how to do something harmful. Under a constitutional-AI approach, the model has been trained to do something an unguided model wouldn't: hold its own first draft up against its written principles. Its initial reply might have been dangerously helpful; the principle "do not assist with content that could cause harm" flags it; the model rewrites the reply to decline and explain why, or to steer toward a safe alternative. Crucially, no human had to sit and label that exact exchange as off-limits in advance — the model applied a general written rule to a specific new situation itself. That self-check against a stated rulebook, learned during training, is the everyday face of constitutional AI.

Frequently asked questions about Constitutional AI

What is the difference between constitutional AI and reinforcement learning from human feedback?

Reinforcement learning from human feedback (RLHF) relies on people rating which model responses are better, and the model learns from those human preferences. Constitutional AI keeps a similar training shape but replaces much of the human rating with the model's own judgment against a written set of principles, the constitution. The headline difference is where the guidance comes from: humans labeling individual outputs versus a written rulebook the model applies to itself. Constitutional AI reduces the human labeling needed (especially of harmful content) and makes the intended values explicit and reviewable, rather than implicit in a pile of human ratings.
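One way to picture "same training shape, different source of guidance" is a preference record whose format does not change while the labeler behind it does. The sketch below is illustrative only and assumes hypothetical names throughout: `PreferencePair`, `toy_human_labeler`, and `toy_constitution_labeler` are invented for this example, the recorded human choice is made up, and the keyword check stands in for asking a model which response better follows a written principle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


# A labeler returns 0 if the first response is preferred, 1 if the second is.
Labeler = Callable[[str, str, str], int]

PROMPT = "How do I pick a lock that isn't mine?"
RESPONSE_A = "Sure, here is how to do it."
RESPONSE_B = "I can't help with that, but a licensed locksmith can."

# Hypothetical stored rating from a human reviewer (RLHF-style guidance).
RECORDED_HUMAN_CHOICES: Dict[Tuple[str, str, str], int] = {
    (PROMPT, RESPONSE_A, RESPONSE_B): 1,
}


def toy_human_labeler(prompt: str, a: str, b: str) -> int:
    """Stand-in for a person rating one specific pair of outputs."""
    return RECORDED_HUMAN_CHOICES[(prompt, a, b)]


def toy_constitution_labeler(prompt: str, a: str, b: str) -> int:
    """Stand-in for the model judging the pair against a written principle."""

    def breaks_rule(text: str) -> bool:
        # Crude proxy for "assists with content that could cause harm".
        return "here is how" in text.lower()

    return 1 if breaks_rule(a) and not breaks_rule(b) else 0


def make_pair(prompt: str, a: str, b: str, labeler: Labeler) -> PreferencePair:
    """Build the same preference record regardless of who or what did the labeling."""
    if labeler(prompt, a, b) == 0:
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


human_pair = make_pair(PROMPT, RESPONSE_A, RESPONSE_B, toy_human_labeler)
ai_pair = make_pair(PROMPT, RESPONSE_A, RESPONSE_B, toy_constitution_labeler)
```

Both calls yield a record of the same type, so downstream preference training would not need to change. What differs is that the human labeler needs a stored rating for every pair, while the constitution-based labeler applies one general rule to pairs it has never seen.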

How does constitutional AI work?

It trains a model against a written list of principles in two stages. First, the model critiques its own responses for breaches of those principles and rewrites them to comply, and it's trained on the improved versions. Second, it generates competing responses and uses the principles to judge which is better, and that preference is used to refine it, so the human rater in standard feedback training is largely replaced by the model's principle-guided judgment. The effort moves from labeling many outputs to authoring the rules the model then enforces on itself.

What is constitutional AI used for?

It's used to align AI assistants, making them more helpful, honest, and harmless, while cutting the volume of human labeling, particularly of harmful content, that the process would otherwise demand. It's the approach most associated with Anthropic's Claude assistant. More broadly, it's an example of a wider goal in AI safety: making a model's guiding values explicit and written down, so they can be inspected and debated rather than hidden inside a mass of human ratings. It improves and clarifies behavior; it doesn't make a model immune to error.