Reinforcement Learning from Human Feedback (RLHF) aligns AI through human preferences on outputs. Semantic governance asks: what if intent were explicit and inspectable from the start?
Before we compare, let's acknowledge the breakthrough.
RLHF proved that AI systems could be steered toward human preferences without explicitly programming every rule. The breakthrough, demonstrated in OpenAI's InstructGPT paper (2022), was using preference comparisons—"I prefer response A over B"—to train a reward model that shapes model behavior.
This enabled the current generation of helpful, harmless, and honest AI assistants. RLHF showed that alignment was tractable at scale. Before RLHF, the dominant paradigm was either rule-based systems (brittle) or pure supervised learning (limited by demonstration data).
The key insight: Humans are better at comparing outputs than specifying what they want. RLHF leverages this asymmetry—we can recognize good behavior even when we can't describe it in advance.
RLHF optimizes for preferences. Semantic governance makes intent explicit.
Collect human preferences between outputs, train a reward model to predict preferences, then fine-tune the policy to maximize predicted reward.
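The preference-to-reward step above can be sketched with a toy linear reward model fit to preference pairs via the Bradley-Terry model. This is a hypothetical illustration, not any production RLHF stack: the features, weights, and training loop are invented for the example.

```python
import math

def reward(w, x):
    """Toy linear reward model: score = dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pref_prob(w, preferred, rejected):
    """Bradley-Terry probability that `preferred` beats `rejected`."""
    return 1.0 / (1.0 + math.exp(reward(w, rejected) - reward(w, preferred)))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Gradient ascent on the log-likelihood of the observed preferences."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            p = pref_prob(w, preferred, rejected)
            # d(log p)/dw = (1 - p) * (preferred_features - rejected_features)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return w

# Invented toy data: annotators prefer outputs with a higher first feature.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(pairs, dim=2)
```

In a full RLHF pipeline, the fitted reward model would then guide policy fine-tuning (e.g. with a policy-gradient method such as PPO); that step is omitted here.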
Express intent as explicit semantic artifacts. Trace how intent flows through delegation. Audit whether actions align with declared purposes.
Each approach solved a real problem. Semantic governance addresses gaps the others couldn't.
- **Rule-based systems:** Hardcode rules → AI follows exactly → Breaks on edge cases. Limitation: can't anticipate every situation.
- **RLHF:** Collect preferences → Train reward model → Optimize policy. Limitation: values are implicit; we can't inspect what was learned.
- **Semantic governance:** Specify intent → Encode as artifact → Trace execution. Limitation: requires explicit intent architecture upfront.
RLHF works—but we can't explain why it works in any particular case.
RLHF encodes values implicitly in model weights through preference optimization. When the model makes a decision, we can observe the output but cannot inspect the reasoning chain that led there. This creates accountability gaps:
- Conflicting values are averaged into a single policy, obscuring trade-offs that stakeholders should decide explicitly.
- As models are fine-tuned over time, original intentions can shift without anyone detecting the change.
- Regulators cannot verify that claimed values are actually encoded in the model's behavior.
A key finding from RLHF research is the "alignment tax"—the capability cost of making models behave well. OpenAI's InstructGPT paper showed that RLHF models sometimes performed worse on pure capability benchmarks while being more helpful.
This creates a tension: organizations optimizing for capability may underinvest in alignment. And because RLHF values are implicit, we can't easily verify whether the "tax" is being paid.
Semantic governance reframes this:
Instead of a capability-alignment trade-off that's invisible in the weights, make the trade-off explicit in intent specifications. Then stakeholders can decide—transparently—how much capability they're willing to sacrifice for what kind of alignment.
Instead of hoping values emerge from preference data, semantic governance makes intent a first-class object:
- Intent encoded as inspectable objects with provenance
- Trace any decision back to its source intent
- Value tensions made visible, not hidden in averages
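A minimal sketch of what such a first-class intent object might look like. The `Intent` fields, the registry, and the `trace` helper are all hypothetical names invented for illustration, not an API from any particular framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Intent:
    """A hypothetical intent artifact: explicit, inspectable, with provenance."""
    id: str
    purpose: str
    source: str                      # who declared this intent (provenance)
    parent_id: Optional[str] = None  # delegation link back toward the root

def trace(intent_id: str, registry: Dict[str, Intent]) -> List[Intent]:
    """Walk the delegation chain from a leaf intent back to its root."""
    chain: List[Intent] = []
    current: Optional[Intent] = registry[intent_id]
    while current is not None:
        chain.append(current)
        current = registry.get(current.parent_id) if current.parent_id else None
    return chain

# Example: a board-level intent delegated down to an operational task.
registry = {
    "root": Intent("root", "serve customers safely", source="board"),
    "task": Intent("task", "answer support tickets", source="ops team",
                   parent_id="root"),
}
chain = trace("task", registry)  # leaf first, root last
```

Because each artifact records its source and parent, any decision taken under "task" can be audited back to the board-level purpose that authorized it.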
| Feature | RLHF | Semantic Gov |
|---|---|---|
| Core Mechanism | Human feedback on outputs | Explicit intent specification |
| Value Representation | Implicit in preference data | Explicit, inspectable artifacts |
| Auditability | Black box preferences | Transparent intent chains |
| Scalability | Expensive human annotation | Reusable intent schemas |
| Interpretability | Emergent, hard to explain | Designed, traceable |
| Conflict Handling | Averaged in training | Surfaced explicitly |
| Adaptability | Requires retraining | Runtime updates |
| Implementation Maturity | Widely deployed | Emerging framework |
These aren't competing approaches. They serve different purposes and can work together.
In practice, these approaches can layer. Use semantic governance to define high-level organizational intent and constraints. Apply RLHF within those constraints to capture nuanced preferences. The semantic layer provides auditability; the RLHF layer provides adaptability. Together, they offer more robust alignment than either alone.
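The layering described above can be sketched as a two-stage filter: hard semantic constraints first, preference-based ranking second. The constraint rules, candidate actions, and reward function below are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional

def allowed(action: str, constraints: List[Callable[[str], bool]]) -> bool:
    """Semantic layer: every declared constraint must hold for the action."""
    return all(rule(action) for rule in constraints)

def choose(candidates: List[str],
           constraints: List[Callable[[str], bool]],
           reward_fn: Callable[[str], float]) -> Optional[str]:
    """RLHF layer: rank only the candidates the semantic layer permits."""
    permitted = [a for a in candidates if allowed(a, constraints)]
    return max(permitted, key=reward_fn) if permitted else None

# Hypothetical declared constraint: never act on personal data.
constraints = [lambda a: "personal data" not in a]
candidates = ["share personal data report",
              "send anonymized summary",
              "do nothing"]
# Stand-in preference score; a real system would use a learned reward model.
best = choose(candidates, constraints, reward_fn=len)
```

The semantic layer is auditable (each rule maps to a declared intent), while the reward function can be retrained freely without ever being able to override a constraint.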
IRSA's work on semantic governance draws on research in value alignment, interpretability, and institutional design. The core insight is that alignment problems often stem from a failure to represent intent explicitly—whether in AI systems, organizations, or capital structures. Making intent a first-class object creates new possibilities for accountability and adaptation.