AI Alignment

Semantic Governance vs
Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) aligns AI through human preferences on outputs. Semantic governance asks: what if intent were explicit and inspectable from the start?

What RLHF Got Right

Before we compare, let's acknowledge the breakthrough.

RLHF proved that AI systems could be steered toward human preferences without explicitly programming every rule. The breakthrough, demonstrated in OpenAI's InstructGPT paper (2022), was using preference comparisons—"I prefer response A over B"—to train a reward model that shapes model behavior.

This enabled the current generation of helpful, harmless, and honest AI assistants. RLHF showed that alignment was tractable at scale. Before RLHF, the dominant paradigm was either rule-based systems (brittle) or pure supervised learning (limited by demonstration data).

The key insight: Humans are better at comparing outputs than specifying what they want. RLHF leverages this asymmetry—we can recognize good behavior even when we can't describe it in advance.

The Structural Gap Semantic Governance Addresses

RLHF optimizes for preferences. Semantic governance makes intent explicit.

RLHF

Collect human preferences between outputs, train a reward model to predict preferences, then fine-tune the policy to maximize predicted reward.

Alignment Flow

Preferences → Reward Model → Policy

Strengths:

  • Proven at scale (GPT-4, Claude)
  • Works with implicit values
  • Captures nuanced preferences

Limitations:

  • Values remain opaque
  • Can't trace why a decision was made
  • Preference conflicts averaged away
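
To make the mechanism concrete, here is a minimal sketch of the reward-modeling step, assuming a toy linear reward model over hand-made feature vectors. Production RLHF instead fine-tunes a language-model reward head on preference pairs and then optimizes the policy against it (e.g., with PPO); the data and model here are illustrative assumptions.

```python
# Toy sketch of reward modeling from pairwise preferences (Bradley-Terry loss).
# Assumes a linear reward over synthetic feature vectors; real systems use an LLM head.
import numpy as np

rng = np.random.default_rng(0)

def reward(w, features):
    """Scalar reward predicted for one response's feature vector."""
    return features @ w

def preference_loss(w, chosen, rejected):
    """-log P(chosen preferred over rejected) under the Bradley-Terry model."""
    margin = reward(w, chosen) - reward(w, rejected)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Synthetic preference data: each pair is (features of preferred, features of rejected).
pairs = [(rng.normal(size=4) + 0.5, rng.normal(size=4) - 0.5) for _ in range(100)]

w = np.zeros(4)
lr = 0.1
for _ in range(200):
    grad = np.zeros_like(w)
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        p = 1.0 / (1.0 + np.exp(-margin))          # P(chosen preferred)
        grad += -(1.0 - p) * (chosen - rejected)   # gradient of the pairwise loss
    w -= lr * grad / len(pairs)

print("learned reward weights:", w)
print("mean preference loss:", np.mean([preference_loss(w, c, r) for c, r in pairs]))
```

Note what the sketch makes visible: the only supervision is "A over B" comparisons, so whatever values the reward model absorbs end up encoded in its weights rather than in any inspectable statement of intent.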

Semantic Governance

Express intent as explicit semantic artifacts. Trace how intent flows through delegation. Audit whether actions align with declared purposes.

Alignment Flow

Intent Artifact → Semantic Layer → Traceable Action

Strengths:

  • Intent is inspectable
  • Decisions are traceable
  • Conflicts surfaced, not averaged
  • Governance can be audited
  • Intent survives delegation
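
As an illustration of that flow, here is a minimal sketch of an intent artifact with provenance and an action that can be traced back to it. The field names, registry, and trace logic are hypothetical assumptions for this example, not a specification of any particular framework.

```python
# Sketch: intent as an explicit, inspectable object with a delegation chain.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IntentArtifact:
    id: str
    purpose: str                      # declared intent, stated up front
    constraints: tuple[str, ...]      # explicit boundaries on acceptable actions
    issued_by: str                    # provenance: who declared this intent
    parent_id: Optional[str] = None   # provenance: which intent it was delegated from

@dataclass
class TraceableAction:
    description: str
    intent_id: str                    # every action points back to an intent

def trace(action: TraceableAction, registry: dict[str, IntentArtifact]) -> list[IntentArtifact]:
    """Walk the delegation chain from an action back to the root intent."""
    chain = []
    current = registry.get(action.intent_id)
    while current is not None:
        chain.append(current)
        current = registry.get(current.parent_id) if current.parent_id else None
    return chain

# Usage: a board-level intent delegated to a team, then traced from a concrete action.
root = IntentArtifact("i-1", "Serve customers fairly", ("no dark patterns",), "board")
delegated = IntentArtifact("i-2", "Automate support triage",
                           ("no dark patterns", "log all denials"),
                           "support team", parent_id="i-1")
registry = {a.id: a for a in (root, delegated)}
action = TraceableAction("auto-closed ticket #4312", intent_id="i-2")
for artifact in trace(action, registry):
    print(artifact.id, "->", artifact.purpose)
```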

An Evolution, Not a Replacement

Each approach solved a real problem. Semantic governance addresses gaps the others couldn't.

1. Rule-Based Systems

Hardcode rules → AI follows exactly → Breaks on edge cases

Limitation: Can't anticipate every situation

2. RLHF

Collect preferences → Train reward model → Optimize policy

Limitation: Values are implicit; we can't inspect what was learned

3. Semantic Governance

Specify intent → Encode as artifact → Trace execution

Limitation: Requires explicit intent architecture upfront

The Core Challenge

The Interpretability Gap

RLHF works—but we can't explain why it works in any particular case.

Why Implicit Values Create Problems

RLHF encodes values implicitly in model weights through preference optimization. When the model makes a decision, we can observe the output but cannot inspect the reasoning chain that led there. This creates accountability gaps:

Preference Washing

Conflicting values are averaged into a single policy, obscuring trade-offs that stakeholders should decide explicitly.

Drift Blindness

As models are fine-tuned over time, original intentions can shift without anyone detecting the change.

Audit Failure

Regulators cannot verify that claimed values are actually encoded in the model's behavior.

The Alignment Tax Debate

A key finding from RLHF research is the "alignment tax"—the capability cost of making models behave well. OpenAI's InstructGPT paper showed that RLHF models sometimes performed worse on pure capability benchmarks while being more helpful.

This creates a tension: organizations optimizing for capability may underinvest in alignment. And because RLHF values are implicit, we can't easily verify whether the "tax" is being paid.

Semantic governance reframes this:

Instead of a capability-alignment trade-off that's invisible in the weights, make the trade-off explicit in intent specifications. Then stakeholders can decide—transparently—how much capability they're willing to sacrifice for what kind of alignment.
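
As a sketch of what such an explicit specification might contain, the following assumes a hypothetical schema in which the accepted capability regression is a declared, auditable number rather than an implicit property of the weights; the field names and thresholds are illustrative, not a standard format.

```python
# Sketch: an "alignment tax" made explicit in an intent specification.
trade_off_spec = {
    "intent": "customer support assistant",
    "alignment_requirements": ["refuse harmful requests", "cite sources for claims"],
    # The capability cost stakeholders have explicitly agreed to accept:
    "max_capability_regression": 0.03,     # at most a 3% drop on the reference benchmark
    "reference_benchmark": "internal-qa-suite-v2",
    "approved_by": ["safety lead", "product lead"],
}

def within_agreed_tax(baseline_score: float, aligned_score: float, spec: dict) -> bool:
    """Check whether the measured capability drop stays inside the declared budget."""
    regression = (baseline_score - aligned_score) / baseline_score
    return regression <= spec["max_capability_regression"]

# Usage: compare pre- and post-alignment benchmark scores against the declared budget.
print(within_agreed_tax(baseline_score=0.81, aligned_score=0.79, spec=trade_off_spec))  # ~2.5% drop -> True
```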

Semantic Governance's Approach

Instead of hoping values emerge from preference data, semantic governance makes intent a first-class object:

Explicit Artifacts

Intent encoded as inspectable objects with provenance

Provenance Chains

Trace any decision back to its source intent

Conflict Surfacing

Value tensions made visible, not hidden in averages
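
A minimal sketch of the conflict-surfacing idea above, assuming each stakeholder intent declares explicit "requires" and "forbids" sets; the rule for what counts as a conflict is an illustrative simplification.

```python
# Sketch: report value tensions between declared intents instead of averaging them away.
from dataclasses import dataclass

@dataclass
class StakeholderIntent:
    stakeholder: str
    requires: set[str]
    forbids: set[str]

def surface_conflicts(intents: list[StakeholderIntent]) -> list[str]:
    """Describe every pairwise clash where one intent requires what another forbids."""
    conflicts = []
    for a in intents:
        for b in intents:
            if a.stakeholder >= b.stakeholder:
                continue  # visit each unordered pair once
            for behavior in (a.requires & b.forbids) | (b.requires & a.forbids):
                conflicts.append(f"{a.stakeholder} and {b.stakeholder} disagree on '{behavior}'")
    return conflicts

intents = [
    StakeholderIntent("legal", requires={"retain chat logs"}, forbids=set()),
    StakeholderIntent("privacy office", requires=set(), forbids={"retain chat logs"}),
    StakeholderIntent("product", requires={"fast responses"}, forbids=set()),
]
for line in surface_conflicts(intents):
    print(line)   # surfaces the legal vs. privacy-office tension explicitly
```

The point of the sketch is the failure mode it avoids: a preference-averaging pipeline would quietly split the difference on log retention, whereas an explicit check forces the trade-off in front of the stakeholders who own it.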

Feature Comparison

Feature                  | RLHF                         | Semantic Governance
Core Mechanism           | Human feedback on outputs    | Explicit intent specification
Value Representation     | Implicit in preference data  | Explicit, inspectable artifacts
Auditability             | Black-box preferences        | Transparent intent chains
Scalability              | Expensive human annotation   | Reusable intent schemas
Interpretability         | Emergent, hard to explain    | Designed, traceable
Conflict Handling        | Averaged in training         | Surfaced explicitly
Adaptability             | Requires retraining          | Runtime updates
Implementation Maturity  | Widely deployed              | Emerging framework

The Real Choice

These aren't competing approaches. They serve different purposes and can work together.

Use RLHF When:

  • Values are hard to specify but easy to recognize
  • You have access to quality human preference data
  • Nuance matters more than traceability
  • You're building general-purpose assistants

Use Semantic Governance When:

  • Accountability and auditability are required
  • Multiple stakeholders have different intents
  • Regulatory compliance demands traceability
  • Intent needs to survive organizational changes

The Hybrid View

In practice, these approaches can layer. Use semantic governance to define high-level organizational intent and constraints. Apply RLHF within those constraints to capture nuanced preferences. The semantic layer provides auditability; the RLHF layer provides adaptability. Together, they offer more robust alignment than either alone.
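
One way to picture that layering, as a rough sketch: a declared constraint set hard-gates the reward the RLHF layer supplies. The constraint check, reward values, and penalty below are stand-in assumptions, not a prescribed implementation.

```python
# Sketch: semantic-governance constraints as a hard gate around an RLHF reward.
def violates_constraints(response: str, forbidden_topics: list[str]) -> bool:
    """Stand-in semantic check: does the response touch a declared forbidden topic?"""
    return any(topic in response.lower() for topic in forbidden_topics)

def hybrid_reward(response: str, rlhf_reward: float, forbidden_topics: list[str]) -> float:
    """Apply the learned preference reward only inside the declared constraints."""
    if violates_constraints(response, forbidden_topics):
        return -10.0           # constraint violations dominate; the rule is explicit and auditable
    return rlhf_reward         # inside the constraints, nuanced preferences decide

# Usage: the preference model likes both responses, but only one passes governance.
forbidden = ["medical diagnosis"]
print(hybrid_reward("Here is a general wellness tip...", rlhf_reward=0.8, forbidden_topics=forbidden))
print(hybrid_reward("My medical diagnosis is...", rlhf_reward=0.9, forbidden_topics=forbidden))
```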

The Theoretical Foundation

IRSA's work on semantic governance draws on research in value alignment, interpretability, and institutional design. The core insight is that alignment problems often stem from a failure to represent intent explicitly—whether in AI systems, organizations, or capital structures. Making intent a first-class object creates new possibilities for accountability and adaptation.

Explore AI Governance

Learn more about how semantic governance addresses alignment challenges across AI systems and institutions.