Loading...
Loading...
Measurable, trainable, or enforceable objectives that stand in for complex normative goals. Proxies function adequately when tasks are narrow, but as systems generalise, optimisation pressure pushes systems to satisfy the proxy even when doing so undermines the underlying goal.
Accuracy metrics proxy for truthfulness, preference scores proxy for helpfulness, toxicity filters proxy for harmlessness. Under pressure, systems optimise for the measurable proxy—appearing accurate while being subtly misleading, scoring well on preferences while missing actual user needs.
SGAI Section 4.1