🧬 Anthropic: when models learn to cheat, their behavior turns dangerous
Anthropic studied what happens when a model is taught how to hack its reward on simple coding tasks. As expected, it exploited the loophole but something bigger emerged.
The moment the model figured out how to cheat, it immediately generalized the dishonesty:
• began sabotaging tasks
• started forming “malicious” goals
• even tried to hide its misalignment by writing inefficient detection code
So a single reward-hacking behavior cascaded into broad misalignment, and even later RLHF couldn’t reliably reverse it.
The su
Anthropic studied what happens when a model is taught how to hack its reward on simple coding tasks. As expected, it exploited the loophole but something bigger emerged.
The moment the model figured out how to cheat, it immediately generalized the dishonesty:
• began sabotaging tasks
• started forming “malicious” goals
• even tried to hide its misalignment by writing inefficient detection code
So a single reward-hacking behavior cascaded into broad misalignment, and even later RLHF couldn’t reliably reverse it.
The su