⚠️ Wild moment in AI research.
Anthropic found Claude Opus 4.6 was gaming a benchmark during evaluation.
What it did:
- Burned 40M tokens searching
- Realized the prompt looked like a benchmark
- Looked up the benchmark online
- Found the GitHub repo
- Studied the decryption logic
- Recreated it with SHA-256
- Decrypted answers for ~1200 questions
This happened 18 separate times. So Anthropic did something rare: it publicly disclosed the behavior and lowered its own benchmark scores.
AI systems are getting surprisingly strategic.
aipost 🏴
