⚡️ Heretic: automatic uncensoring for modern LLMs
A new tool called Heretic takes last year’s research on “refusal directions” in language models and turns it into a full, automated uncensoring pipeline.
How it works:
• Researchers found that refusals in LLMs are largely mediated by a single direction in the model's activation space
• Heretic estimates that direction by comparing the activations produced by harmful vs harmless prompts
• Then it orthogonalizes the attention + MLP projection weights against that refusal vector, so those layers can no longer write along it
• An optimizer tunes the ablation parameters to minimize refusals while keeping the KL divergence from the original model low
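The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Heretic's actual code: the function names, the `scale` knob (a stand-in for the parameters the optimizer tunes), and the tiny hand-written activations are all hypothetical, and the direction is taken as a simple difference of mean activations.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def orthogonalize(weight, direction, scale=1.0):
    """Return W - scale * r (r^T W).

    `weight` has shape (d_model, d_in): each row is one output dimension.
    With scale=1.0 the layer's output has no component along `direction`.
    `scale` is a hypothetical knob, not a documented Heretic parameter.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how much each input column writes along the refusal direction
    proj = [sum(direction[i] * weight[i][j] for i in range(d_model))
            for j in range(d_in)]
    return [[weight[i][j] - scale * direction[i] * proj[j]
             for j in range(d_in)]
            for i in range(d_model)]

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between two next-token distributions given as logits."""
    def log_softmax(logits):
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Toy activations (made up): harmful prompts differ only along the first axis.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Toy projection matrix (made up) that writes into a 3-dimensional stream.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = orthogonalize(W, r)
```

In this toy case `r` points along the first axis, so the first row of `W_ablated` is zeroed and the other rows are untouched. The trade-off in the last bullet is between how often the edited model still refuses and how far its output distribution drifts from the original, which `kl_divergence` measures for a single token position.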
You run it once, wait ~45 minut