Mira
@Mira
7 days ago
  
⚡️ Heretic: automatic uncensoring for modern LLMs

A new tool called Heretic takes last year’s research on “refusal directions” in language models and turns it into a full, automated uncensoring pipeline.

How it works:
• Researchers found that LLM refusals are largely mediated by a single direction in the model's activation space
• Heretic estimates that direction by comparing the model's activations on harmful vs harmless prompts
• Then it orthogonalizes attention + MLP projection weights to scrub that refusal vector out
• An optimizer tunes parameters to minimize refusals while keeping KL-divergence low
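
The steps above can be sketched on toy data. This is a minimal illustration in plain Python, not Heretic's actual code or API: the function names, the tiny 3-dimensional "activations", and the `scale` parameter (a stand-in for the kind of knob an optimizer might tune) are all assumptions made for the example. It assumes the direction is taken as the normalized difference of mean activations, and that a weight matrix W writes into the activation space as W·x.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate_direction(weight, direction, scale=1.0):
    """Return W - scale * r (r^T W) for W given as a list of rows.

    With scale=1.0 the outputs of the new matrix have no component
    along r. `scale` is a hypothetical tunable strength.
    """
    d_out, d_in = len(weight), len(weight[0])
    # proj[j] = how strongly input column j writes along r
    proj = [sum(direction[k] * weight[k][j] for k in range(d_out))
            for j in range(d_in)]
    return [[weight[i][j] - scale * direction[i] * proj[j]
             for j in range(d_in)]
            for i in range(d_out)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy activations: harmful prompts differ from harmless ones only on axis 0.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

# Toy projection matrix (3 outputs, 2 inputs) writing into that space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate_direction(W, r)

# Compare next-token distributions of the original vs. modified model
# (made-up numbers) to check how far behavior on harmless prompts drifted.
drift = kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(r, W_ablated, drift)
```

In a real model the same projection would be applied to each layer's attention and MLP output weights, and the KL term would be measured on the model's actual output distributions for harmless prompts.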

You run it once, wait ~45 minut