this post was submitted on 13 Jun 2024
[–] jet@hackertalks.com 3 points 1 year ago* (last edited 1 year ago)

refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests.

This makes sense. Large language models are basically a book of Mad Libs, and the safety rails that companies want to put on these released models are like a preamble and post-amble that you apply to the Mad Libs themselves. So if you're implementing your own Mad Libs engine, you simply don't apply the preamble and post-amble if you don't want to.

At its core, a released model is static: it is not dynamic and does not change over time. So if you want to, you can nullify its self-censorship.