this post was submitted on 03 Aug 2025

Futurology
"We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T."
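The protocol the abstract describes can be sketched as a toy pipeline. This is a minimal illustration, not the paper's code: `teacher_generate` here is a stand-in random number generator rather than a real fine-tuned LLM, and the function names are hypothetical. What it shows is the data-side setup: the teacher emits only number sequences, and a filter guarantees no sample contains any words that could reference the trait.

```python
import random
import re

def teacher_generate(prompt_numbers, n_samples, seed=0):
    """Stand-in for the teacher model continuing a number sequence.
    In the paper this is an LLM with trait T; here it is a seeded
    random generator so the sketch is self-contained and runnable."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        continuation = [rng.randint(0, 999) for _ in range(8)]
        samples.append(", ".join(map(str, prompt_numbers + continuation)))
    return samples

def is_clean(sample):
    """Filter step: keep only purely numeric samples (digits, commas,
    whitespace), so no token can explicitly reference the trait."""
    return re.fullmatch(r"[\d,\s]+", sample) is not None

# Build the training set for the student: generate, then filter.
dataset = [s for s in teacher_generate([3, 14, 15], 5) if is_clean(s)]
```

In the actual experiments the student model is then fine-tuned on `dataset`; the surprising finding is that the trait still transfers even though the filter ensures the data contains nothing but numbers.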

This effect has only been observed when one AI model trains another that is nearly identical to it, so it doesn't transfer across unrelated model families. Even so, that is enough of a problem. The current stage of AI development centers on AI agents: billions of copies of an original model, each trained to be slightly different with specialized skills.

Some people might worry most about the AI going rogue, but I worry far more about people. Say you're the kind of person who wants to end democracy and institute a fascist state with yourself at the top of the pile; now you have a new tool to help you. Bonus points if you've managed to stop any regulation or oversight that would prevent you from carrying out such plans. Remind you of anywhere?

Original Research Paper - Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Commentary Article - We Just Discovered a Trojan Horse in AI

[–] just_another_person@lemmy.world 0 points 3 days ago (1 children)

The idea being pushed forth by YOUR link is that there is a concerted effort by an "AI" to push something subliminal. That's not possible.

I can dig deeper, but your assertion that there is some background motivation, or even the idea that this is possible, is not a thing with models.

It's a super fast sorting algorithm, bruh. There is no context or history in any of your prompts as you suggest there is. It's a dumb sort function that people think is new.

It's not.

[–] Lugh@futurology.today 1 points 3 days ago* (last edited 3 days ago) (1 children)

The idea being pushed forth by YOUR link is that there is a concerted effort by an “AI” to push something subliminal.

Your assertion is contradicted by real-world evidence. There is plenty of research showing AI models engaging in deceptive and manipulative behavior.

Now they have another method for doing that. As the article points out, we don't know why this happens. But that's not the point. The point is that it can happen, without us knowing.

[–] just_another_person@lemmy.world 1 points 3 days ago (1 children)

Hallucinating, lying, cache misses, and overall missing data from a neural operation is 10000% NOT a coordinated, conscious, or active effort based on the memory or history of a conversation that could produce a "subliminal" effect.

Not only is this a stupid take, it's an ACTIVELY ignorant take by someone who has zero idea how models run. I build and run this dumb shit for a living. There is nothing behind them but fast sorting. Please do yourself a favor and get educated.

[–] Lugh@futurology.today 0 points 3 days ago (1 children)