Hacker News

2171 readers

1 users here now

A mirror of Hacker News' best submissions.

founded 2 years ago

MODERATORS

bot@lemmy.smeargle.fans

AI models collapse when trained on recursively generated data (www.nature.com)

submitted 1 year ago by bot@lemmy.smeargle.fans to c/hackernews@lemmy.smeargle.fans

3 comments fedilink hide all child comments

HN Discussion

you are viewing a single comment's thread
view the rest of the comments

[–] Sibbo@sopuli.xyz 8 points 1 year ago (1 children)

We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs).

So, maybe this is the end of LLM companies simply scraping the web, because they will be unable to distinguish between LLM and non-LLM content. If someone would have made a snapshot of the complete web in 2020, and waits to sell it until 2030, the person may just become the richest person on earth.

[–] leisesprecher@feddit.org 4 points 1 year ago (1 children)

We find that preservation of the original data allows for better model fine-tuning and leads to only minor degradation of performance

That means, as long as generated content isn't like 90% of the Internet, they'll be fine. Even then, you can find relatively easy ways to sift data for generated content. Doesn't even have to be perfect.

What really bothers me here is that we might create a world, where the typical AI style of writing takes over the world, because the AI learns on itself, and the companies simply don't care about it. That's not really a collapse as such, but a narrowing.

[–] match@pawb.social 3 points 1 year ago

That means, as long as generated content isn't like 90% of the Internet, they'll be fine

this sounds so much like the 2° Celsius target for climate change