this post was submitted on 02 Aug 2025
250 points (95.3% liked)

Not The Onion

[–] A_norny_mousse@feddit.org 10 points 20 hours ago* (last edited 20 hours ago) (4 children)

The title might be slightly hyperbolic, but still:

The quality of training data fed into the neural network directly impacts the resulting AI model's capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books to avoid what CEO Dario Amodei called "legal/practice/business slog"—the complex licensing negotiations with publishers. But by 2024, Anthropic had become "not so gung ho about" using pirated ebooks "for legal reasons" and needed a safer source.

Buying used physical books sidestepped licensing entirely while providing the high-quality, professionally edited text that AI models need, and destructive scanning was simply the fastest way to digitize millions of volumes. The company spent "many millions of dollars" on this buying and scanning operation, often purchasing used books in bulk. Next, they stripped books from bindings, cut pages to workable dimensions, scanned them as stacks of pages into PDFs with machine-readable text including covers, then discarded all the paper originals.

oof.

[–] jlow@discuss.tchncs.de 5 points 20 hours ago (2 children)

Is this for real? I can buy a book, scan it and put it on the internet and it wouldn't be piracy? Or is this just the usual "it's not a crime if rich people/evilcorps do it" bs?

[–] tburkhol@lemmy.world 11 points 19 hours ago (1 children)

Putting the scan on the internet intact would be piracy. Putting up snippets is mostly OK. Ingesting the scans of millions of books into a massive data set and then regurgitating pieces of the masticated, processed mess seems still to be a grey area, but closer to 'mostly OK' than to piracy.

[–] shalafi@lemmy.world 2 points 14 hours ago

Great use of "masticated"!

[–] A_norny_mousse@feddit.org 3 points 20 hours ago

> I can buy a book, scan it and put it on the internet and it wouldn’t be piracy?

Yes, but only if you're a multi-billion AI company.
