this post was submitted on 02 Aug 2025
259 points (95.4% liked)

Not The Onion

17464 readers
1122 users here now

Welcome

We're not The Onion! Not affiliated with them in any way! Not operated by them in any way! All the news here is real!

The Rules

Posts must be:

  1. Links to news stories from...
  2. ...credible sources, with...
  3. ...their original headlines, that...
  4. ...would make people who see the headline think, “That has got to be a story from The Onion, America’s Finest News Source.”

Please also avoid duplicates.

Comments and post content must abide by the server rules for Lemmy.world and generally abstain from trollish, bigoted, or otherwise disruptive behavior that makes this community less fun for everyone.

And that’s basically it!

founded 2 years ago
top 29 comments
[–] JackbyDev@programming.dev 7 points 14 hours ago

I don't really see a problem. It wasn't like rare books nobody had access to. I mean, AI in general yeah. But not the book part.

[–] AlecSadler@lemmy.blahaj.zone 4 points 15 hours ago

What'd ChatGPT do?

What'd Meta do?

[–] PriorityMotif@lemmy.world 20 points 21 hours ago (1 children)

You can buy books by the truckload for almost nothing because nobody wants them.

[–] shalafi@lemmy.world 9 points 19 hours ago

Ex-wife and I had a room in the trailer house dedicated to books, one might have even called it a library. Seriously, cheap bookshelves, all 4 walls, stuffed. Now I have a stack on a small shelf.

Just can't use books anymore. I can choose any of 1,000 epubs on my crappy Android tablet, read in the dark.

[–] FaceDeer@fedia.io 140 points 1 day ago (2 children)

To comply with copyright law, not to skirt it. That's what companies that scan large numbers of books do. See for example Authors Guild v. Google from back when Google was scanning books to add to their book search engine. Framing this like it's some kind of nefarious act is misleading.

[–] masterspace@lemmy.ca 75 points 1 day ago (1 children)

They also weren't destroying rare books. They were buying in-print books from major retailers, which means that while yes, that is environmentally wasteful, it's not actually destroying books in the classical destruction of knowledge sense since the manufacturer will just print another one if there's demand for it.

[–] MrQuallzin@lemmy.world 26 points 1 day ago (2 children)

This as well. Growing up in a house of book lovers, myself included, destroying a book was akin to kicking a puppy. Realistically though, they're ultimately consumables. They're meant to be bought, used, and replaced as needed. With luck the destruction included recycling as much as possible, seeing as it's mainly paper.

[–] masterspace@lemmy.ca 4 points 22 hours ago

Precisely, there's a reason that these days, books made for libraries are made to an entirely different standard than books sold at your local book store.

[–] MDCCCLV@lemmy.ca 1 points 22 hours ago

Yeah, you have millions of old books that nobody wants, not even collectors. It's not just popular literature.

[–] MrQuallzin@lemmy.world 26 points 1 day ago (1 children)

Yeah, this is on the way to being a win. In this case they actually bought the books, and not paying for training material has been one of the biggest issues with LLMs. There's certainly more discussion to be had around how they use the materials in the end, but this is a step in the right direction.

[–] Humanius@lemmy.world 17 points 1 day ago (1 children)

To a certain extent I agree, but you can buy a book and still commit copyright infringement by copying its contents (for anything other than personal use).

If this went to court, it would depend on whether training an LLM is more akin to copying or learning. I can see arguments for either interpretation, but I suspect that the law would lean more toward it being copying rather than learning.

[–] FaceDeer@fedia.io 6 points 1 day ago (2 children)

There's already been a summary judgment in this case ruling that the AI training activity was not by itself a copyright violation.

[–] Natanael 3 points 21 hours ago* (last edited 21 hours ago) (1 children)

This isn't an automatic complete win for them.

Being allowed to train under fair use rules doesn't mean you're protected if your LLM still regurgitates content.

https://arstechnica.com/tech-policy/2025/07/nyt-to-start-searching-deleted-chatgpt-logs-after-beating-openai-in-court/

[–] FaceDeer@fedia.io 2 points 21 hours ago

The lawsuit between NYT and OpenAI is still ongoing; this article is about a court order to "preserve evidence" that could be used in the trial. It doesn't indicate anything about how the case might ultimately be decided.

Last I dug into the NYT v. OpenAI case it looked pretty weak. NYT had heavily massaged their prompts in order to get ChatGPT to regurgitate snippets of their old articles, and the judge had called them out on that.

[–] Humanius@lemmy.world 1 points 1 day ago

I see. In that case I stand corrected.

[–] Bot@sub.community 3 points 15 hours ago* (last edited 15 hours ago)

Why not just move to Piracy Haven - China?

[–] NoneOfUrBusiness@fedia.io 9 points 1 day ago

This is stupid. Fuck copyright law.

[–] A_norny_mousse@feddit.org 10 points 1 day ago* (last edited 1 day ago) (1 children)

The title might be slightly hyperbolic, but still:

The quality of training data fed into the neural network directly impacts the resulting AI model's capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books to avoid what CEO Dario Amodei called "legal/practice/business slog"—the complex licensing negotiations with publishers. But by 2024, Anthropic had become "not so gung ho about" using pirated ebooks "for legal reasons" and needed a safer source.

Buying used physical books sidestepped licensing entirely while providing the high-quality, professionally edited text that AI models need, and destructive scanning was simply the fastest way to digitize millions of volumes. The company spent "many millions of dollars" on this buying and scanning operation, often purchasing used books in bulk. Next, they stripped books from bindings, cut pages to workable dimensions, scanned them as stacks of pages into PDFs with machine-readable text including covers, then discarded all the paper originals.

oof.

[–] jlow@discuss.tchncs.de 5 points 1 day ago (2 children)

Is this for real? I can buy a book, scan it and put it on the internet and it wouldn't be piracy? Or is this just the usual "it's not a crime if rich people/evilcorps do it" bs?

[–] tburkhol@lemmy.world 11 points 1 day ago (1 children)

Putting the scan on the internet intact would be piracy. Putting up snippets is mostly OK. Ingesting the scans of millions of books into a massive data set and then regurgitating pieces of the masticated, processed mess seems still to be a grey area, but closer to 'mostly OK' than to piracy.

[–] shalafi@lemmy.world 2 points 19 hours ago

Great use of "masticated"!

[–] A_norny_mousse@feddit.org 4 points 1 day ago

I can buy a book, scan it and put it on the internet and it wouldn’t be piracy?

Yes, but only if you're a multi-billion-dollar AI company.

[–] captainastronaut@seattlelunarsociety.org 12 points 1 day ago (1 children)

And that was only after they ran out of pirated ebooks.

[–] masterspace@lemmy.ca 19 points 1 day ago

No, it was after their lawyers told them that was illegal and would cause them to lose a fair use copyright claim.

[–] Eat_Your_Paisley@lemmy.world 12 points 1 day ago (1 children)

Fucking tech bros

Can't they all just go to space and leave the rest of us alone?

[–] pelespirit@sh.itjust.works 14 points 1 day ago (2 children)

Because they want earth. The poors are the ones going to space to work the mines, I'm sure of it. Not sure why anyone thinks they're the ones going to space to live.

[–] MDCCCLV@lemmy.ca 2 points 17 hours ago

Anything mined in space will be used in space. But you can build orbital colonies that are nice. You just need lots of mass to build stuff and have plenty of water and be big enough to rotate.

[–] TheTurner@lemmy.zip 8 points 1 day ago

It'll be like Weyland Industries in Alien: Romulus.

[–] gravitywell@sh.itjust.works 5 points 1 day ago* (last edited 1 day ago)

And Meta just pirated them directly from LibGen, big fucking deal. Copyright law needs to die.

I'm guessing, based on how much of my own library Anthropic tries to scan, that they probably also got a lot more content from the open web anyway, and just destroyed the books to make a show of it and hope no one sues them after.

You don't need to unbind or destroy books to scan them, and destroying them doesn't magically make reproducing bits or copies suddenly not plagiarism.