this post was submitted on 09 Mar 2026
649 points (99.2% liked)

Technology


To help train its AI models, Meta (and others) used pirated versions of copyrighted books without the consent of authors or publishers. The company behind Facebook and Instagram faces an ongoing class-action lawsuit brought by authors including Richard Kadrey, Sarah Silverman, and Christopher Golden, one in which it has already scored a major (and surprising) victory: a California court concluded last year that using pirated books to train its Llama LLM did qualify as fair use.

You'd think this case would be as open-and-shut as it gets, but never underestimate an army of high-priced lawyers. Meta has now come up with the striking defense that uploading pirated books to strangers via BitTorrent qualifies as fair use. It goes on to claim that this is doubly good, because it has helped establish the United States' leading position in the AI field.

Meta further argues that every author involved in the class action has admitted they are unaware of any Llama LLM output that directly reproduces content from their books. If the authors cannot point to infringing output or lost sales, Meta says, then the lawsuit is not about protecting their books but about attacking the training process itself (which the court has already ruled is fair use).

Judge Vince Chhabria now has to decide whether to allow this defense, a decision that will have consequences not only for this case but for many other AI lawsuits involving shadow libraries. The BitTorrent uploading and distribution claims are the last unresolved element of this particular lawsuit, which has been rumbling on for three years now.

(page 2) 50 comments
[–] ryathal@sh.itjust.works 5 points 3 weeks ago (2 children)

Arguing that training models isn't fair use is going to be a massive uphill battle; it's basically reading the book, but with a computer. It's not actually a big deal to most people, unless you hold the copyright to a ton of works and want a percentage of all the AI income these companies have made.

Torrenting the books is likely absolutely copyright infringement, but that has relatively low payout compared to the money these companies are getting for their models. The training being fair use means that rights holders can't try to take any money from the model's use. The statutory limits for infringement even at per work levels aren't significant compared to the legal cost of proving it happened.

[–] OfCourseNot@fedia.io 4 points 3 weeks ago (2 children)

There's an argument to be made that it is, in fact, not 'reading'. The training of the model could be considered a lossy compression of the data. And streaming movies in a lossy compression format is not fair use, is it?

[–] Fatal@piefed.social 4 points 3 weeks ago

It's not the storage of the information that matters as much as the presentation. Google's search index stores a huge amount of copyrighted material, even losslessly. But they only present small snippets at a time which is not considered copyright infringement. The question really is whether or not the information being presented by the models is in a format which is considered copyright infringement. So far, courts have not found that they are.

[–] ryathal@sh.itjust.works 2 points 3 weeks ago

The model doesn't stream out anyone's content though. The article mentions that the plaintiffs have provided no examples of a prompt that creates anything substantial.

Streaming a lossy compression would generally be infringement, but there is definitely a point where it becomes not infringement if it's lossy enough.

What a model generally stores is factual information that isn't copyrightable in the first place. It's storing word counts, sentence lengths, sentiment analysis, and so on.
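For illustration, here's a minimal Python sketch of extracting the kind of aggregate statistics the comment describes: word frequencies and sentence lengths, which are facts about a text rather than the text itself. (This is a hypothetical toy example for the argument's sake; real LLM training encodes statistical patterns as learned weights, not explicit tables like this.)

```python
import re
from collections import Counter

def text_statistics(text):
    """Derive aggregate statistics from a text: word frequencies and
    sentence lengths, rather than the expressive text itself."""
    # Split into rough sentences on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "word_counts": Counter(words),
        "sentence_lengths": [len(re.findall(r"[a-zA-Z']+", s)) for s in sentences],
        "total_words": len(words),
    }

stats = text_statistics("Call me Ishmael. Some years ago, I went to sea.")
print(stats["total_words"])       # 10
print(stats["sentence_lengths"])  # [3, 7]
```

Whether distilling a work down to numbers like these (or to model weights) escapes copyright is exactly the question the thread is debating.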

[–] FatCrab@slrpnk.net 3 points 3 weeks ago (1 children)

Anthropic pirating books for their training corpus resulted in the biggest copyright settlement in history, well over a billion dollars. That is still being quibbled over, I believe, but they settled because they were likely to pay out more if the case went forward. So I'm not really sure where you're coming from that infringement via torrenting does not result in monstrously large liability.

[–] ryathal@sh.itjust.works 2 points 3 weeks ago (2 children)

The judge in that case ruled the training wasn't fair use for pirated books, which left them on the hook for potentially all revenue (likely a court-determined percentage) that the model generated for them, in addition to statutory damages. That is well north of $1.5 billion.

[–] FatCrab@slrpnk.net 1 points 3 weeks ago

Just noticed your reply and want to correct this. Anthropic settled; the $1.5 billion was not a judgment against them. Specifically, the settlement covered the literal pirating of the training corpus. It had absolutely nothing to do with how training handled the data--they literally torrented an enormous portion of their training corpus.

Anthropic DID try to argue that because they used the pirated material for training a model, it was fair use. The judge correctly decided that doesn't make any fucking sense. Again, this is not about the models encoding data; it is literally just about the fact that these silly fucks torrented vast portions of their training corpus like college students building a porn library on college broadband.


As long as they cannot copyright what they generate from using the pirated materials

[–] nutsack@lemmy.dbzer0.com 4 points 3 weeks ago

I saw this coming from 69 miles away

[–] Grimy@lemmy.world 3 points 3 weeks ago* (last edited 3 weeks ago)

They didn't say seeding is fair use, just that it's inherently part of torrenting. Good thing Sarah Silverman has PC Gamer there to pander for her.

[–] AffineConnection@lemmy.world 3 points 3 weeks ago

It's OK when corporations do it.

[–] whotookkarl@lemmy.dbzer0.com 3 points 3 weeks ago* (last edited 3 weeks ago)

Copyrights lasting more than 5-10 years, or not held by the creator, are stealing from the commons/public domain, and there is no moral obligation to follow those laws; some would say there's a moral responsibility to share pirated copies of those works with everyone, not just corpo slop machines. Also, good luck proving that leading in AI is a good thing and not destroying education and critical thinking skills.

[–] HaunchesTV@feddit.uk 2 points 3 weeks ago

Just spitballing...

If you were to train a model on just one book, then as long as you don't prompt it to create an exact copy (maybe just one with some indiscernible differences), presumably that's fair use.

Then, since we know AI generated work can't be copyrighted, does that essentially create a copyright-free version of the text which can be freely distributed?

[–] Strider@lemmy.world 1 points 3 weeks ago* (last edited 3 weeks ago)

Yeaaah well. I'm just gonna say everything is free now.

(except if I explicitly want to give someone money of course. Surely not a company)

[–] oscarpizarro@masto.es 1 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

@artifex

Information should be free.

Personally, I'm not saying this to defend US law, nor to defend Meta. I say this so nobody jumps to empty conclusions about what I am or am not.

[–] artifex@piefed.social 2 points 3 weeks ago

A reasonable copyright is a good thing - it gives authors a limited period of exclusivity on their work, after which it becomes part of our general culture. What people are upset about, I think, is how the biggest companies are "allowed" to violate copyright in the name of business, while the rest of us are not.

Machine translation, because my Duolingo Spanish level is only 35:

Un derecho de autor razonable es algo positivo: otorga a los autores un periodo limitado de exclusividad sobre su obra, tras la cual pasa a formar parte de nuestra cultura general. Lo que a la gente le molesta, creo, es cómo a las empresas más grandes se les "permite" violar los derechos de autor en nombre de los negocios, mientras que el resto de nosotros no.
