In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you're probably pretty safe.
FaceDeer
The courts have yet to reach a conclusion; the lawsuits are still ongoing. I think it's unlikely they'll conclude that the models contain the data, however, because it's objectively not true.
The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION 5B dataset, which (as the "5B" indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it's compressing images and storing them inside the model it'd somehow need to fit ~2.7 images per byte. This is, simply, impossible.
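For anyone who wants to check that arithmetic, here's a quick sketch. It uses only the two figures quoted above (5 billion images, 1.83 gigabytes); treating a gigabyte as 10^9 bytes is my assumption, and the function names are just illustrative.

```python
# Back-of-the-envelope check of the "images per byte" claim.
# Both figures come from the comment above; decimal gigabytes
# (1 GB = 10**9 bytes) is an assumption.
NUM_IMAGES = 5_000_000_000   # "5B" images in the training set
MODEL_BYTES = 1.83e9         # 1.83 GB model file


def images_per_byte(num_images: float, model_bytes: float) -> float:
    """How many training images would have to fit in each byte of the model."""
    return num_images / model_bytes


def bits_per_image(num_images: float, model_bytes: float) -> float:
    """How many bits of model storage are available per training image."""
    return model_bytes * 8 / num_images


if __name__ == "__main__":
    print(f"{images_per_byte(NUM_IMAGES, MODEL_BYTES):.2f} images per byte")
    print(f"{bits_per_image(NUM_IMAGES, MODEL_BYTES):.2f} bits per image")
```

With those inputs it works out to roughly 2.7 images per byte, or a bit under 3 bits per image. Using binary gibibytes instead shifts the numbers slightly but not the conclusion.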
In exchange, the WHO gets to make sane policies about vaccines, women's health, and sexual identity. Could easily be worth it.
That makes it worse, actually. The half that doesn't forgive you is the one that's the asshole. If you cut it in half lengthwise then both of them get the asshole.
You said:
What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.
But the point is that it doesn't matter if the data is licensed or not. Lack of licensing doesn't stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?
They pulled a very public and out in the open data heist and got away with it.
They did not. No data was "heisted." Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.
That's not what's going on here, though. The LLM doesn't contain the actual copyrighted data; it's the result of analyzing the copyrighted data.
An analogous example would be a site like TV Tropes. TV Tropes doesn't contain the works that it's discussing, it just contains information about those works.
Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.
There's no need to "make it legal." Things are legal by default until a law is passed to make them illegal, or until a court precedent establishes that an existing law applies to the new thing under discussion.
Training an AI doesn't involve copying the training data, and the AI model doesn't literally "contain" the stuff it's trained on. So it's not likely that existing copyright law makes it illegal to do without permission.
Are you threatening me with a good time?
First of all, whether these LLMs are "illegally trained" is still a matter before the courts. When an LLM is trained it doesn't literally copy the training data, so it's unclear whether copyright is even relevant.
Secondly, I don't think that making these models "public domain" would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn't matter if it's public domain or not because you can't get a copy of it. When a company releases an open-weight model for public use, on the other hand, they usually encumber it with some sort of license that makes it harder for competitors to monetize or build on. Making those models public domain would greatly increase their utility. It might make future releases less likely, but in the meantime it'd greatly enhance AI development.
The Olamic Quietude. A colony of transhumanists just doing their own thing for fifteen thousand years, developing some really spiffy tech and not being afraid of technology. Then the Imperium shows up and wipes them out.
I know the Imperium has done a lot of BS, but that one sticks in my mind for some reason. I really wish the Quietude had survived somehow.
You've got your definition of "derivative work" wrong. It does indeed need to contain copyrightable elements of another work for it to be a derivative work.
If I took a copy of Harry Potter, reduced it to a fine slurry, and then made a paper mache sculpture out of it, that's not a derivative work. None of the copyrightable elements of the book survived.