LocalLLaMA
Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open source neural network technology together.
Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.
As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.
Rules:
Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing about entire groups of people that make up our community, no baseless personal insults.
Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming that the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.
Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. no statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."
Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.
I don't see how that would be fair use or what the argument is supposed to be.
Let me warn you that Lemmy is full of disinformation on copyright. If you picked the idea up here, then it probably is absolutely bonkers.
In any case, fair use is a US thing. In the EU, it would still be yoink.
I think I used a bit too much sarcasm. I wanted to put a spin on the idea of how the AI industry simultaneously relies on copyright and finds ways to "circumvent" the traditional copyright law that was written before we had large language models. An AI is neither a telephone book, nor should every transformative work be Fair Use, no questions asked. And this isn't really settled as of now. We likely need some more court cases and maybe a few new laws. But you're right, law is complicated, there is a lot of nuance to it, and it depends on jurisdiction.
Alas, we have reached the max comment depth. I cannot reply to your latest comment.
I see what you mean now. It's tricky. It's just another way in which copyright talking points cause problems.
You're saying that using/copying something you have in a database for AI training should always be legal. However, copying something to add it to the database should be judged as if it was done for enjoyment. E.g. everyone who torrents a movie should be treated the same, regardless of purpose. This will certainly cause problems for some scientific datasets.
Whether you downloaded a legal copy depends on whether the party offering the download had the right to do so. Whether that is the case may not be apparent. The first question is: What duty does someone have to check the provenance of content or data?
Torrents of current movies and the like are very obviously not authorized. For older movies, that becomes less clear. The web contains much unauthorized content, for example the news stories that people copy/paste on Lemmy. What duty is there to determine the copyright status of the content before using such data?
When researchers and developers share datasets, what duty do they have to check how the contents were obtained by whoever assembled it?
What happens when something was wrongly included in a dataset? Is that a problem only for the original curator, or also for everyone who got a copy?
What about streams, live TV, radio, and such things? Are you allowed to record those for training or not?
That's not quite right. Ultimately, Fair Use derives from the US Constitution; from the copyright clause but also freedom of speech. Copyright law spells out 4 factors that must be taken into account. But courts may also consider other factors. There is also no set way in which these factors have to be weighed. It's very open.
There are minimum conditions before prosecution is possible. I think uploading can always be prosecuted.
Well, over the last few decades it has only been going in the other direction.
How does this fit together with calling copyright infringement theft?
Let me make a suggestion. This is your real opinion. This is what you believe based on what you see. The rest is just slogans from the copyright industry, which you repeat without thinking. The problem is that you are basically shouting down your own opinion. The media, a big part of the copyright industry, puts these slogans out. Their lobbyists demand favors and harsher laws from politicians. And when the politicians look at what voters think, they hear these slogans. That's one thing I mean when I say the copyright industry defrauds us.
Exactly, they don't pay more for the same thing. It's almost exclusive to the copyright industry.
Actually, even in the copyright industry, such terms are far from universal. Of course, you will have to pay more for the right to make copies than for a single copy. And even more for the exclusive copyright. Those things are different. However, it's usually a flat fee. Can you figure out what economic reasons might exist for a creator being paid per copy or per viewer?
"No exceptions" means, for example, that a LLM would not be able to answer questions about politicians, actors, musicians, maybe not even about historical figures.
You said that there should be a way that you can remove your personal data from the training set. That implies that an AI company can offer money in exchange for people not removing their data. That's basically a licensing fee, however it is framed.
On second thought, I believe many celebrities, business people, politicians, and so on will gladly offer more training data that makes them look good. They'd only remove data that makes them look bad. Sort of like how the GDPR works. Far from demanding a licensing fee, they'd pay money to be known by the AI.
I agree that the situation is far from ideal. But let me point out that you do not have a right to other people's computer services. That's the issue with Alibaba hitting your server, right? It's a difficult issue. Mind that an opt-out from AI training does not actually address this.
How so?
Oh, wow.
I mean for some questions, we already have an old way of doing it and it's relatively straightforward to apply it:
Selling/Buying something is a very common form of contract. In our economy, the parties themselves decide what's in the contract. I can buy apples, cauliflower or wood screws per piece or per kilogram. That's down to my individual contract between me and the supermarket (or hardware store) and nothing the government is involved in. It's similar with licensing, that's always arbitrary and a matter of negotiation.
Of course for everyone. If I download a torrented copy of a Hollywood film, that's not "healed" by it being a copy of a copy. It's still the same thing.
It's due diligence. Especially once someone uses (or publishes) something. And it very much depends on circumstances. Did they do it deliberately, specifically ignoring that they were in violation of something? If they were wrongly under the assumption that it was a legal copy, then it's more analogous to fencing. They're not in trouble for stealing anymore, but they can be ordered to let go of the stolen goods. I'd say that's pretty much the same liability as with other things. Say I kill someone with my car. Now the question is: have I been negligent? Did I know the brakes were faulty but fail to repair them and use the car nonetheless? Or did the car manufacturer mess up? There might be a case against me, or the manufacturer, or both. And both civil and criminal law can be involved in different ways.
I'd do it like with shipments in the industry. If you receive a truck load of nuts and bolts, you take 50 of them out and check them before accepting the shipment and integrating the lot into your products.
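To illustrate, here's a toy sketch in Python of what such a spot check could look like for a dataset. Everything in it is invented for illustration: the field names, the 2% threshold, and provenance_ok(), which just stands in for whatever due-diligence test actually applies.

```python
import random

def provenance_ok(record: dict) -> bool:
    # Placeholder check: a real audit might verify a license tag or
    # look the source up against known shadow-library domains.
    return record.get("license") in {"CC0", "CC-BY", "public-domain"}

def accept_batch(dataset: list[dict], sample_size: int = 50) -> bool:
    """Accept or reject a whole batch based on a random spot check,
    like pulling 50 bolts out of a shipment before signing for it."""
    if not dataset:
        return False
    sample = random.sample(dataset, min(sample_size, len(dataset)))
    failures = sum(1 for record in sample if not provenance_ok(record))
    # Reject the batch if more than 2% of the sampled records fail.
    return failures / len(sample) <= 0.02
```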
Though that is very hypothetical. If the torrent has annas-archive or libgen.is in the title... It's pretty obvious. And that was what happened here. They did it deliberately and we know they knew.
And this week the next lawsuit started, alleging they (Meta) uploaded tons of porn videos (illegally) in order to download what they were interested in, since BitTorrent has a tit-for-tat mechanism.
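For context, a rough sketch of the tit-for-tat idea (not Meta's or any real client, and heavily simplified): a peer mostly uploads to whoever has recently uploaded the most to it, plus one random "optimistic" slot, so downloading fast generally requires seeding back.

```python
import random

def choose_unchoked_peers(upload_rates: dict[str, float], slots: int = 4) -> set[str]:
    """Pick which peers to upload to, tit-for-tat style.
    upload_rates maps peer id -> bytes/s that peer recently sent us."""
    # Regular slots go to the peers that recently gave us the most data.
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    unchoked = set(ranked[:slots])
    # One extra random slot ("optimistic unchoke") lets unknown peers
    # prove themselves and gives newcomers their first pieces.
    remainder = ranked[slots:]
    if remainder:
        unchoked.add(random.choice(remainder))
    return unchoked

# A peer that never uploads rarely wins a regular slot, which is why a
# client that only leeches tends to download slowly.
rates = {"A": 120e3, "B": 95e3, "C": 80e3, "D": 60e3, "E": 0.0}
print(choose_unchoked_peers(rates))
```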
So I believe we first need to address the blatant piracy before talking about hypothetical scenarios. I believe that's going to be easier, though. I proposed to mandate transparency with what a company piled up in a dataset. One of the reasons was to address this. Like with the DMCA and GDPR, this could be a relatively simple mechanism where the provider (or company) gets some leeway, since it indeed is complicated. People will get a procedure to file a complaint and then someone can have a look whether it was wrongly included.
I wasn't concerned with copyright here. Let's say I'm politically active and someone leaks my address, and now people start showing up, throwing eggs at my front door and threatening to kill me. Or someone spreads lies about me and that gets ingested. Or I'm a regular person and someone posted revenge porn of me. Or I'm a victim of a crime and that's always the first thing that shows up when someone puts in my name, and it's ruining my life. That needs to be addressed/removed. Free of charge. And that has nothing to do with licensing fees for content or celebrities. When companies use data, they need to have a complaints department that will immediately check whether the complaint is valid and then act accordingly. There needs to be a distinction between harmful content and copyright violations.
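As a sketch of that distinction, a hypothetical complaint intake might route the two categories differently. Again, everything here (the names, fields, and routing rules) is made up for illustration:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Kind(Enum):
    HARMFUL_PERSONAL = auto()  # doxxing, defamation, revenge porn, ...
    COPYRIGHT = auto()         # licensing / unauthorized-copy claims

@dataclass
class Complaint:
    dataset_id: str
    record_id: str
    kind: Kind
    details: str

def route(complaint: Complaint) -> str:
    # Harmful personal data gets suspended immediately and free of charge;
    # copyright claims go through a slower, DMCA-style verification step.
    if complaint.kind is Kind.HARMFUL_PERSONAL:
        return f"suspend {complaint.record_id} in {complaint.dataset_id}, then review"
    return f"queue {complaint.record_id} for copyright verification"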
Thanks for explaining. I didn't know those were only guidelines. But it makes sense and that's generally different between common law countries and whatever we are called, civil law countries?
And that is for a good reason. Generally, physical things can't be copied easily, so handling copying isn't really necessary with physical goods. That's kind of in the word "copyright". Though when licensing immaterial goods, for example, you're also buying a different license and different rights, not the same thing.
Maybe think more in terms of services and licensing, since that's the main point here. In the material world that'd be something like the difference between renting an excavator for 2 weeks or buying the same one. It'll be exactly the same excavator I get. It's going to be a very different number on the bill, and I get different rights and obligations.
Sure. Since I grew up with the German model, I'd open yet another category for AI training so it can be handled specifically. I mean, it doesn't really fit into anything existing. An AI model is neither the act of making copies nor a copy itself, yet it still uses the work. And it's also not an art form or a citation. So I need a good argument why it needs to be mushed together with something else.
And datasets and model weights are yet different things. Since we agree that AI training is transformative, we can confine copyright to the datasets, and it's not much of an issue with the learned model weights. Or at least it shouldn't be. And I mean, we have enough other issues to deal with that arise from the models themselves.
I think you underestimate the consequences. The AI Fair Use plus the illegitimate scraping have led to a quite substantial war on the internet. Now every entity is fighting for their own. People like me are at the bottom of the chain, and we have to protect our servers simply so they don't burn down. Big content platforms wage war as well. They don't want "their" content to be scraped. Leaving it open like before only cuts into their business; they'd rather sell it themselves. So they started making lots of things inaccessible by technical means, and they combat the freedom we had before.
And that's the conundrum. In practice, this leads to the opposite. My own Fair Use of content (and that of other normal people and smaller businesses) is collateral damage. I used to archive some videos and I run a PeerTube instance. And now Google blocks all datacenter IPs, so you can only watch YouTube from a residential internet connection. They introduced rate-limiting. Reddit's API debacle in 2023 was largely about this. Countless other services and platforms have become enshittified due to this. And many more will.
Idk if the average consumer notices yet. But it's really bad once you look under the hood. And this is not sustainable. And the beneficiaries of this war are mostly big companies. Like Reddit, who found a way to make a profit off of it. And Cloudflare, who were way too dominant a central internet instance before, and now they're the arms dealer in the war against scraping, which makes them even bigger.
All the while the internet gets more locked down, enshittified... And everyone who isn't the big content industry or already a monopolist, loses.
See my text above. Even if it was a nice idea, it leads to the opposite in the real world. A few big internet companies "win" this war with technology, disregarding the idea behind the law, and everyone except them loses. That cements monopolies rather than helping against them.
And more generally: most AI companies are billion-dollar companies that own half the internet ("land"), while a random nonfiction book author is an individual with a very moderate income. And Fair Use now says the labour of the small guy is free of charge for the big company.
Ultimately, I'm not set on any ideology here. I'm regularly more concerned with making things work. And that's my goal here, too. I want a world that includes the existence of books and TV shows. So I need people to do that job. And jobs can't be done if the people doing them starve.
Copyright is just a tool trying to achieve that. And it's kind of a half-way obsolete one with a lot of negative side effects. I'm not set on it. We just need a way so books and TV shows are still a thing in 20 years. And that's my concern here and why I talk a lot about the labour involved, and never about how they deserve to get rich if they're popular or if they manage the stuff.
And I see roughly 3 options for the future: a) Nobody pays them, or b) people who make use of their labour pay them, or c) some people pay, some get a free pass.
And the way I see it, a) is a future where quality and professional content is likely going to vanish at scale. And I'm not sure the exact pre-copyright model applies to our modern world. Things have changed. For example, copying things was an expensive process back then and required very expensive machinery, whereas in the digital age it's done at no cost and by everyone. b) is what I'm advocating for. Everyone needs to pay. Preferably not every taxpayer, but the people who actually use the stuff. And c) is what I called a "subsidy": everyone gets to use it but only a group of people pays for everyone.
I mean, what's your idea here? I can't really tell. Let's say we're not set on copyright. How does $90,000 a year reach a book author so it's a viable job and they can create something full time? And I'd like a fair solution for society.
I'm changing the order some, because I want to get this off my chest first of all.
That's not what I'm seeing. Here's what I'm seeing:
First, you start out with a little story. Remember my post about narratives?
You emphasize what "needs" to be achieved. You try to engage the reader's emotions. What's completely missing is any concern with how or if your proposed solution works.
There are reputation management companies that will scrub or suppress information for a fee. People who are professionally famous may also spend much time and effort to manipulate the available information about them. Ordinary people usually do not have the necessary legal or technical knowledge to do this. They may be unwilling to spend the time or money. Well, one could say that this is alright. Ordinary people do not rely on their reputation in the same way as celebrities, business people, and so on.
The fact is that your proposal gives famous and wealthy elites the power to suppress information they do not like. Ordinary people are on their own, limited by their capabilities (think about the illiterate, the elderly, and so on).
AIs generally do not leak their training data. Only fairly well known people feature enough in the training data that an LLM will be able to answer questions about them. Having to make the data searchable on the net makes it much more likely that it is leaked with harmful consequences. On balance, I believe your proposal makes things worse for the average person while benefiting only certain elites.
It would have been straightforward to say that you wish to hold AI companies accountable for damage caused by their service. That's the case anyway; no additional laws needed. Yet, you make the deliberate choice to put the responsibility on individuals. Why is your first instinct to go this round-about route?
But market prices aren't usually arbitrary. People negotiate but they usually come to predictable agreements. Whatever our ultimate goals are, we have rather similar ideas about "a good deal".
All very reasonable ideas. Eventually, the question is what the effect on the economy is, at least as far as I'm concerned.
These tests mean that more labor and effort is necessary. Mistakes are costly. These costs fall on the consumer. The big picture view is that, on average, either people have less free time because more work is demanded, or they make do with less because the work does not produce anything immediately beneficial. So the question is if this work does lead to something beneficial after all, in some indirect way. What do you think?
No. That is the immediate hands-on issue. As you know, the web is full of unauthorized content.
Well? What's your pitch?
That is not happening, though?
You compare intellectual property to physical property. Except here, where it becomes "labor". I don't think you would point at a factory and say that it is the owner's labor. If some worker took some screws home for a hobby project, I don't think you would accuse him of stealing labor. Does it bother you how easily you regurgitate these slogans?
Good question. That's an economics question. It requires a bit of an analytical approach. Perhaps we should start by considering if your idea works. You are saying that AI companies should have to buy a copy before being allowed to train on the content. So: How many extra copies will an author sell? What would that mean for their income?
We should probably also extend the question beyond just authors. Publishers get a cut for each copy sold. How many extra copies will a publisher sell and what does that mean for their income?
Actually, the money will go to the copyright owner; often not the same person as the creator. In that way, it is like physical property. Ordinary workers don't own what they produce. A single daily newspaper contains more words than many books. The rights are typically owned by the newspaper corporation and not the author. What does that mean for their income?
I think you're a bit too focused on narratives. I mean, how am I supposed to share my perspective without sharing my perspective? Of course that's going to include stories about bad things that happened to me. I've handled some privacy and personal-information issues for not-so-tech-savvy people. You should feel privileged if you haven't had a lot of bad or complicated things happen to you, but I can assure you there are ordinary people with different stories. I didn't handle death threats, but there were some other legitimate reasons, from simple job-related ones to bad and disgusting ones. And we can't just throw those people under the bus and say »yeah, your well-being just cuts into profit«...
This isn't copyright, so I'm going to move on. But this goes hand in hand with other regulations for datasets and online services.
Well, if I ask them about events and organizations I was part of, AI does seem to know details. And those were small and local things. No celebrities involved. AI however hallucinates a lot and >80% of names or details are currently made up. I bet AI is going to become better, though. It's definitely already able to connect some lesser-known names.
Could be the same here. Maybe the free market will arrive there after things settle down. You're right, the content industry is a shitty corner of the market. I'd like to mention Spotify as precedent, who are able to license pretty much all important music despite paying next to nothing to artists. Or my university library, who were able to stock pretty much all important books for their students. This might be achievable in some way for AI, too. Other businesses seem to be able to obtain special licenses for use cases other than being a regular customer.
No one promised it has to be easy. Other products also cost some extra because we have some minimum requirements. For food safety, cars, fair rides... I wouldn't want to do away with that, so I think this always leads to something beneficial. We just need to strike a balance. Every now and then a rollercoaster crashes and people die. Nothing is perfect. We collectively decide what rate of rollercoaster crashes we deem acceptable. And then the experts write some regulations to achieve that.
Pretty much what I'm arguing for, here. Discard the idea which causes it. It's obviously not working. Likely because it's too simplistic.
As I said before, it already happened. Three years ago, I and any independent researcher were able to use the Reddit API and use YouTube. Now we're not. And the monopolists struck deals amongst themselves. Ever wondered why many more paywalls popped up with news outlets lately? Cloudflare and Anubis checks before a page loads? You get locked out of Codeberg for 24h+ and can't update your server? Your alt account gets deactivated for "suspicious activity"? Those are all indications that something has happened behind the scenes. And it achieves the desired effect. More and more information is now under tighter control. For the AI companies and for everyone.
And all of this happened to me as well, along with me needing to do the same, since the bots also showed up at my front door. The rate at which this happens correlates perfectly with the scraping war. And from personal experience and from talking to other admins, I know bots and scraping are the cause.
What slogan? And what hobby project? ChatGPT certainly isn't a hobby project. That thing costs some three-digit millions of dollars per iteration. And they're also not taking a few screws. They're the employee who takes one screw out of every other packet, and with that throughput, they have a nice side business in screws.
I was trying to make a point here: Take away copyright since we both don't like it... Now what remains? I think the labour of the author.
And since we're always discussing feudalism and monopoly... Am I right that this describes the AI industry, or did I miss something? In my eyes, we currently have Google (which is a monopolist), Microsoft (another monopolist) and the other 51% of OpenAI, which seems very well off; we have Apple (I think also a monopolist, and also in the top 10 richest companies). Nvidia does AI, has been propelled to the top market cap by AI, and has monopoly-like margins. Then we have Meta and Elon Musk's companies in the business, also valued at a trillion dollars. Then we have "startups" funded by public money from the Chinese government. Anthropic (interestingly enough, now sued by Reddit for scraping their data), ElevenLabs, and in Europe: Mistral, Stability.ai and Black Forest Labs. (And a few other players like Stanford and other universities, smaller companies/startups, and quite an active fine-tuning community.)
That's pretty much what I read about. Most of them are among the richest companies on planet Earth. Several of them are monopolists. Some happen to be the ones who own the big platforms that make up the internet. So if we now say AI training supplies need to be cheaper, whether that's right or wrong... you know who 90% of that benefit goes to? ...Them.
And that's not wrong. They have a legitimate business, and it's not wrong to make money selling GPUs or AI. It's just that you can't say you're against feudalism and monopolies, and then devise a rule where the list of the main beneficiaries is dominated by the monopolies and feudalism from before. There is some desired outcome, but that's just among the also-rans.
That's just you being against monopolies where it suits you while being completely oblivious to them in other areas. By and large, probably enabling them.
Now, the content industry is bad as well. And we find Disney, Warner Bros and Netflix on the list of Fortune 500 companies. It seems the publishing houses aren't even amongst them. And now you want to redistribute resources, and the main chunk moves up the chain to the select top. Most of them have several rulings against them for having (for example) devised ecosystems to arrive at a monopoly and then subsequently abused the powers that come with it. You didn't level the playing field; we can tell from the last few years and from how US AI law turned out that you mainly helped the big companies and monopolists. And we can have a look at the financial figures: they've mostly been posting record profits since Covid, while that's not the case for the average economy. Now, who do we seem to funnel value towards in practice? And why do these companies by and large happen to be identical to the internet feudalism from before gen-AI?
Well, I'm open to other ideas than mine. I mean you propose a clear solution here: Fair Use. Now I would have expected you to have analysed the situation and have some solution on how that content is supposed to get there. I mean it's not created out of thin air. And the other side of the coin has to be factored in as well once we're talking about introducing laws.
I think the entire content industry isn't a healthy model. And the average individuals working there aren't well off. And it doesn't seem like we're on a path where this is going to improve in the future. So there aren't any "extra copies". And these people don't have gifts to hand out.
In some cases we already know AI directly takes business away. Freelancers, like illustrators and musicians... Without an industry and other entities in between, they're the first whose work gets fed upon, and the same technology directly takes away their business opportunities.
So what's with content in the early 21st century and in the upcoming age of AI? Is it as easy as leave everything as is and slap Fair Use on top? Does that solve a single issue with anything? Or is that just supposed to make business cheaper for some AI companies with a random effect on everyone else?