this post was submitted on 31 Jul 2025
30 points (89.5% liked)

No Stupid Questions


AI has become invasively popular, and I've seen more evidence of its ineffectiveness than otherwise. But what I dislike most about it is that many models run on datasets of stolen data for the sake of profitability, à la OpenAI and DeepSeek:

https://mashable.com/article/openai-chatgpt-class-action-lawsuit
https://petapixel.com/2025/01/30/openai-claims-deepseek-took-all-of-its-data-without-consent/

Are there any AI services that run on ethically obtained datasets, like stuff people explicitly consented to submitting (not as some side clause of a T&C), data bought by properly compensating the data's original owners, or datasets contributed by the service providers themselves?

all 18 comments
[–] decended_being@midwest.social 28 points 2 days ago (1 children)

There are no legal sources big enough to train an AI on the level required to even perform basic interaction.

[–] Treczoks@lemmy.world 12 points 2 days ago (1 children)

This is very true.

I was part of the OpenAssistant project, voluntarily submitting my personal writing to train open-source LLMs without having to steal data, in the hopes it would stop these companies from stealing people's work and make "AI" less of a black box.

After thousands of people had submitted millions of prompt-response pairs, and after some researchers called it the highest-quality natural-language dataset they'd seen in a while, a base model trained on it alone was still almost always incoherent. You only got a functioning model by using the data to fine-tune an existing larger model (Llama, at the time).

[–] partial_accumen@lemmy.world 17 points 2 days ago (1 children)

Are there any AI services that don't work on stolen data?

Yes, absolutely, but I don't think that's the question you want the answer to. AI is used in many companies and hobby projects where the problem to be solved is so narrow that other people's stolen data wouldn't help you anyway.

Let's say you're a company that sells items at retail online, like a Walmart or Amazon. You want an AI model to help your workers select the best box size to pack various items in for shipment to customers. You would input your past shipment data, including the dimensions of the products you sell (so that data isn't stolen), and the sizes of the boxes you have (they're your boxes, so also not stolen). You could then train a simple classification model on that labeled history. The next time you have a set of items to ship, you'd input those items and the model would tell you the best box size to use. No stolen data in any of this.
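As a toy illustration of that idea: all dimensions and box labels below are made up, and a k-nearest-neighbor vote over past shipments stands in for whatever model a real warehouse system would use.

```python
import math

# Hypothetical past shipments: (total item volume in cm^3,
# longest item dimension in cm) -> box size actually used.
history = [
    ((300.0, 10.0), "small"),
    ((250.0, 8.0), "small"),
    ((1200.0, 20.0), "medium"),
    ((1100.0, 18.0), "medium"),
    ((5000.0, 45.0), "large"),
    ((4800.0, 40.0), "large"),
]

def recommend_box(features, k=3):
    """Vote among the k past shipments most similar to this one
    and return the box size they most often used."""
    nearest = sorted(history, key=lambda row: math.dist(row[0], features))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Recommend a box for a new set of items.
print(recommend_box((1150.0, 19.0)))  # → medium
```

Because the training rows are the company's own shipping records, no outside data is involved at all.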

Now, the question I think you're asking is actually:

"Are there any LLM AI chatbot services that don't work on stolen data?"

That answer, I don't know. Most of the models we're given for setting up chatbots are pretrained by the vendor; you simply add your own data to make them knowledgeable about specific niche subjects.

[–] BartyDeCanter@lemmy.sdf.org 10 points 2 days ago (1 children)

Exactly this. There are plenty of ML/AI systems that build on public datasets, such as AlexNet for image recognition, and even some LLMs that are trained on out-of-copyright documents such as the Project Gutenberg collection. But they almost certainly aren't what you are looking for.

[–] velummortis@lemmy.dbzer0.com 2 points 1 day ago

No, these seem like actually good examples - I'd be interested to use those if they're publicly available

[–] Pamasich@kbin.earth 4 points 1 day ago

Switzerland announced a new LLM project which might be of interest here.

Here's a German article on it. If you're okay with a Reddit link, here's a translation.

Some points on it:

  • fully open source in its entirety: source code, model weights, and training data will all be publicly released
  • licensed under Apache 2.0
  • compliant with Swiss data protection law, copyright law, and the EU AI Act
  • respects crawler opt-outs on websites

While nothing there explicitly says the data is ethically sourced, we'll be able to tell from the open-source training data, and I assume copyright law takes care of things like books being used (though I don't know whether the AI has a way to determine the license of web content, or if it relies entirely on opt-outs there).

[–] razorcandy@discuss.tchncs.de 5 points 2 days ago (3 children)

Some machine learning models are trained on what’s called synthetic data, which is generated specifically for that purpose and mimics real-world data. What I don’t know is how much of the data used is synthetic vs. stolen.
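In the very simplest case, "synthetic data" can mean fitting a distribution to real measurements and sampling fresh values from it. The numbers below are invented, and real synthetic-data generators are far more sophisticated (often model-based), but the principle looks like this:

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" measurements (e.g. session lengths in minutes).
real = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 12.8, 9.4]

# Fit a very simple model of the real data: its mean and spread.
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Sample fresh values from the fitted distribution. The synthetic rows
# mimic the real data's statistics without copying any individual record.
synthetic = [random.gauss(mu, sigma) for _ in range(5)]
print(synthetic)
```

The catch the replies below point out: if the "real" data being modeled was scraped in the first place, the synthetic copy inherits that problem.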

[–] iii@mander.xyz 15 points 2 days ago (1 children)

Then again, the synthetic data is generated with previous generation models, that were trained on scraped data.

Turtles all the way down

[–] skulblaka@sh.itjust.works 5 points 2 days ago

Stolen, but this time passed through an additional bullshit layer for even less reliable results! Buy now!

[–] spankmonkey@lemmy.world 5 points 2 days ago

If the real world data it is based on was stolen then using the synthetic version still counts as stolen.

[–] TheLeadenSea@sh.itjust.works 4 points 2 days ago (1 children)

You can't steal data, just illegally copy it. So no LLM is trained on stolen data.

[–] piecat@lemmy.world 4 points 1 day ago

Okay, so conversion or unjust enrichment instead of literal theft.

[–] TriflingToad@sh.itjust.works 2 points 1 day ago

iirc the AI in Adobe Photoshop is only trained off of the stock images they have the rights to

could be wrong tho, I don't use Adobe

[–] platypode@sh.itjust.works 3 points 2 days ago (1 children)

Getty Images has an image generator trained exclusively on licensed images. I’m not aware of any text generators that do the same.

[–] velummortis@lemmy.dbzer0.com 1 points 1 day ago

Oh, interesting! I'll also take a look at that one

[–] valek879@sh.itjust.works 2 points 2 days ago

I heard about NotebookLM recently. I couldn't tell you what it's trained on, but in order to use the LLM you need to provide it source material.

So say you're writing something for school. You can gather 50+ papers on the subject you're writing about, upload them, then ask the LLM about what you uploaded. It turns research from a search for info into an interview with an "expert."

Again I can't speak to how it was trained in the background but this seems genuinely useful.
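The grounding idea behind that kind of tool can be sketched in a few lines: rank only the user-supplied passages against the question and answer from the best match. This toy word-overlap ranker is not how NotebookLM actually works (real systems use embeddings and an LLM), but it shows why answers stay tied to what you uploaded:

```python
def best_passage(question, passages):
    """Return the user-supplied passage sharing the most words
    with the question (a crude stand-in for semantic retrieval)."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

# Hypothetical "uploaded papers", one passage each.
papers = [
    "Transformers use self attention to weigh tokens in a sequence",
    "Convolutional networks excel at image recognition tasks",
    "Reinforcement learning optimizes actions against a reward signal",
]

print(best_passage("how does attention work in transformers", papers))
# → prints the transformer passage
```

Since the candidate pool is only what the user provided, the "expert" can't cite anything outside those sources.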