this post was submitted on 28 Jan 2025
845 points (94.7% liked)

Office space meme:

"If y'all could stop calling an LLM "open source" just because they published the weights... that would be great."

[–] Prunebutt@slrpnk.net 29 points 6 months ago* (last edited 6 months ago) (1 children)

The point of open source is access to reproducibility. The weights are the end product (like a binary blob); to be open source, you need to supply the way the end product is created.

[–] WraithGear@lemmy.world -2 points 6 months ago* (last edited 6 months ago) (2 children)

So it’s not how it tokenized the data that you are looking for, it’s not how the weights are applied, and it’s not how it functions to structure the output, because these are all open… it’s the entirety of the bulk unfiltered data you want. Data which DeepSeek was provided from other AI projects for initial training, which can be changed to fit user needs, and which doesn’t touch on at all how this LLM is different from other LLMs? As I understand it, this would be like saying that an open source game emulator can’t be open source because Nintendo games are encapsulated? I don’t consider the training data to be the LLM. I consider the system that manipulated that data to be the LLM. Is that where the difference in opinion is?

[–] Prunebutt@slrpnk.net 19 points 6 months ago (1 children)

it’s the entirety of the bulk unfiltered data you want

Or more realistically: a description of how you could source the data.

doesn’t touch on at all how this LLM is different from other LLMs?

Correct. Llama isn't open source, either.

like saying that an open source game emulator can’t be open source because Nintendo games are encapsulated

Not at all. It's like claiming an emulator is open source because it has a plugin system, while you need a closed source build dependency that the developer doesn't disclose to the public.

[–] whotookkarl@lemmy.world 10 points 6 months ago* (last edited 6 months ago) (1 children)

A closer analogy would be only providing the binary output of the emulator build and calling it open source. If you can't reproduce building the output from what they provide, in what way is it reproducible? The model is the output; the training data and the algorithm that builds the model from the training data are the input.

Edit: Say I have a Java project I want to open source. Normally (oversimplifying a bit) the .java source files go through a compiler to build intermediate bytecode in .class files, then there's just-in-time (JIT) compilation to create the binary code as it runs in the JVM. It's not open source if I only share the class files, even if I can use them to recreate source files that can be recompiled into the same class files. Starting at an intermediate step of the process isn't the source.
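
To make the input/output split concrete, here is a toy sketch in Python. Everything in it is made up for illustration: `train`, `fingerprint`, and the four-point dataset are hypothetical stand-ins, not anything from DeepSeek or any real training stack, and real LLM training is generally not bit-for-bit reproducible like this. The only point is that someone holding the data plus the training code can rebuild and check the published weights, while someone holding only the weights can't.

```python
import hashlib
import random


def train(dataset, seed):
    """Toy stand-in for training: fit y = w0 * x + w1 with seeded SGD."""
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    data = list(dataset)
    weights = [0.0, 0.0]
    for _ in range(100):
        x, y = rng.choice(data)
        err = (weights[0] * x + weights[1]) - y
        weights[0] -= 0.01 * err * x
        weights[1] -= 0.01 * err
    return weights


def fingerprint(weights):
    """Hash the weights so two builds can be compared."""
    return hashlib.sha256(repr(weights).encode("utf-8")).hexdigest()


# The inputs: training data plus the training procedure (hypothetical values).
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

published = train(dataset, seed=42)  # what the vendor ships
rebuilt = train(dataset, seed=42)    # what you can rebuild from the inputs
other = train([(0.0, 0.0), (1.0, 1.0)], seed=42)  # same code, different data

print(fingerprint(published) == fingerprint(rebuilt))
print(fingerprint(published) == fingerprint(other))
```

With the same data, code, and seed, the two fingerprints match. Swap the data and they don't, which is why the data (or at least a description of how to source it) counts as part of the input in this analogy.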

[–] WraithGear@lemmy.world -3 points 6 months ago (1 children)

Would it? Not sure how that would be a better analogy. The argument is that it’s nearly all open… but it still does not count because the data set before it’s manipulated by the LLM (in my analogy the data set the emulator is using would be a Nintendo ROM) is not open. A data set that, if provided, would be so massive that it would defeat the point of tokenization and be completely unusable by literally ANYONE without multiple data centers redlining for WEEKS. Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.

An emulator without a ROM mounted is still an emulator, even if not usable.

[–] FooBarrington@lemmy.world 3 points 6 months ago

I don't understand your objections. Even if the amount of data is rather big, it doesn't change that this data is part of the source, and leaving it out makes the whole project non-open-source.

Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.

What? No? Open-source projects literally do meet this standard.