lily33

joined 2 years ago
[–] lily33@lemmy.world -3 points 2 years ago (2 children)

If you give me several paragraphs instead of a single sentence, do you still think it's impossible to tell?

[–] lily33@lemmy.world 2 points 2 years ago* (last edited 2 years ago) (6 children)

I don't see how that affects my point.

  • Today's AI detectors can't tell the output of today's LLMs apart from human text.
  • A future AI detector WILL be able to tell the output of today's LLMs apart.
  • Of course, a future AI detector won't be able to tell the output of a future LLM apart.

So at any point in time, only recent text could be "contaminated". The claim that "all text after 2023 is forever contaminated" just isn't true. Researchers would simply have to be a bit more careful about including it.

[–] lily33@lemmy.world 6 points 2 years ago (11 children)

Not really. If it's truly impossible to tell the text apart, then it doesn't really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don't train AI on super-recent text anyway.

[–] lily33@lemmy.world 0 points 2 years ago

They don't redistribute. They learn information about the material they've been trained on - not the material itself* - and can use it to generate material they've never seen.

  • Bigger models seem to memorize some of the material and can infringe, but that's not really the goal.
[–] lily33@lemmy.world 4 points 2 years ago* (last edited 2 years ago)

Language models actually do learn things, in the sense that the information encoded in the trained model isn't usually* taken directly from the training data; instead, it's information that describes the training data, but is new. That's why they can generate text that's never appeared in the data.

  • the bigger models seem to remember some of the data and can reproduce it verbatim; but that's not really the goal.
[–] lily33@lemmy.world 2 points 2 years ago* (last edited 2 years ago) (2 children)

It's specifically distribution of the work or derivatives that copyright prevents.

So you could make an argument that an LLM that's memorized the book and can reproduce (parts of) it upon request is infringing. But one that's merely trained on the book, but hasn't memorized it, should be fine.

[–] lily33@lemmy.world -4 points 2 years ago* (last edited 2 years ago) (4 children)

Why should such a thing be assumed????

[–] lily33@lemmy.world 1 points 2 years ago* (last edited 2 years ago)

It's actually a real problem on reddit, where people spin up fake users to manipulate votes. Reddit hasn't published exactly how they detect that, but one way to do it is to look for bad voting patterns, like one account systematically upvoting/downvoting another. But you pretty much can't do that without knowing the votes.
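For illustration, here's a minimal sketch of that kind of pattern check. The (voter, author, value) tuple format and the thresholds are made up for the example, not Reddit's or Lemmy's actual data model:

```python
from collections import Counter

def suspicious_pairs(votes, min_votes=20, min_ratio=0.9):
    """votes: iterable of (voter, author, value) tuples, value is +1 or -1.

    Flags voter/author pairs where one account upvotes another's posts
    almost every time it votes on them - a crude sockpuppet signal.
    """
    upvotes, totals = Counter(), Counter()
    for voter, author, value in votes:
        if voter == author:
            continue
        totals[(voter, author)] += 1
        if value > 0:
            upvotes[(voter, author)] += 1
    return [
        (pair, upvotes[pair], total)
        for pair, total in totals.items()
        if total >= min_votes and upvotes[pair] / total >= min_ratio
    ]
```

None of this works if the votes themselves are hidden, which is the point.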

[–] lily33@lemmy.world 3 points 2 years ago (2 children)

True - but it'll be much easier to detect.

[–] lily33@lemmy.world 12 points 2 years ago* (last edited 2 years ago)

That last point is completely impossible. Don't forget that I don't have to run the official lemmy software on my instance. I can make changes: for example, I can add a feature to my instance like "log every post in a separate, local database before deleting it from lemmy". Nobody else but me will know this feature exists. Or (to be AGPL compliant) have a separate tool to regularly back up my lemmy database, undoing deletions.
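As a rough sketch of how simple such a backup tool could be - the table and column names here are guesses for illustration, not Lemmy's actual schema:

```python
import time
import psycopg2

# Hypothetical archive table mirroring a few columns of a "post" table.
SETUP_SQL = """
    CREATE TABLE IF NOT EXISTS post_archive (
        id bigint PRIMARY KEY,
        creator_id bigint,
        name text,
        body text,
        published timestamptz
    );
"""

# Copy every post not yet archived; rows already archived are left alone,
# so later deletions in the main table don't remove anything here.
ARCHIVE_SQL = """
    INSERT INTO post_archive (id, creator_id, name, body, published)
    SELECT id, creator_id, name, body, published FROM post
    ON CONFLICT (id) DO NOTHING;
"""

def archive_once(dsn):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # commits on success
            cur.execute(SETUP_SQL)
            cur.execute(ARCHIVE_SQL)
    finally:
        conn.close()

def archive_forever(dsn, interval_seconds=3600):
    while True:
        archive_once(dsn)
        time.sleep(interval_seconds)
```

Run something like that from cron against the instance database, and nothing anyone "deletes" is ever really gone from that copy.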

As for the second point: I'd say making local votes private while non-local votes stay public would be worse for privacy, because of the confusion it would cause.

[–] lily33@lemmy.world 8 points 2 years ago* (last edited 2 years ago) (3 children)

I'd go the other way: make these things officially public, so people know they are and aren't taken by surprise.

Private voting can be tricky in a federated setting, because I could have a malicious instance that boosts my posts (I could do that with public votes too, but then it's easier to detect). Truly private posting history is outright impossible, as you said, due to crawlers.

The way to privacy is to make sure not to dox your account, and perhaps alternate between 2-3 accounts if it's really important to you.

[–] lily33@lemmy.world 21 points 2 years ago (4 children)

Frankly, I think someone should actually do that. Except maybe use open source AI instead of ChatGPT.

The fact is, in a federated setting all this data will be accessible. For example, if lemmy tried to hide who made each vote and just federated totals, that would allow my malicious instance to report 1M upvotes for my post.
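To make that concrete, here's a toy sketch of the difference - the message shapes below are simplified stand-ins, not real ActivityPub or Lemmy payloads:

```python
# Per-vote federation: every vote names an actor, so a receiving instance can at
# least check that the actors exist, rate-limit them, or defederate a vote farm.
def per_vote_activities(post_url, voters):
    return [{"type": "Like", "actor": voter, "object": post_url} for voter in voters]

# Totals-only federation: a single unverifiable number. A malicious instance can
# claim whatever it likes, and nobody can audit it.
def totals_only_report(post_url, claimed_score):
    return {"type": "ScoreUpdate", "object": post_url, "score": claimed_score}

forged = totals_only_report("https://evil.example/post/1", 1_000_000)
```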

When lemmy tries to hide this data, all it does is instill a false sense of privacy in users. IMHO the best thing is to make all this de facto public data officially public, so everyone knows and can act accordingly.

As for privacy, I'd say the best thing to do is to keep your account anonymous.

23
submitted 2 years ago* (last edited 2 years ago) by lily33@lemmy.world to c/linux@lemmy.ml
 

I'm looking for an open-source alternative to ChatGPT which is community-driven. I have seen some open-source large language models, but they're usually still made by some organizations and published after the fact. Instead, I'm looking for one where anyone can participate: discuss ideas on how to improve the model, write code, or donate computational resources to build it. Is there such a project?
