[–] dipshit@lemmy.world 21 points 2 years ago

An AI / LLM only tries to predict the next word or token. It cannot understand or reason; it can only sound like someone who knows what they are talking about. You said elephants, so it gave you elephants. The “no” modifier makes sense to us but not to the AI. It could, if we programmed it with if/then statements, but that’s not an LLM, that’s just coding.

AI is really, really good at bullshitting.
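
A minimal sketch of what “predicting the next token” means in practice, using the Hugging Face transformers library with GPT-2 (the model choice is just for illustration):

```python
# Sketch: an LLM assigns a probability to every possible next token;
# a sampler then picks one. Nothing in this loop "understands" elephants.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A room with absolutely no"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the very next token after the prompt
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.3f}")
```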

[–] Turun@feddit.de 26 points 2 years ago* (last edited 2 years ago)

AI / LLM only tries to predict the next word or token

This is not wrong, but also absolutely irrelevant here. You can be against AI, but please make the argument based on facts, not by parroting some distantly related talking points.

Current image generation is powered by diffusion models. Their inner workings are completely different from large language models. The part failing here in particular is the text encoder (CLIP). If you learn how it works and think about it, you'll be able to deduce how the image generator is forced to draw this image.
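
A hedged sketch of where the text encoder goes wrong, using the transformers CLIP classes (the openai/clip-vit-base-patch32 checkpoint is named only as an example): comparing prompt embeddings shows how poorly negation survives encoding.

```python
# Sketch: CLIP squeezes a whole prompt into one embedding vector.
# Negation survives poorly, so "no elephants" still pulls the
# embedding toward elephant-ness.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

name = "openai/clip-vit-base-patch32"  # example checkpoint
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModelWithProjection.from_pretrained(name)

prompts = ["an empty room", "an empty room with no elephants", "an elephant"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    embeds = text_model(**inputs).text_embeds  # (3, embed_dim)

embeds = embeds / embeds.norm(dim=-1, keepdim=True)
print(embeds @ embeds.T)  # compare row 1 ("no elephants") to rows 0 and 2
```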

Edit: Because this is an obvious limitation, negative prompts have existed pretty much since diffusion models came out.
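
In the diffusers library, for example, a negative prompt is a single argument (sketch assuming the runwayml/stable-diffusion-v1-5 checkpoint, named purely for illustration):

```python
# Sketch: the negative_prompt argument tells the sampler what to steer
# away from, which is how "no elephants" is actually expressed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="an empty room",
    negative_prompt="elephant",  # subtract the elephant concept
).images[0]
image.save("empty_room.png")
```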

[–] dipshit@lemmy.world 4 points 2 years ago

Does the text encoder use natural language processing? I assumed it was working similarly to how an LLM would.

[–] Turun@feddit.de 4 points 2 years ago

No, it does not, at least not in the same way that generative pre-trained transformers do. It does handle natural language, though.

The research is all open source if you want details. For Stable Diffusion you'll find plenty of pretty graphs that show how the different parts interact.

[–] dipshit@lemmy.world 4 points 2 years ago

There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

I understand the image generation works differently. As I gather, it starts with noise and a random seed, and then, via learned networks, the model follows pathways (“automagic” goes here) derived from what was recognized in the text with NLP: something in the end like “elephant (subject) 100% confidence, big room (background) 75% confidence, windows (background) 75% confidence”. I assume it then “merges” the things it thinks make up those tokens with the noise and (more “automagic” goes here) puts them where they need to go.
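
For intuition, the “automagic” denoising loop is roughly the following schematic; none of these names are a real API, they are all stand-ins for illustration:

```python
# Schematic of diffusion sampling, illustrative pseudocode only.
import torch

def sample(unet, scheduler, text_embedding, steps=50):
    """unet, scheduler and text_embedding are hypothetical stand-ins."""
    latents = torch.randn(1, 4, 64, 64)  # start from pure noise (the seed)
    for t in scheduler.timesteps(steps):
        # The U-Net predicts what part of the current latents is noise,
        # conditioned on the text embedding (this is where "elephant" leaks in).
        noise_pred = unet(latents, t, text_embedding)
        # Remove a little of that predicted noise.
        latents = scheduler.step(noise_pred, t, latents)
    return latents  # decoded to pixels by a separate VAE decoder
```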

[–] Turun@feddit.de 2 points 2 years ago

There would still need to be a corpus of text and some supervised training of a model on that text in order to “recognize” with some level of confidence what the text represents, right?

Correct. The CLIP encoder is trained on images and their corresponding descriptions. Therefore it learns the names for things in images.
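
Roughly, that training pulls matching image/caption pairs together and pushes mismatched ones apart under a contrastive objective; a schematic sketch (not the actual OpenAI training code):

```python
# Schematic CLIP-style contrastive loss, illustrative only.
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature
    # Row i should score highest at column i (its own caption).
    labels = torch.arange(len(logits))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```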

And now it is obvious why this prompt fails: there are no images of empty rooms tagged as "no elephants". This can be fixed by adding a negative prompt, which subtracts the concept of "elephants" from the image in one of the automagical steps.
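
That “automagical” subtraction is classifier-free guidance. A schematic of a single step, with hypothetical names used only for illustration:

```python
# Schematic: at each denoising step the sampler runs the U-Net twice
# and extrapolates away from the negative prompt. Illustrative only.
def guided_noise(unet, latents, t, cond_embed, neg_embed, scale=7.5):
    noise_cond = unet(latents, t, cond_embed)  # "an empty room"
    noise_neg = unet(latents, t, neg_embed)    # "elephant"
    # Push the prediction away from the elephant direction.
    return noise_neg + scale * (noise_cond - noise_neg)
```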
