vcmj

joined 2 years ago
[–] vcmj@programming.dev 1 points 2 years ago (1 children)

I've not played with it much, but does it always describe the image first like that? I've been trying to think about how the image input actually works. My personal suspicion is that it uses an off-the-shelf visual understanding network (think reverse Stable Diffusion) to generate a description, then just uses GPT normally to complete the response. This could explain the disconnect here, where it can't erase what the visual model wrote, but that could all fall apart if it doesn't always follow this pattern. Just thinking out loud here.

[–] vcmj@programming.dev 1 points 2 years ago* (last edited 2 years ago) (1 children)

Thanks for the detailed reply. I see that I did indeed misunderstand what he was saying. I'm an R&D engineer, so I guess my knee-jerk response to character-level mischief is exactly what you said: it can't see them anyway. I already knew that, so I dismissed that possible interpretation in my mind straight out of the gate. Maybe I should assume zero knowledge of internal AI workings when reading commentary in the wild.

Edit: Actually, just thought of a good analogy for this. Say I play a sound and then ask you what it is. You might reply "it sounds like a bell", but if I asked for the exact composition of frequencies that made the sound, you might not be able to say. Similarly, the AI sees a group of letters as a definite "thing" (a token), but it doesn't know what actually went into it because its "ears" (the tokenizer) already reduced it to a simpler signal.
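To make the analogy concrete, here is a rough sketch of what the "ears" do. The vocabulary and the greedy longest-match rule below are made up for illustration; real tokenizers (BPE and friends) use learned vocabularies and different merge rules, so treat this as a toy, not how any particular model does it.

```python
# Hypothetical toy vocabulary, NOT taken from any real model.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}

def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into integer token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

word = "strawberry"
print(tokenize(word))   # [0, 1] -> all the model receives
print(word.count("r"))  # 3 -> trivial on characters, invisible in the IDs
```

The model only ever gets `[0, 1]`. Nothing in those two numbers says how many times "r" appears, which is the "composition of frequencies" part of the bell analogy.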

[–] vcmj@programming.dev 0 points 2 years ago (3 children)

?? Literally the entire purpose of the transformer architecture is to manipulate text, so how is it bad at that? Am I misunderstanding this? Summarization, thematic transformation, language translation, etc. are all things AI is fantastic at...
