FaceDeer

joined 2 years ago
[–] FaceDeer@fedia.io 11 points 3 months ago

I imagine there's also an element of "what can we start building right now," as opposed to waiting a couple of years for R&D before setting up the production lines. A weapon system can be the most wonderful and powerful thing on paper, but if you're under attack you can't deploy a piece of paper.

It's also nice that it turns out old American tech is perfectly capable of dominating Russia's current tech.

[–] FaceDeer@fedia.io 6 points 3 months ago

They're probably still waiting to see if they can pin this on Democrats or immigrants in some manner.

[–] FaceDeer@fedia.io 9 points 3 months ago

Thanks for asking. My comment was off the top of my head based on stuff I've read over the years, so first I did a little fact-checking of myself to make sure. There's still a lot of black magic involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial pretraining of LLMs, to get them to the point where they're capable of responding coherently to prompts, while synthetic data is more often used for the fine-tuning phase, where LLMs are trained to respond to prompts in particular ways. But there doesn't seem to be any reason why synthetic data can't be used for the whole training run; it's just that well-curated, high-quality raw data is already available.

This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA's Nemotron-4 models as another example.

[–] FaceDeer@fedia.io 7 points 3 months ago

Raw source data is often used to produce synthetic data. For example, if you're training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

The resulting synthetic data does not contain any of the raw source, but it's still based on that source. That's one way to keep the AI's knowledge well grounded.
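To make that concrete, here's a minimal sketch of that article-to-Q&A loop, assuming an OpenAI-style chat completions API. The model name, prompt wording, and file name are my own illustrative placeholders, not anything a particular lab has published:

```python
# A minimal sketch of the Q&A-generation loop, assuming an OpenAI-style chat
# API. Model name, prompts, and file name are illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_qa_pairs(article_text: str, n_pairs: int = 5) -> list[dict]:
    """Ask a 'teacher' model to write Q&A pairs grounded in one article."""
    prompt = (
        f"Read the article below and write {n_pairs} question-and-answer "
        "pairs about its content, answering only from the article. Respond "
        'with a JSON array of {"question": ..., "answer": ...} objects and '
        "nothing else.\n\n" + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whatever teacher model is used
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would validate this output before training on it.
    return json.loads(response.choices[0].message.content)


# Each pair becomes one training example for the student chatbot; the raw
# article itself never enters the student's training set.
with open("wikipedia_article.txt") as f:
    for pair in generate_qa_pairs(f.read()):
        print(pair["question"], "->", pair["answer"])
```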

It's a bit old at this point, but last year NVIDIA released Nemotron-4, a set of AI models specifically designed for performing this process. That page might help illustrate the process in a bit more detail.

[–] FaceDeer@fedia.io 54 points 3 months ago (4 children)

Betteridge's law of headlines.

Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It's done so that the data's format and content can be tailored to optimize its value in the training process. Over the past few years it's become clear that simply dumping raw data from the Internet into LLM training isn't a very good approach. It sufficed to bootstrap AI development, but we're kind of past that point now.

Even if there were a problem with training new AIs, that just means they won't get better until the problem is overcome. It doesn't mean they'll perform "increasingly poorly," because the old models still exist; you can just use those.

But lots of people really don't like AI and want to hear headlines saying it's going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I'll give the body of the article credit: if you read more than halfway down, you'll see it raises these sorts of issues itself.

[–] FaceDeer@fedia.io 7 points 3 months ago

Interesting. As poorly as I think of X as an organization, I do hope they follow through with their open system prompt commitment. That's something that other major AI companies should be doing too.

[–] FaceDeer@fedia.io 7 points 3 months ago (1 children)

Could also be malicious compliance on the part of whatever engineer set this up, prompting Grok in a way that makes it obvious what's going on under the hood.

[–] FaceDeer@fedia.io 7 points 3 months ago

> we hear about crime everywhere.

Worth noting that although concern about crime in the US has risen over time, the actual rate of violent crime has fallen dramatically over the past few decades: the overall violent crime rate fell 49% between 1993 and 2022.

I'm not telling you whether your level of concern is appropriate or not; that's up to you, and it may vary with circumstances I don't know about. But generally speaking, I think it's safe to say that levels of concern in the US don't line up very well with the things they're about. Might be worth investigating for yourself and perhaps calibrating your expectations a bit.

[–] FaceDeer@fedia.io 11 points 3 months ago (1 children)

How do you disallow LLMs to train on their data while still allowing humans to train on their data?

[–] FaceDeer@fedia.io 8 points 3 months ago

I know a couple of people in real life who are also on the internet. If the entirety of the internet were a simulation, I could ask them in person about stuff they'd posted and they wouldn't have any idea what I was talking about.

[–] FaceDeer@fedia.io 1 points 3 months ago

Yeah? They generally have plenty of money of their own; the government just pays for a bit of pageantry now and then.
