this post was submitted on 24 Feb 2025
55 points (98.2% liked)

Hacker News

Posts from the RSS Feed of HackerNews.

The feed sometimes contains ads and posts that have been removed by the mod team at HN.

founded 10 months ago
[–] gencha@lemm.ee 14 points 5 months ago (1 children)

How the fuck is it that Microsoft and OpenAI are the two companies releasing so much research that shows the technology they sell is a scam?

[–] Ghyste@sh.itjust.works 6 points 5 months ago

Guessing they show issues so they can then show progress fixing them. With this being a new field, the first real successes translate straight to sales.

[–] chicken@lemmy.dbzer0.com 8 points 5 months ago (1 children)

Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models. The best performing model, Claude 3.5 Sonnet, earns $208,050 on the SWE-Lancer Diamond set and resolves 26.2% of IC SWE issues; however, the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.

That's still kind of crazy if it can actually do some meaningful portion of real world software jobs by itself.

[–] Jesus_666@lemmy.world 6 points 5 months ago

With the caveat that the majority of its "solutions" are wrong. So it generates output that looks plausible enough to be accepted as an answer but is not exactly correct. That's pretty much par for the course for LLMs.

The lack of precision may be acceptable for a chatbot or a summarizer. But for coding you need precision and that's something LLMs don't offer.

[–] lvxferre@mander.xyz 2 points 5 months ago

Give those great people a prize for such an amazing discovery. Preferably an Ig Nobel prize.