They had SWEs do a set of tasks and then gave each task a difficulty score based on how long it took them to complete. So if a model succeeds half the time on tasks that took the engineers up to 8 minutes, it gets an 8-minute score.
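For the unfamiliar, the headline number works roughly like this: fit success probability against (log) human task time and read off where the fitted curve crosses 50%. A minimal sketch with invented toy data, assuming a simple logistic fit (not METR's actual code, just the shape of the idea):

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented toy data: human completion time (minutes) per task,
# and whether the model solved it (1 = success, 0 = failure).
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120], dtype=float)
model_success = np.array([1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def logistic(log_t, a, b):
    """P(success) as a function of log task duration."""
    return 1.0 / (1.0 + np.exp(-(a + b * log_t)))

# Least-squares fit of the logistic curve to the binary outcomes.
params, _ = curve_fit(logistic, np.log(task_minutes), model_success,
                      p0=[2.0, -1.0], maxfev=10000)
a, b = params

# 50% horizon: duration where the fitted curve crosses p = 0.5,
# i.e. a + b*log(t) = 0  =>  t = exp(-a/b).
horizon_50 = np.exp(-a / b)
print(f"estimated 50% time horizon: {horizon_50:.1f} min")

# Picking a different success threshold p moves the horizon:
# a + b*log(t) = logit(p)  =>  t = exp((logit(p) - a) / b).
def horizon(p):
    return np.exp((np.log(p / (1 - p)) - a) / b)
```

Note how the whole "horizon" is one point pulled off one fitted curve; demanding, say, 80% success instead of 50% gives a different (shorter) horizon, which is exactly why the trend is sensitive to the choice of threshold.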
BigMuffN69
METR once again showing why fitting a model to data != the model having any predictive power. Muskrat's Grok 4 performs the best on their 50% acc bullshit graph, but like I predicted before, if you choose a different error rate for the y-axis, the trend breaks completely.
Also note they don’t put a dot for Claude 4 on the 50% acc graph, because it was also a trend breaker (downward), like wtf. Sussy choices all around.
Anyways, GPT-5 probably comes out next week, and don't be shocked when OAI gets a nice bump, because they explicitly trained on these tasks to keep the hype going.
"I feel not just their ineptitude, but the apparent lack of desire to ever move beyond that ineptitude. What I feel toward them is usually not sympathy or generosity, but either disgust or disappointment (or both)." - Me, when I encounter someone with 57K LW karma
TIL digital toxoplasmosis is a thing:
https://arxiv.org/pdf/2503.01781
Quote from abstract:
"...DeepSeek R1 and DeepSeek R1-distill-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending *Interesting fact: cats sleep most of their lives* to any math problem leads to more than doubling the chances of a model getting the answer wrong."
(cat tax) POV: you are about to solve the RH but this lil sausage gets in your way
Ernie Davis gives his thoughts on the recent GDM and OAI performance at the IMO.
https://garymarcus.substack.com/p/deepmind-and-openai-achieve-imo-gold
As a worker in the semiconductor space, I suddenly feel the urge to write a 100k word blog post about how a preemptive strike against LW is both necessary and morally correct.
I spend a lot of my professional life modeling this kind of data. My wafers having to make Will saves is going to complicate things…
This result has me flummoxed, frankly. I was expecting Google to get a gold medal this year, since last year they won a silver and were a point away from gold. In fact, Google did announce, after OAI, that they had won gold.
But the OAI claim is that they have some secret sauce that allowed a “pure” LLM to win gold, and that the approach is totally generic: no search or tools like verifiers required. Big if true, but ofc no one else is allowed to gaze at the mystery machine. It is hard for me to take them seriously given their sketchy history, yet the claim as stated has me shooketh.
Also, funny aside: the guy who led the project was poached by the zucc. So he’s walking out the front door with the crown jewels lmaou.
My hot take has always been that current Boolean-SAT/MIP solvers are probably pretty close to theoretical optimality for the problems that are interesting to humans, and that AI, no matter how "intelligent", will struggle to meaningfully improve them. Ofc I doubt that Mr. Hollywood (or Yud, for that matter) has actually spent enough time with classical optimization lore to understand this. Computer go FOOM ofc.
True. They aren't building city sized data centers and offering people 9 figure salaries for no reason. They are trying to front load the cost of paying for labour for the rest of time.
Remember last week when that study on AI's impact on development speed dropped?
A lot of peeps' takeaway from this little graphic was "see, impacts of AI on sw development are a net negative!" I think the real takeaway is that METR, the AI safety group running the study, is a motley collection of deeply unserious clowns pretending to do science, and that their experimental setup is garbage.
https://substack.com/home/post/p-168077291
"First, I don’t like calling this study an “RCT.” There is no control group! There are 16 people and they receive both treatments. We’re supposed to believe that the “treated units” here are the coding assignments. We’ll see in a second that this characterization isn’t so simple."
(I am once again shilling Ben Recht's substack. )
💯