Study claims a 20% productivity loss from use of AI tools. Huge if true.

anon_puqy said in #3749 2w ago:

https://x.com/METR_Evals/status/1943360399220388093

They tried to measure how much LLM assistance actually speeds up technical work, but it came out negative! Programmers thought they would get +20%; they actually got -20%. What do you guys make of this?

The setup seems to be that they used professional programmers working on existing issues on GitHub repos, and randomly assigned them either access to LLM tools or not. If there's nothing I'm missing, that's a pretty good methodology.
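
To make the headline number concrete, here's a toy version of the arithmetic that produces a figure like -20% (made-up task times, and METR's actual analysis is a proper regression, so treat this as a sketch of the idea only):

    # Toy comparison of task completion times, AI-allowed vs AI-disallowed.
    # Numbers are hypothetical, not METR's data.
    import numpy as np

    t_no_ai = np.array([1.8, 2.5, 0.9, 3.1, 1.2, 2.0])   # hours, hypothetical
    t_ai    = np.array([2.3, 2.9, 1.1, 3.8, 1.5, 2.4])   # hours, hypothetical

    # Ratio of geometric means; > 1 means AI-allowed tasks took longer.
    ratio = np.exp(np.log(t_ai).mean() - np.log(t_no_ai).mean())
    print(f"AI-allowed tasks took {ratio:.2f}x as long (~{(ratio - 1) * 100:.0f}% slower)")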

But this doesn't fit with all the vibe coding hype (which actually seems to be dying down, if I'm not mistaken). Vibe coders claim huge speedups and new capabilities, but Jonathan Blow keeps challenging them to show impressive software coded with LLM assistance and say how long it took, and afaik no one has taken him up on that. Meanwhile YC and big companies are apparently all in on vibe coding. What's going on?

I've met some people who apparently use LLM assistance effectively and swear it massively speeds up their ability to create complex apps. But others, while claiming similar things, seem to be engaged in a sort of ecstatic gnosis of the sort you see with psychedelic apologists. Does LLM use similarly dull your ability to see the reality of what you're actually doing?

In my own experience, LLM tools are great for looking up algorithms and for thinking things through, used as a rubber duck that knows the literature and can do some math drudge work. But they noticeably have no ability to make tasteful tradeoffs, and just think everything is brilliant. So what seems like a good idea while talking to an LLM often turns out to be a bad idea on less technologized reflection.

For coding, it's nice to have them spit out some boilerplate SQL interface code of the sort that shouldn't exist in the first place, but it's always twice as big as it should be, and often filled with subtle bugs. As soon as you get into subtle algorithms or making particular changes to complex systems, they are basically useless. They make too many assumptions and run off to do more stuff than can be understood, much of which breaks existing system assumptions or doesn't work. And once you give them instructions detailed enough that they don't trip up, you might as well write it yourself. I've stopped using the coding-agent type stuff because it was just too frustrating, and it tended to be easier to just do it myself. I still use autocomplete, which seems like a boost, but this study has me questioning even that.

What's your experience, and do you think this study is sound? How well does the result generalize?

judges said in #3755 2w ago:

>do you think this study is sound? How well does the result generalize?

The study is never sound. The results never generalize.

First of all, the study is done on 16 developers. That's a tiny, tiny sample. If you're looking for "do parachutes improve longevity when you jump out of an airplane"-type effects, this isn't necessarily a problem, but if you're looking for anything that requires actual statistics to tease out then you're fucked.
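
To put a number on it: a back-of-envelope power calculation, pretending purely for illustration that the 16 devs were split into two independent arms of 8 (the real design randomized tasks within developers, so this is only a rough analogy):

    # Illustrative only: power of a two-sample t-test with 8 subjects per arm,
    # across a range of effect sizes. Parachute-sized effects are detectable;
    # modest, realistic ones are hopeless at this sample size.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for d in (0.3, 0.5, 0.8, 2.0):   # small, medium, large, "parachute"
        power = analysis.solve_power(effect_size=d, nobs1=8, alpha=0.05)
        print(f"Cohen's d = {d}: power = {power:.2f}")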

In Twitter discussion (https://x.com/eshear/status/1944895440224501793), Shear points out that METR counted "experienced devs" as people who had used completely different LLM tools, and argues that using those well is a completely different skillset from using Cursor, the software used in the study. They did also separately check Cursor experience, and apparently the single person who was recorded as experienced *with Cursor specifically* saw their speed increase.

But then Bloom, one of the study participants, chimes in (https://x.com/ruben_bloom/status/1944933334569902300). He claims he's actually somewhat experienced with Cursor, despite how he's recorded in the paper. Apparently he told METR's survey people that he had "10-100" hours of experience with Cursor; it's not clear whether he was then put in METR's "1-10 hours" bucket, "10-30 hours", or "30-50 hours". Evidently they didn't put him in the ">50 hours" bucket. (I'd blindly guess they decided to be conservative and pegged Bloom at 10, but who knows.) On Twitter Bloom claims the real figure is more like 150.
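
To make the coding problem concrete, here's a toy reconstruction (bucket boundaries guessed from the description above, not METR's actual survey instrument):

    # A "10-100 hours" answer overlaps four of the five buckets, so whichever
    # single bucket a coder picks is an arbitrary choice, not data.
    BUCKETS = {                       # hypothetical reconstruction of the ranges
        "<1 hour":     (0, 1),
        "1-10 hours":  (1, 10),
        "10-30 hours": (10, 30),
        "30-50 hours": (30, 50),
        ">50 hours":   (50, float("inf")),
    }

    def overlapping_buckets(lo, hi):
        # Boundaries treated as inclusive, since "10-100" plausibly touches "1-10".
        return [name for name, (a, b) in BUCKETS.items() if lo <= b and hi >= a]

    print(overlapping_buckets(10, 100))
    # ['1-10 hours', '10-30 hours', '30-50 hours', '>50 hours']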

Which figure is accurate? How should METR have coded a response when a participant gives an answer spanning four of their five buckets? Who the hell knows! The upshot is, of the data points we have public info on, 1 out of 1 looks like a random artifact of survey wording and arbitrary choices about how to record ambiguous responses, and plausibly just outright false based on the participant totally misremembering, rather than anything about the underlying phenomenon they're trying to study. And remember, Bloom is 1/16th of the entire population studied here! Even if the other fifteen were perfectly clear and it's bad luck that the only one we know about is the one that was fucked (lol no it's not, they're all this bad), that's still gonna totally sink your conclusions.

Survey design is hard. Most surveys give you garbage, apparently including this one. The unbreakable rule of epistemology is "garbage in, garbage out".

I'm also very skeptical of the vibe coding hype, but this study doesn't tell us anything about it.

referenced by: >>3781

anon_pupa said in #3779 2w ago:

referenced by: >>3781

A statistical critiq

anon_puqy said in #3781 2w ago:

>>3755
>>3779
Heh. Good stuff guys. "nothing ever replicates" is the null hypothesis, and this seems to fail to falsify it.

anon_kali said in #3783 2w ago:

Do we need a study to really know this? Search your feelings, you know it to be true
