The METR study, published in July 2025, is the first rigorous independent measurement of AI coding tools with experienced developers working on real codebases. The result: developers using Cursor Pro with Claude 3.5/3.7 Sonnet completed tasks 19% slower than those working without AI assistance.
Not 19% faster. Slower.
Sixteen experienced open-source developers. 246 real issues. Mature projects averaging over a million lines of code each. These weren't synthetic benchmarks or toy problems; they were actual GitHub issues on production codebases with 22,000+ stars.
Before the study, developers predicted AI would give them a 24% speedup. After completing their tasks, they still believed they'd gotten a 20% speedup, even though objective measurement showed the opposite.
The gap between feeling fast and being fast
That perception-reality mismatch is the most important finding in the study. Developers aren't lying. They genuinely feel faster. The tools reduce friction in ways that register as productive: less typing, faster first drafts, fewer blank-page moments. But those psychological benefits don't translate to faster task completion.
The METR researchers identified five friction sources that compound invisibly:
- Prompting time: Formulating requests, providing context, iterating on unclear outputs
- Review overhead: Reading and validating AI suggestions takes longer than writing equivalent code yourself
- Integration complexity: AI-generated code often needs significant rework to fit existing patterns
- Context-switching: Moving between "directing the AI" and "writing code" modes disrupts flow
- The "almost correct" problem: Over 56% of AI suggestions were rejected, turning the tool into overhead rather than acceleration
56% of AI-generated code gets discarded. When you're throwing away more than half the output, you're not using an accelerator; you're operating a suggestion engine with a high false-positive rate.
If the METR findings seem to contradict everything you've heard about AI coding productivity, that's because vendor-funded research shows dramatically different results. GitHub's study found developers completed tasks 55% faster with Copilot. Google's internal research showed 21% faster task completion. A multi-company study across Microsoft and Accenture reported 26% productivity gains.
So who's right?
The methodological differences explain most of the gap. Vendor studies typically use isolated tasks (GitHub's study had developers write an HTTP server in JavaScript from scratch), often with junior developers, writing greenfield code with no existing constraints. METR measured experienced developers navigating complex existing systems with production-level requirements. This isn't to say vendor research is worthless. It measures something real: for isolated, well-defined tasks with clear specifications, AI tools provide genuine acceleration. The question is whether that maps to how professional developers actually spend their time.
Experience flips the equation
One finding appears consistently across studies: junior developers benefit more than seniors.
According to Addy Osmani's synthesis of multiple research sources, junior developers see 35-39% productivity improvements, while senior developers see gains of only 8-16%, and sometimes outright slowdowns. This makes intuitive sense. AI tools are particularly good at generating boilerplate, handling unfamiliar syntax, and providing starting points. These are exactly the tasks that slow down inexperienced developers. Senior developers already have this knowledge cached; for them, the AI is more likely to produce suggestions that don't match their mental model of the codebase.
Even when individual developers complete more tasks, organizational productivity tells a different story. The 2025 DORA/Faros report (as cited in Osmani's analysis) found that teams with heavy AI use completed 21% more tasks, but code review times increased 91%. PR sizes grew 154%. Bug rates rose 9% per developer.
Our read: developers are shipping more code, but that code requires more review effort and contains more defects. The productivity gains at the individual level don't translate to team outcomes. Larger PRs with more bugs, taking nearly twice as long to review, aren't obviously an improvement.
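The compounding here is easy to underestimate: more tasks and longer per-task reviews multiply rather than add. A back-of-envelope sketch, using hypothetical baseline numbers (the percentages are from the report; the baselines are illustrative):

```python
# Illustrative baselines (hypothetical, not taken from the DORA/Faros report)
baseline_tasks = 100           # tasks completed per quarter
review_hours_per_task = 2.0    # average review time per task

# Apply the reported relative changes
tasks_with_ai = baseline_tasks * 1.21          # +21% more tasks completed
review_with_ai = review_hours_per_task * 1.91  # +91% longer reviews

baseline_load = baseline_tasks * review_hours_per_task  # 200 review hours
ai_load = tasks_with_ai * review_with_ai                # ~462 review hours

print(round(ai_load / baseline_load, 2))  # 2.31: total review load more than doubles
```

Under these assumptions, a team completing 21% more tasks absorbs roughly 2.3x the total review load, which is where individual gains quietly turn into an organizational cost.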
Adoption up, trust down
The 2025 Stack Overflow Developer Survey reveals a curious pattern.
84% of developers now use or plan to use AI coding tools (up from 76% in 2024). But only 60% view them favorably, down from over 70% in 2023. 46% actively distrust AI accuracy, compared to just 33% who trust it. Only 3% report "highly trusting" AI output.
The top frustration, cited by 66% of developers: "solutions that are almost right, but not quite."
This aligns perfectly with METR's finding about the 56% rejection rate. Developers have learned through experience that AI suggestions require careful validation. Just 4.4% of surveyed developers believe AI handles complex tasks "very well." The experienced developers in the survey were the most skeptical, with 20% reporting they "highly distrust" AI output.
None of this means AI coding tools are useless. The research points to clear use cases where they provide genuine value:
- Onboarding to unfamiliar codebases through AI-assisted exploration
- Boilerplate and repetitive code (the original Copilot use case still works)
- Learning new languages or frameworks with AI as a patient tutor
- Scaffolding for junior developers, who benefit from it most
- Writing tests, particularly straightforward unit tests
The problem isn't the tools. It's the claims.
"10x productivity" doesn't survive contact with rigorous measurement. If you're using AI coding tools and finding them valuable, keep using them. Subjective experience isn't nothing. If you're in flow, producing code, and shipping features, the tools are working for you regardless of what any study says.
But be skeptical of productivity claims, especially from vendors. AI hasn't solved the measurement problem in software development; it's made it worse. When someone tells you their tool delivers 50% productivity gains, ask: measured how, on what tasks, with which developers?
If you're a manager evaluating AI tool adoption, remember the organizational findings: individual task completion doesn't map directly to team productivity. Your developers may complete more tasks while your overall output stays flat or declines.
The honest assessment: AI coding tools provide real value in specific contexts, genuine psychological benefits that improve the developer experience, and modest measurable productivity gains for some use cases. They do not deliver the transformational improvements that marketing materials promise.
That's less exciting than "10x engineer" narratives. It's also closer to the truth.