AI coding productivity needs better numbers than lines of code


AI coding productivity has a measurement problem: vendors keep promoting how much code AI writes, while engineering teams need to know whether better software shipped. David Curlewis argues that “percent of code written by AI” is an old lines-of-code metric with a cleaner press release. The useful question is not whether AI touched the codebase. It is whether delivery, reliability, review quality, and customer outcomes improved.

The short version

  • Google, Anthropic, OpenAI, and Cursor have all promoted large code-volume claims, including 75% to 80% AI-written code and more than 100 million lines of enterprise code per day.
  • Those numbers show adoption and usage, not whether AI coding productivity improved product delivery or reduced operational cost.
  • Research is mixed: GitHub reported a 55% task-speed gain for Copilot, Cui et al. found a 26% increase in completed tasks, while other work raised concerns about churn, comprehension, and hard-to-measure agentic workflows.
  • Hacker News readers mostly agreed that lines of code are a dangerous target, but argued over whether the real bottleneck has moved to review, testing, product decisions, and organizational design.

What happened

Curlewis’s post is a critique of the way AI coding productivity is now marketed. He points to code-volume claims from major AI vendors: Google saying 75% of new code is AI-generated, Anthropic and OpenAI making roughly 80% claims, and Cursor highlighting more than 100 million lines of enterprise code written per day. His point is simple: these are volume numbers. They do not show whether the code shipped something useful.

The piece also contrasts today’s volume framing with earlier outcome-oriented claims. GitHub’s Copilot study said developers completed a coding task 55% faster. A Management Science paper by Cui and co-authors, based on field experiments with about 5,000 developers, found a 26% increase in completed tasks. Curlewis also cites research and company-level evidence that complicate the story: more churn, less refactoring, weaker code comprehension in some settings, and executive surveys where AI adoption does not yet translate into obvious measured productivity.

That is why this story belongs in the broader developer tools conversation, not only in AI hype coverage.

Why AI coding productivity is worth watching

AI coding productivity is worth watching because the way it is measured will shape budgets, hiring plans, vendor contracts, and performance reviews. If a company treats AI-written code percentage as proof of productivity, it can reward activity that creates more review work, more maintenance risk, or more unowned code. A higher code count may be a cost signal before it is an output signal.

The better frame is outcome-based. Engineering teams already have sturdier measures: deployment frequency, lead time for changes, change failure rate, mean time to recovery, incident volume, review latency, customer-facing feature throughput, and revenue or retention impact where attribution is credible. None of those are perfect. They are still harder to game than “AI wrote 80% of the code.”
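To make the outcome-based frame concrete, here is a minimal sketch of how four of those delivery measures could be computed from deployment records. The `Deployment` record, its field names, and the sample data are assumptions made up for illustration; they do not come from Curlewis's post or from any particular tool, and a real team would pull these values from its own CI/CD and incident systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical deployment record; field names are illustrative only."""
    committed_at: datetime                   # first commit of the change
    deployed_at: datetime                    # when the change reached production
    failed: bool = False                     # caused an incident or rollback
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def delivery_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute DORA-style delivery metrics over one observation window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [
        hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_week": len(deployments) * 7 / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # 0.0 here means "no recoveries recorded", not "instant recovery".
        "mean_time_to_recovery_hours": (
            sum(recoveries) / len(recoveries) if recoveries else 0.0
        ),
    }


# Made-up sample data: four deployments in a 14-day window, one of which failed.
base = datetime(2025, 1, 6, 9, 0)
SAMPLE = [
    Deployment(base, base + timedelta(hours=4)),
    Deployment(base + timedelta(days=1), base + timedelta(days=1, hours=10)),
    Deployment(
        base + timedelta(days=2),
        base + timedelta(days=2, hours=6),
        failed=True,
        restored_at=base + timedelta(days=2, hours=8),
    ),
    Deployment(base + timedelta(days=5), base + timedelta(days=5, hours=20)),
]

metrics = delivery_metrics(SAMPLE, window_days=14)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

None of these numbers mentions how many lines the model wrote, which is the point: the same calculation works whether the changes were typed by hand or generated.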

The article’s strongest point is not that AI coding tools are useless. Curlewis says engineers should use them. The warning is narrower and more practical: adoption is the start of the process, not the scoreboard. Good AI coding productivity measurement should survive a skeptical finance review and a skeptical staff engineer review.

What does AI coding productivity change for builders?

AI coding tools change the bottleneck for builders by making code generation cheaper while making judgment more important. Teams can now produce drafts, tests, scaffolding, migrations, and exploratory implementations faster. That helps if the team has enough product clarity, review capacity, test coverage, observability, and ownership to absorb the extra output.

The risk is that teams buy speed at the wrong layer. If product direction is unclear, faster implementation just creates more discarded work. If tests are weak, generated code can raise the review burden. If senior engineers are already the bottleneck, a flood of AI-assisted pull requests may slow them down. The practical builder question is therefore not “how much code did the model write?” It is “which constraint moved, and which constraint became worse?”

Developer tool companies should pay attention too. Raw usage metrics are easy to market, but outcome metrics are harder for competitors to copy. A tool that can show reduced review time, lower rollback rates, better test coverage, or faster safe migrations has a stronger product story than one that only counts tokens or generated lines.

What Hacker News readers are arguing about

The Hacker News discussion around the post was active, with more than 290 comments when this brief was written. The dominant reaction was agreement with the Goodhart’s Law problem: once lines of code become a target, they stop being a useful measure. Several commenters compared AI code-volume bragging to older mistakes around PR counts, test coverage targets, and other metrics that can be inflated without improving the product.

The more useful disagreement was about bottlenecks. One camp argued that writing code has not been the hard part for years; deciding what to build, reviewing changes, testing, and maintaining systems are the constraints. Another camp pushed back that a larger volume of correct code can still be useful if teams add better quality control, proofs, statistical checks, or automated review. That is a fair caveat, but it also supports the article’s point: the quality system matters more than the raw output count.

There was also skepticism about AI evangelism at the end of the piece. Some readers asked why every engineer should use AI daily if the productivity evidence is still messy. Others answered with practical examples: documentation search, Playwright test generation, boilerplate, and faster exploration. The thread was not anti-AI so much as anti-vanity-metric. Most of the heat was aimed at executives and vendors treating code volume as business value.

The practical read

Treat AI coding productivity as an operating question, not a slogan. If your team is using AI coding tools, keep adoption metrics for internal use, but do not confuse them with results. Track whether lead time moved, whether incidents changed, whether reviews got heavier, whether test coverage became more meaningful, and whether users received valuable changes sooner.
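One way to check "whether lead time moved" and "whether reviews got heavier" is a plain before-and-after comparison of medians. The sketch below assumes a team has per-change measurements for a period before and after adopting an AI tool; the metric names and every number in it are invented for illustration. A comparison like this shows direction, not causation, since team size, scope, and release process may have changed in the same period.

```python
from statistics import median


def relative_change(before: list[float], after: list[float]) -> float:
    """Relative change in the median; -0.25 means the median fell by 25%."""
    if not before or not after:
        raise ValueError("both periods need at least one observation")
    baseline = median(before)
    if baseline == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (median(after) - baseline) / baseline


def compare_periods(
    before: dict[str, list[float]], after: dict[str, list[float]]
) -> dict[str, float]:
    """Compare every metric that was measured in both periods."""
    return {
        name: relative_change(before[name], after[name])
        for name in before
        if name in after
    }


# Made-up numbers: hours from commit to production, and review hours per pull request.
BEFORE = {
    "lead_time_hours": [20, 24, 30, 18, 26],
    "review_hours_per_pr": [2, 3, 2, 4, 3],
}
AFTER = {
    "lead_time_hours": [16, 18, 22, 14, 20],
    "review_hours_per_pr": [3, 4, 5, 4, 6],
}

changes = compare_periods(BEFORE, AFTER)
for name, change in changes.items():
    print(f"{name}: {change:+.0%}")
```

In this invented example the median lead time falls while the median review effort per pull request rises, which is the kind of mixed result that an adoption percentage alone would hide.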

For engineering leaders, the cleanest dashboard probably combines DORA-style delivery metrics with code review health, defect rates, and a small set of product outcomes. For individual developers, the test is simpler: use AI where it removes drudgery, then review the result as if a fast junior engineer wrote it. The model can speed up typing, scaffolding, and search. It cannot decide what is worth owning for the next five years.

The best version of AI coding productivity will not be measured in generated lines. It will be measured in smaller queues, safer releases, clearer ownership, and software that solves more real problems with less maintenance drag.

Sources