Tag: Software Engineering

  • MAI-Code-1-Flash puts Microsoft’s own coding model inside Copilot

    MAI-Code-1-Flash is Microsoft’s new coding model for GitHub Copilot, built for fast day-to-day developer assistance rather than frontier-model demos. Microsoft says the model is rolling out to Copilot individual users in Visual Studio Code through the model picker and the default Auto picker.

    The short version

    • Microsoft built MAI-Code-1-Flash end to end for Copilot, using clean and appropriately licensed data, according to the company announcement.
    • The company reports 51.2% on SWE-Bench Pro, compared with 35.2% for Claude Haiku 4.5, plus higher scores on SWE-Bench Verified, SWE-Bench Multilingual, Terminal Bench 2, and IF Bench.
    • The model is tuned to spend fewer tokens on simple requests and more reasoning budget on complex coding tasks, which matters for latency, cost, and Copilot’s product margins.
    • Microsoft’s own adversarial reasoning test shows gaps: MAI-Code-1-Flash reached 85.8% adjusted accuracy overall, while some trap categories stayed below 50%.
    • The Hacker News discussion centered on price, speed, benchmark trust, and whether a small Copilot model is useful if it is not open weight.

    What happened

    Microsoft introduced MAI-Code-1-Flash on June 2, 2026 as a coding model designed for GitHub Copilot workflows. The announcement describes the model as trained for repository question answering, refactoring, software engineering tasks, and Copilot-derived evaluations rather than generic chat alone.

    The placement matters. GitHub Copilot already sits inside the IDE for many developers, so Microsoft does not need MAI-Code-1-Flash to win every public benchmark to make it useful. A model that is fast, cheap enough to call repeatedly, and good at common code edits can still improve the product if Copilot routes the right work to it.

    For readers tracking AI tooling, this fits the broader move toward specialized models inside products. The model picker may look like a simple menu, but behind it the product can route a request to different models depending on task shape, expected cost, and latency. That is also why this story belongs with other IT & AI archive coverage of developer tools rather than only with model leaderboard news.

    Why MAI-Code-1-Flash is worth watching

    MAI-Code-1-Flash is worth watching because Microsoft is moving model selection closer to the product layer. Copilot can choose a Microsoft-built model for ordinary coding help while still reserving larger or more expensive models for harder tasks. That makes the model less of a standalone chatbot launch and more of an infrastructure choice inside a paid developer tool.

    Microsoft’s numbers frame the model as efficient rather than maximal. The company says MAI-Code-1-Flash solved harder SWE-Bench Verified problems using up to 60% fewer tokens. It also claims a 16-point lead over Claude Haiku 4.5 on SWE-Bench Pro, with 51.2% versus 35.2%.

    Those claims need context. Haiku is Anthropic’s smaller model line, not its most capable coding model. The useful question is whether MAI-Code-1-Flash gives Copilot a better default for frequent, lower-cost tasks such as local edits, refactors, command-driven fixes, and repository-aware explanations.

    What does MAI-Code-1-Flash change for developers?

    MAI-Code-1-Flash changes the Copilot experience only if Microsoft can make model routing feel boring in a good way. Developers usually do not want to think about which small model should answer a lint fix, which model should inspect a repository, and which one should spend more tokens on a multi-file change. Copilot’s Auto picker can hide that decision when the routing is good.
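
    To make the routing idea concrete, here is a minimal sketch of a rule-based router. Nothing in it reflects how Copilot's Auto picker actually works: the model names, task kinds, and file-count threshold are all hypothetical, and a production router would likely use richer signals than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str           # e.g. "lint_fix", "repo_qa", "multi_file_change"
    files_touched: int  # rough proxy for how contained the task is


# Hypothetical model tiers; real routing signals and names are not public.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"

# Hypothetical set of task kinds considered cheap enough for the small model.
SIMPLE_KINDS = {"lint_fix", "rename", "repo_qa"}


def route(task: Task, max_small_files: int = 3) -> str:
    """Send contained tasks to the small model, everything else to the large one."""
    if task.kind in SIMPLE_KINDS and task.files_touched <= max_small_files:
        return SMALL_MODEL
    return LARGE_MODEL


if __name__ == "__main__":
    for t in (Task("lint_fix", 1), Task("multi_file_change", 12), Task("repo_qa", 40)):
        print(t.kind, "->", route(t))
```

    The point of the sketch is the shape of the decision, not the thresholds: routing only feels boring when the fallback to the larger model is the default and the small model has to earn each request.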

    The risk is that benchmark performance does not map cleanly to working code. Microsoft’s adversarial evaluation is a useful warning: the model scored 85.8% adjusted accuracy across 186 questions and 34 categories, but fell below 50% on some trap types such as Einstellung-style problems. In practice, teams should treat MAI-Code-1-Flash as a fast assistant for contained tasks, not as a reason to weaken tests or review.

    For app and tool builders, the product angle may matter more than the model card. If Copilot can make specialized model routing normal inside VS Code, other developer tools will face pressure to offer similar model pickers, agent modes, and cost-aware routing.

    What Hacker News readers are arguing about

    The Hacker News discussion was less impressed by the headline benchmark than by the economics behind it. Several commenters asked for tokens-per-second and price-per-token numbers, arguing that an “efficient” coding model is hard to judge without latency and pricing. One practical objection was simple: developers care about price, performance, and latency together, not token count as an implementation detail.
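
    The commenters' objection is easy to show with arithmetic. The sketch below uses invented prices and decode speeds, not figures from Microsoft or anyone else, to show why a token count cannot be judged on its own.

```python
def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Dollar cost of generating `tokens` at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


def latency_s(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


# Hypothetical numbers, chosen only to show how the three factors interact.
baseline = {"tokens": 10_000, "usd_per_million": 2.0, "tps": 100.0}
frugal = {"tokens": 4_000, "usd_per_million": 6.0, "tps": 50.0}  # 60% fewer tokens

if __name__ == "__main__":
    for name, m in (("baseline", baseline), ("frugal", frugal)):
        cost = cost_usd(m["tokens"], m["usd_per_million"])
        secs = latency_s(m["tokens"], m["tps"])
        print(f"{name}: ${cost:.3f} per task, {secs:.0f}s")
```

    In this made-up case the model that spends 60% fewer tokens finishes sooner but costs more per task, which is exactly why readers asked for price and throughput alongside the token savings.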

    Another thread focused on benchmark trust. Some readers questioned whether the model had been tuned too closely against SWE-Bench-style tasks, while others pointed to Microsoft’s decontamination language and model-card material. The thread did not settle the issue, but the skepticism is useful. Coding benchmarks can be gamed, and even honest benchmark gains may not predict whether the assistant helps on messy internal repositories.

    The split on small models was more interesting. Some commenters saw MAI-Code-1-Flash as evidence that specialized small or mixture-of-experts models will handle more work locally or cheaply. Others pushed back that state-of-the-art models will keep growing because the target tasks will grow too. There was also disappointment that the model does not appear to be open weight, especially given Microsoft’s history with Phi.

    The practical read

    MAI-Code-1-Flash should be judged as a Copilot routing model, not as a replacement for Claude, GPT, or other high-end coding agents. The right test is whether it makes common IDE work faster without making developers babysit wrong patches.

    For individual developers, the first useful experiment is narrow: try MAI-Code-1-Flash on refactors, small bug fixes, repository Q&A, and terminal-driven cleanup tasks. Check whether it stays concise on simple requests and whether it asks for context when a task is underspecified.

    For engineering teams, the adoption question is about guardrails. Keep tests, code review, and permission boundaries in place. Track whether the model reduces repeated small edits or simply moves review effort later in the workflow. If Copilot’s Auto picker improves, most developers may never care which model answered. If routing is noisy, the model picker becomes another thing to manage.

    The broader read is that Microsoft wants more control over the cost and behavior of coding assistance inside its own developer platform. MAI-Code-1-Flash gives the company a way to tune Copilot around real IDE usage, not only around whichever third-party model is available at a given price.

  • Claude Code dynamic workflows make agents plan the work

    Claude Code dynamic workflows let Claude Code write a task-specific JavaScript harness, spawn subagents, and coordinate the result instead of keeping a long job in one chat thread. Anthropic introduced the feature on June 2, 2026, and frames it as a way to handle complex coding, research, security, triage, and verification work without forcing developers to build the orchestration layer by hand.

    The short version

    • Claude Code dynamic workflows create custom harnesses for a task, then use subagents to split, verify, compare, or synthesize work.
    • Anthropic names seven useful patterns: classify-and-act, fan-out-and-synthesize, adversarial verification, generate-and-filter, tournament, loop until done, and model routing.
    • The feature is aimed at complex, high-value jobs such as refactors, migrations, deep research, source checking, support triage, and root-cause analysis.
    • The trade-off is cost and complexity. Anthropic says dynamic workflows can use significantly more tokens and are not needed for ordinary coding tasks.

    What happened

    Anthropic says Claude Code can now create a custom harness on the fly for the job in front of it. The harness is a JavaScript file with special functions for spawning and coordinating subagents, plus ordinary JavaScript built-ins such as JSON, Math, and Array for processing data. A workflow can choose which model each agent uses and whether subagents run in their own worktree, which matters when a task needs isolation or a more capable model.

    The company’s post describes this as a move beyond static orchestration. Developers could already coordinate multiple Claude Code runs through the Claude Agent SDK or claude -p, but those static harnesses tend to be generic because they have to survive many edge cases. Dynamic workflows push more of that planning into Claude Code itself: ask for a workflow, or use Anthropic’s trigger word “ultracode,” and Claude Code can build a structure for the current task.

    Why this is worth watching

    Claude Code dynamic workflows are worth watching because Anthropic is moving Claude Code from a single assistant loop toward task-level orchestration. In the June 2, 2026 post, Anthropic names three failure modes that show up in long agent runs: agentic laziness, self-preferential bias, and goal drift. Those are practical problems, not abstract benchmark issues.

    A separate harness gives Claude Code a cleaner way to check work against evidence and rubrics. One subagent can inspect logs, another can review files, another can verify claims, and a synthesis step can wait until each branch returns structured output. The feature will matter if that structure reduces missed requirements more often than it burns extra tokens. For more analysis of developer tooling and AI systems, see the IT & AI archive.

    What do Claude Code dynamic workflows change for developers?

    Claude Code dynamic workflows let developers request a repeatable process with a stop condition, a rubric, and isolated work streams. Anthropic’s examples include reproducing a flaky test that fails 1 in 50 runs, mining the last 50 Claude Code sessions for repeated corrections, checking every technical claim in a draft against a codebase, ranking 80 resumes, and reviewing a business plan from investor, customer, and competitor viewpoints.
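
    The flaky-test example shows why a stop condition matters. As a back-of-the-envelope sketch, assuming each run fails independently with the same probability (an assumption that real flaky tests often violate), the number of runs needed to see at least one failure follows from 1 - (1 - p)^n >= c:

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that at least one failure appears with the given confidence.

    Assumes independent runs with a constant failure rate.
    """
    if not (0 < failure_rate < 1 and 0 < confidence < 1):
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    print(runs_needed(1 / 50, 0.95))  # 149
    print(runs_needed(1 / 50, 0.99))  # 228
```

    A test that fails 1 in 50 runs therefore needs roughly 150 runs before a clean streak means much, which is the kind of number worth putting in the workflow's stop condition rather than leaving to the agent's judgment.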

    The strongest fit is work where one context window becomes a liability. Large refactors can be split by call site, module, or failing test. Security reviews can assign one verifier per rule. Research workflows can fan out source gathering and then check claims. Triage workflows can classify a backlog, dedupe it against known issues, and keep agents that read untrusted public content separate from agents that can take higher-privilege actions.

    Seven workflow patterns Anthropic highlights

    Anthropic’s seven workflow patterns turn Claude Code dynamic workflows into something developers can prompt deliberately. Classify-and-act routes different tasks to different behavior. Fan-out-and-synthesize splits work into clean contexts and merges structured outputs after a barrier. Adversarial verification asks another agent to check a result against a rubric. Generate-and-filter produces candidates, removes duplicates, and keeps the best tested ideas.
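
    The fan-out-and-synthesize shape can be sketched in a few lines. This is an illustration in Python with asyncio, not Anthropic's harness: the real harness is a JavaScript file whose spawning functions the post does not spell out, so `run_subagent` here is a hypothetical stub that stands in for a model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    module: str
    issues: list[str]


async def run_subagent(module: str) -> Finding:
    """Hypothetical stand-in for a subagent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    issues = [f"{module}: TODO left in code"] if module.endswith("_legacy") else []
    return Finding(module, issues)


async def fan_out_and_synthesize(modules: list[str]) -> dict[str, list[str]]:
    # Fan out: one isolated task per module, each with its own clean input.
    findings = await asyncio.gather(*(run_subagent(m) for m in modules))
    # Barrier: gather returns only after every branch has finished.
    # Synthesize: merge the structured outputs, dropping empty branches.
    return {f.module: f.issues for f in findings if f.issues}


if __name__ == "__main__":
    print(asyncio.run(fan_out_and_synthesize(["auth", "billing_legacy", "search"])))
```

    The useful property is that the synthesis step only ever sees structured results, never the branches' working context.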

    The remaining patterns handle comparison, persistence, and model choice. Tournament workflows make agents compete on the same task and use judging agents for pairwise comparisons. Loop-until-done workflows keep spawning work until no new findings or errors remain. Model and intelligence routing uses a classifier agent to decide whether a job needs a cheaper model or a stronger one such as Opus. The pattern list gives teams concrete language to use instead of vague prompts like “be thorough.”
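
    Loop-until-done is the pattern most likely to run away, so it helps to see the stop condition written out. The sketch below is a generic Python illustration under stated assumptions, not Anthropic's implementation: `find_more` is a hypothetical callback standing in for a subagent pass, and the round cap is an added safeguard.

```python
from typing import Callable, Iterable


def loop_until_done(
    find_more: Callable[[set[str]], Iterable[str]],
    max_rounds: int = 10,
) -> tuple[set[str], int]:
    """Call find_more until a round adds nothing new or the round cap is hit.

    Returns the accumulated findings and the number of rounds that ran.
    """
    seen: set[str] = set()
    for round_no in range(1, max_rounds + 1):
        new = set(find_more(seen)) - seen
        if not new:  # stop condition: a full pass found nothing new
            return seen, round_no
        seen |= new
    return seen, max_rounds  # cap reached; findings may be incomplete


def fake_scan(seen: set[str]) -> list[str]:
    """Hypothetical subagent pass that surfaces at most two new items per round."""
    backlog = ["null deref in parser", "unchecked index", "race in cache"]
    return [item for item in backlog if item not in seen][:2]


if __name__ == "__main__":
    findings, rounds = loop_until_done(fake_scan)
    print(rounds, sorted(findings))
```

    With the stub above, the loop runs three rounds: two that add findings and one empty pass that confirms there is nothing left.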

    When not to use Claude Code dynamic workflows

    Claude Code dynamic workflows should not become the default for every prompt. Anthropic says the feature is new, best practices are still developing, and workflows may consume significantly more tokens. Most normal coding tasks do not need five reviewers, a tournament bracket, or a loop that keeps running until a broad condition is met.

    A good rule is to reserve workflows for jobs where the structure is part of the value. Use them when the task needs parallel evidence gathering, adversarial checking, repeated passes, isolated worktrees, or qualitative comparison at scale. Skip them for a small bug fix, a one-file change, or a question where a normal Claude Code session can answer cleanly. Token budgets can also be set directly in the prompt, such as asking the workflow to stay under 10,000 tokens.

    What Hacker News readers are arguing about

    The Hacker News submission for Anthropic’s post existed when checked, but it had no substantive discussion attached to it. That means there is no useful community consensus to summarize yet, and it would be misleading to turn a quiet thread into a debate.

    The missing discussion is still worth noting. The questions developers should bring to a fuller thread are predictable: whether dynamic workflows are reliable enough for real codebases, how often they waste tokens, how safe the worktree isolation is, whether adversarial verification catches real mistakes, and whether teams can share reusable workflows without turning them into brittle scripts. Treat the Hacker News link as a place to watch for later operator feedback, not as evidence today.

    The practical read

    Claude Code dynamic workflows are best understood as an orchestration feature for messy work. If your team already knows how to decompose a task, the feature may remove boilerplate around spawning agents and combining results. If your team does not know the right rubric, stop condition, or trust boundary, the workflow can still produce confident noise.

    The first experiments should be bounded. Try a flaky-test reproduction, a code review checklist, a migration with isolated worktrees, or a claim-verification pass on a technical document. Give Claude Code the workflow pattern you want, the token budget, the stop condition, and the rubric for success. Then inspect the transcript and saved workflow before using it on a higher-stakes job.

  • AI technical interviews need a reset, not a chatbot test

    AI technical interviews are getting harder to design because coding assistants can now help with the exact artifacts companies used to treat as evidence. A polished take-home project no longer tells you as much about how a candidate thinks. The better question is whether the interview still exposes reasoning, review judgment, and the ability to finish one messy problem without hiding behind a model.

    The short version

    • Charles-Axel Dein argues that most companies should keep AI out of technical interviews unless the exercise is explicitly about AI use.
    • Take-home coding challenges are the weakest signal now because candidates can generate strong-looking submissions faster than interviewers can review them.
    • Live exercises, follow-up changes, and review-style questions still give companies a better look at how a candidate reasons under constraint.
    • AI fluency matters at work, but the piece treats it as an instrumental skill rather than the foundation of engineering judgment.
    • Anthropic’s own candidate guidance makes a similar split: AI can help with preparation and refinement, while take-home assessments and live interviews are usually meant to show the candidate’s own thinking.

    What happened

    Charles-Axel Dein published an essay on how companies should adapt engineering interviews as AI coding tools improve. His core recommendation is blunt: do not let AI use become the default in most interviews, and do not turn the process into a contest over who has the best prompts.

    The essay breaks interview design into two practical dimensions: signal quality and company cost. A good interview should reveal the traits the role actually needs, while staying cheap enough to run, calibrate, and explain to candidates. AI pushes on both sides. It can make a take-home challenge easier for the candidate, but it can also leave the company with more code to inspect and less confidence about who made the important decisions.

    The piece is not anti-tooling. Dein’s sharper point is that AI skill is closer to editor fluency or language familiarity than to engineering judgment. You can teach a strong engineer a new tool. It is much harder to teach the habit of breaking down ambiguous requirements, spotting risk in a codebase, or explaining why a design will fail.

    Why this is worth watching

    AI technical interviews are now a hiring product problem, not only an engineering culture debate. A company has to decide what it is actually buying with each interview round: implementation speed, reasoning, communication, review quality, integrity, or all of those at different points in the funnel.

    That matters because the old take-home model is becoming expensive in a strange way. The candidate can produce more. The company must verify more. If the review loop turns into “AI wrote it, AI graded it, and a human checked both,” the process has not saved much work. It may have added another layer of uncertainty.

    The useful move is to separate tool use from fundamentals. Let candidates prepare with AI if that matches normal work. Be explicit when AI is allowed. But keep at least part of the process focused on human reasoning: explain the tradeoff, modify the solution live, critique an AI-generated plan, review a small codebase, or walk through a product requirement that has gaps.

    For readers tracking developer tools and hiring workflows, this is also a market signal. Interview platforms, coding assessment vendors, and AI IDEs will all be pulled into the same question: are they helping teams see better evidence, or just producing cleaner artifacts? The IT & AI archive tracks similar shifts where AI tools change the workflow before teams agree on the evaluation rules.

    What Hacker News readers are arguing about

    The Hacker News submission for the essay exists, but it has no meaningful comment thread at the time of writing. That silence is useful in a small way: this is not a case where a loud thread can be treated as community consensus.

    The discussion worth having is still clear. One camp will argue that banning AI in interviews creates an artificial test because real engineers use tools. The stronger reply is that interviews are already artificial; the point is to isolate a signal. Nobody bans calculators at work out of reverence for arithmetic. Some tests ban them because the goal is to see whether the person understands the underlying operation.

    The builder argument cuts the other way. If the job requires daily collaboration with AI agents, a company should test that workflow directly. The problem is making it the whole interview. A candidate who can drive a model well but cannot detect a flawed assumption is still a risky hire.

    The practical read

    Companies should stop treating “AI allowed” as a yes-or-no policy and make it a per-stage rule. Use AI freely for application polish and interview preparation. For take-home work, either forbid it clearly or allow it and make the live follow-up do the real evaluation. For live interviews, keep at least one round where the candidate has to reason without outside assistance.

    The most practical interview formats are review-heavy. Ask candidates to inspect an AI-generated plan, find bugs in an existing implementation, respond to a changed requirement, or explain what they would delete from a proposed architecture. Those tasks map better to how AI-assisted engineering actually feels: less typing from scratch, more judgment under uncertainty.

    For candidates, the lesson is simple. Being good with AI tools helps, but it does not replace the basics. You still need to understand the code well enough to defend it, change it, and catch the part where the model sounded confident and got the problem wrong.

    AI technical interviews in practice

    A useful hiring loop should state the AI rule for each stage, then test the candidate’s own judgment somewhere in the process. That is the part a cleaner code sample cannot prove on its own.

  • LLM oriented engineering puts human context first

    LLM oriented engineering is less about making models write more code and more about protecting the parts of software work that still need human judgment. Yair Weinberger, writing from his work at Reindeer, argues that the scarce resource in AI-assisted teams is not typing speed. It is human context: the time and attention needed to understand architecture, say no to bad API changes, and keep generated work from spreading through the codebase.

    The short version

    • Weinberger frames human attention as the real bottleneck: LLMs can produce code, comments, documents, and PRs faster than people can read them.
    • His practical answer is stricter modeling discipline, especially around APIs and component boundaries.
    • Human code review alone does not scale when AI-generated pull requests grow, so teams need linters, LLM judges, tests, and smaller PRs.
    • PMs can use LLMs to prototype in isolated repositories, but product ideas that touch customers still need a slower modeling path before they reach production.
    • The sharpest claim is that AI multiplies both good and bad engineering habits. Weak structure now turns into debt faster.

    What happened

    Weinberger published a long X post under the phrase “LLM Oriented Engineering,” based on roughly 18 months of thinking about how Reindeer builds product in the LLM era. The post is not a tooling launch or a benchmark. It is a working theory for how a software organization should behave once generated code, documents, and PR descriptions become cheap.

    The starting point is simple: people have limited context windows too. If LLMs fill the organization with bloated comments, verbose documents, and sprawling pull requests, the next human reviewer gets less signal. Then the next model reads that noisy context and copies the pattern.

    That is why Weinberger puts modeling at the center. Translating a customer user journey into API flows, components, and boundaries is still human work. A model can add a convenient field to an API in seconds. The team may then have to support that field as a public contract for years.

    Why this is worth watching

    A lot of AI coding discussion still treats productivity as the main question. The more interesting question is what happens after productivity rises. LLM oriented engineering gives that problem a name: the team does not run out of code, it runs out of readable context.

    The post also pushes back on the idea that review can stay mostly human. Weinberger’s view is blunt: people cannot beat LLM output volume by reading harder. Absolute rules, such as forbidden service dependencies, belong in linters. Softer contracts can be checked by LLM judges on clean context. Humans should spend their attention on modeling changes, API changes, and other load-bearing decisions.
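
    An absolute rule of the "forbidden service dependency" kind is simple to automate. The following is a minimal sketch using Python's standard `ast` module, not a description of Reindeer's tooling: the service names and the `FORBIDDEN` table are hypothetical, and the check only inspects top-level package names in import statements.

```python
import ast

# Hypothetical rule table: code in "billing" must never import from "search".
FORBIDDEN = {"billing": {"search"}}


def forbidden_imports(service: str, source: str) -> list[str]:
    """Return one message per import in `source` that breaks the service's rules."""
    banned = FORBIDDEN.get(service, set())
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            if name.split(".")[0] in banned:
                hits.append(f"line {node.lineno}: {service} must not import {name}")
    return hits


if __name__ == "__main__":
    sample = "import os\nfrom search.index import query\n"
    for message in forbidden_imports("billing", sample):
        print(message)
```

    A check like this costs no human attention once it runs in CI, which is the argument for keeping hard rules out of review entirely.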

    One useful phrase from the post is “padded rooms.” These are parts of the system where LLMs can move fast because mistakes do not create long-term dependencies. Customer-specific work and experiments can live there. Core architecture should not.

    That distinction matters for anyone building coding agents or developer tooling. The product does not only need a better autocomplete loop. It needs workflows that separate throwaway experiments from production contracts, and it needs review surfaces that make human attention easier to spend. For more coverage of AI and developer tools, the IT & AI archive is the closest internal reference point.

    What the discussion is missing

    I could not find a matching Hacker News thread for this specific post, so there is no public HN argument to summarize. The missing debate is still obvious enough: Weinberger is describing a company that already has a strong internal engineering culture, strong tests, and enough discipline to keep prototypes away from production.

    That is the hard part to generalize. A small team can say “use padded rooms” and still let customer work leak into core code because the customer is loud, the deadline is real, and the AI-generated patch appears to work. A larger team can add LLM judges and still end up trusting a model that checks the wrong thing.

    The post would be stronger with concrete examples of the enforcement layer: what a useful LLM judge prompt checks, what gets blocked by linters, and how the team decides that an API change is load-bearing enough for human review. Without those examples, the argument is directionally useful but still a playbook outline.

    LLM oriented engineering, in practice

    There are five habits worth pulling out of the post.

    First, keep organizational text tight. If a comment or PR description explains history instead of the result, it probably costs more attention than it saves.

    Second, treat APIs as contracts. A field that helps one generated patch can become a long-running support burden.

    Third, make pull requests small enough to read. If a reviewer cannot hold the change in their head, the approval is mostly theater.

    Fourth, invest in reward functions. In software work, that means useful tests, end-to-end coverage where it matters, evals for LLM-backed features, and automated review that starts from clean context.

    Fifth, isolate experiments. Let PMs and agents build fast demos, but make production adoption a separate modeling decision.

    None of this is glamorous. That is the point. LLM oriented engineering is not a new layer of magic on top of software teams. It is old engineering hygiene under much higher output pressure.

    The practical read

    If your team is adopting coding agents, start by mapping which parts of your codebase are load-bearing. APIs, shared data models, permission boundaries, and core workflows should get slower review. UI experiments, customer-specific adapters, and disposable prototypes can move faster if they stay isolated.

    Then look at the review burden. If AI has made PRs bigger, comments longer, and docs noisier, you have gained less leverage than it appears. You have moved work from typing to comprehension.

    The practical test is simple: can a new engineer, or a clean-context review agent, understand why the system is shaped the way it is? If not, more generated code will make the team feel faster while making the product harder to change.

  • Domain expertise is the AI coding moat

    Domain expertise is becoming more valuable as AI coding agents make software easier to produce. Aaron Brethorst’s argument is simple and uncomfortable: the bottleneck moves from writing the code to knowing whether the thing the code does is correct.

    The short version

    • AI coding agents lower the cost of implementation, but they do not automatically know the messy rules inside payroll, transit, insurance, logistics, or clinical billing.
    • Domain expertise matters because the expert can spot a plausible answer that is wrong before it turns into a costly system.
    • The strongest engineer in this setup is not the fastest prompt writer. It is the person who can judge the code and the real-world result.
    • Hacker News readers mostly agreed with the premise, but pushed back on the idea that domain experts can easily explain their own rules to an AI system.

    What happened

    Brethorst’s essay argues that software has always depended on a mental model of the domain. A payroll system is hard because of garnishments, deductions, rate changes, and edge cases. A transit app is hard because routes, trips, schedules, and rider expectations do not line up cleanly.

    In that view, code is the transcription layer. The harder work is learning enough of the domain to know what the software should do.

    AI coding agents weaken the old link between understanding and implementation. A person can now ask an agent to build screens, APIs, tests, and deployment scripts without years of programming practice. That helps domain experts, because the missing piece for many of them was code production. It does less for a generalist engineer who lacks the domain model and cannot tell whether a generated output is actually right.

    That distinction matters for teams following AI and software engineering closely in the IT & AI archive. Faster output is useful only when the organization has someone who can define and verify correctness.

    Why this is worth watching

    The essay lands because it pushes against a lazy version of the AI coding story. If code gets cheaper, the valuable work does not disappear. It moves closer to judgment.

    A logistics dispatcher may not read a stack trace, but they can look at a generated schedule and know that a driver cannot legally work that shift. A clinical coder may not care how the rules engine is structured, but they can see when a claim is likely to be denied. That is not generic “business context.” It is accumulated pattern recognition from years of seeing inputs, outputs, exceptions, and consequences.

    This is also a career argument. Senior developers still need architecture, reliability, testing, and incident judgment. But if their only advantage is turning clear requirements into clean code, that advantage is getting thinner. The rarer combination is engineering skill plus a working model of a real domain.

    For product teams, the practical question is where domain expertise sits in the AI workflow. If experts only review the product after engineers and agents have already built it, the process will keep producing polished wrong answers. The expert needs to shape tests, examples, acceptance criteria, and failure cases early.

    What Hacker News readers are arguing about

    The Hacker News discussion was less about whether domain expertise matters and more about whether domain experts can make their knowledge explicit enough for software.

    One strong objection was that verifying an answer is different from explaining how to generate it. Several commenters who had worked with finance or accounting teams said experts often know a rule when they see it, but struggle to describe it fully. That led to a useful thread around tacit knowledge and Polanyi’s paradox: people can know more than they can explain.

    Another camp argued that requirements work has always been the real software job. In small companies and internal systems, refining what the system should do often takes more time than writing the code. AI may make this more obvious rather than make it new.

    There was also a builder-friendly angle. Some commenters said AI can help engineers learn a domain faster because it removes boilerplate and lets them build experiments quickly. A few mentioned domain-specific languages as a better bridge: instead of expecting experts to write software, give them a constrained language that encodes the rules and can be tested against past cases.
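
    The "test the rules against past cases" idea can be sketched without a full DSL. In the example below, every number and case is invented for illustration: the shift limits are not real regulation, and `PAST_CASES` stands in for historical decisions that a domain expert has already labeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Shift:
    hours_driven: float
    rest_hours_before: float


# Hypothetical limits for illustration only; not real regulation.
MAX_DRIVE_HOURS = 11.0
MIN_REST_HOURS = 10.0


def is_legal(shift: Shift) -> bool:
    """Encoded rule: a shift is legal if it respects both limits."""
    return shift.hours_driven <= MAX_DRIVE_HOURS and shift.rest_hours_before >= MIN_REST_HOURS


# Invented past cases with the verdict an expert would have given.
PAST_CASES = [
    (Shift(9.0, 12.0), True),
    (Shift(12.5, 12.0), False),  # drove too long
    (Shift(8.0, 6.0), False),    # not enough rest
]


def mismatches(rule: Callable[[Shift], bool], cases: list) -> list:
    """Return every past case where the rule disagrees with the expert's verdict."""
    return [(shift, expected) for shift, expected in cases if rule(shift) != expected]


if __name__ == "__main__":
    print(mismatches(is_legal, PAST_CASES))
```

    This also sidesteps part of the tacit-knowledge problem raised in the thread: the expert does not have to articulate the rule, only to confirm or reject the cases where the encoded rule and their own verdict disagree.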

    The useful skepticism is this: domain experts are not automatically good product designers, requirements writers, or system builders. The win probably comes from tighter collaboration, where experts supply examples and corrections while engineers turn that knowledge into reliable systems.

    The practical read

    If you run an engineering team, do not measure AI coding only by tickets closed or lines generated. Add domain validation to the workflow. Ask who owns the examples, who writes the edge-case tests, and who can reject a result that looks reasonable but fails a real rule.

    If you are a developer, the career move is not to panic about code generation. Pick a domain where mistakes matter and learn it seriously. Billing, compliance, logistics, security operations, financial workflows, health care administration, industrial systems, and public-sector processes all have rules that are hard to fake.

    The near-term advantage belongs to people who can ask an AI agent for working software, then say with evidence whether the output is correct. Domain expertise is the moat because correctness is still tied to the world outside the editor.

  • Cursor Developer Habits Report shows AI coding is changing shape

    Cursor Developer Habits Report shows AI coding is changing shape

    Source: The Cursor Developer Habits Report

    AI coding tools are no longer just making autocomplete feel smarter. Cursor’s Spring 2026 Developer Habits Report points to something messier: more code, larger PRs, deeper agent sessions, and a widening gap between casual users and people who have turned agents into a real workflow.

    The short version

    • The Cursor Developer Habits Report says lines added per developer per week rose from 3.6K in early 2025 to 8.6K by May 2026.
    • PRs are getting much larger. The p75 lines added per PR moved from 125.86 to 345.02.
    • Big PRs are less rare now: merged PRs with at least 1,000 changed lines rose from 8.0% to 13.8%.
    • AI usage is concentrated. Cursor reports Gini scores of 0.77 for AI lines, 0.75 for AI spend, and 0.72 for token consumption.
    • The input/output token ratio rose from 4.52× to 11.41×, which means agents are reading far more before they write.

    What happened

    Cursor published a product-data report on how developers are using AI inside its coding environment. The headline number is easy to understand: developers are adding more code. But the more useful signal is that the unit of work is getting bigger.

    Lines added per developer per week rose from 3.6K to 8.6K. That is a big jump. It is also a dangerous number to overread. More lines can mean more output. They can also mean more churn, more review load, or more code that somebody has to clean up later.

    Cursor chart showing weekly lines added per developer

    Source: The Cursor Developer Habits Report

    The PR data is harder to ignore. The p75 lines added per PR went from 125.86 to 345.02, and the share of merged PRs with at least 1,000 changed lines rose from 8.0% to 13.8%. That changes the reviewer’s job. A larger diff needs a clearer intent, better tests, and a smaller blast radius.

    Cursor chart showing p75 lines added per pull request

    Source: The Cursor Developer Habits Report
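
    For teams that want to track the same two PR-size signals on their own repositories, the arithmetic is small. The sketch below is a minimal illustration, not Cursor's method: the record shape (`added`, `changed`), the sample values, and the choice of the "inclusive" percentile method are all assumptions, since the report does not say how its percentiles were computed.

```python
import statistics


def pr_size_summary(prs, big_threshold=1000):
    """Return (p75 of lines added, share of PRs with >= big_threshold changed lines)."""
    added = [pr["added"] for pr in prs]
    # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    p75_added = statistics.quantiles(added, n=4, method="inclusive")[2]
    big = sum(1 for pr in prs if pr["changed"] >= big_threshold)
    return p75_added, big / len(prs)


# Hypothetical merged PRs (not Cursor data): lines added and total lines changed.
sample_prs = [
    {"added": 10, "changed": 15},
    {"added": 20, "changed": 30},
    {"added": 30, "changed": 45},
    {"added": 40, "changed": 1200},
    {"added": 50, "changed": 80},
]

p75, big_share = pr_size_summary(sample_prs)
print(f"p75 lines added: {p75}, share of big PRs: {big_share:.0%}")
```

    Tracking both numbers over time matters more than either snapshot: a rising p75 with a flat big-PR share points to a general drift in PR size, while a rising big-PR share points to a growing tail of very large changes.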

    Cost is part of the story too. Cursor shows average agent request cost ranging from $1.57 for Opus 4.7 down to $0.18 for Composer 2.5. The gap narrows when cost is measured per accepted added line, but it does not go away. Model choice now affects product quality and margins at the same time.

    Cursor chart comparing average agent request cost by model

    Source: The Cursor Developer Habits Report

    Why this is worth watching

    The Cursor Developer Habits Report is useful because it shows the awkward middle stage of AI coding. The tools are good enough to change how people work, but not clean enough to remove the need for discipline.

    Bigger PRs are not automatically better. Deeper agent sessions are not automatically safer. Cursor also reports that the 60-minute survival share for accepted AI lines rose from roughly 76% to 81%, which is a decent signal. But a line surviving for an hour is not the same as a line staying cheap to maintain for six months.

    The power-user gap may be the most important part. If the top users learn how to scope work, feed context, inspect diffs, and run checks, their curve bends faster than everyone else’s. Buying the tool does not spread that skill evenly across a team.

    Cursor chart showing AI usage concentration and Gini scores

    Source: The Cursor Developer Habits Report
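
    To make the Gini scores concrete: a Gini coefficient of 0 means every developer accounts for the same amount of usage, and values near 1 mean a few developers account for almost all of it. The sketch below uses the standard rank-weighted formula on made-up numbers. The sample teams are hypothetical, and nothing here reproduces Cursor's data or its exact calculation.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 is perfectly even, near 1 is concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over ascending values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical weekly AI-line counts for ten developers (not Cursor data).
even_team = [100] * 10
skewed_team = [0, 0, 5, 5, 10, 10, 20, 50, 300, 600]

print(f"even team:   {gini(even_team):.3f}")
print(f"skewed team: {gini(skewed_team):.3f}")
```

    In the skewed sample, two developers produce 90% of the AI lines and the coefficient comes out at about 0.78, which is the neighborhood of the scores Cursor reports.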

    AI coding notes for builders

    For developer-tool teams, the context numbers are the part to stare at. The input/output token ratio climbed above 11×. That suggests the agent experience is becoming a reading problem as much as a writing problem.

    Cursor chart showing input to output token ratio growth

    Source: The Cursor Developer Habits Report

    Repo maps, search, cache behavior, tool calls, terminal output, and review surfaces may matter as much as the base model. Users do not experience “model quality” in the abstract. They notice whether the agent understood their codebase or confidently edited the wrong thing.

    What the discussion is missing

    Cursor’s data comes from real product usage, which makes it more useful than a survey. It is still Cursor’s own user base. Treat it as a strong signal, not an industry-wide average.

    The missing comparison is downstream quality. Defect rates. Rollbacks. Review time. Test coverage. Maintenance cost after AI-assisted changes land. Lines added and PR size are easy to chart. Engineering health is where the bill shows up later.

    The practical read

    Engineering leaders should watch review systems alongside AI adoption. If agents make PRs larger, teams need sharper change descriptions, better test evidence, and a habit of splitting risky work before it becomes unreadable.

    Individual developers should treat AI coding as a workflow skill. Ask for smaller changes. Provide the files that matter. Read the diff. Run the tests. Reject output quickly when it drifts. That sounds boring, but that is the difference between speed and cleanup.

    For more AI and developer-tool coverage, see the AI & Technology archive.

  • Boring technology is a sharper engineering bet than it sounds

    Boring technology is a sharper engineering bet than it sounds

    Boring technology is not a plea for timid engineering. Dan McKinley’s 2015 essay argues that teams have a limited budget for novelty, and spending it on databases, queues, deployment plumbing, and service discovery can quietly steal attention from the product itself.

    The short version

    • McKinley’s core idea is the “innovation token”: every unfamiliar technology consumes attention, debugging time, hiring capacity, and operational patience.
    • “Boring” means well understood, not low quality. MySQL, Postgres, Python, Cron, and similar tools are boring because their failure modes are easier to predict.
    • The advice is strongest for startups and small teams. A tool that looks optimal for one subsystem can make the whole company harder to operate.
    • New technology still has a place when it is central to the product or removes a real constraint. The bar should be higher than “the demo looked good.”

    What happened

    Dan McKinley published “Choose Boring Technology” in 2015, drawing on his time at Etsy and on lessons from technical leadership there. The essay has kept circulating because it gives engineers a simple way to talk about platform risk without turning every stack debate into taste warfare.

    The memorable frame is that each company gets only a few innovation tokens. Pick Node.js, MongoDB, a new service discovery system, or a homegrown database, and you have spent one. The exact examples have aged, which is part of the point. Some technologies that felt risky in 2015 are ordinary now. The useful question is not whether a named tool is permanently safe or unsafe. It is whether your team already understands the tool’s limits, failure modes, and maintenance cost.

    McKinley is not arguing that teams should freeze their stack forever. He is arguing for global optimization. A tool can be the best local answer for one feature and still be the wrong company-level choice once monitoring, testing, hiring, incident response, and handoff costs enter the picture.

    Why this is worth watching

    The essay reads differently in 2026 because AI infrastructure has made shiny-stack pressure worse. A team can now add a vector database, orchestration framework, eval harness, agent runtime, observability layer, and model gateway before it has proved that the product solves a real user problem.

    That does not mean teams should avoid the AI stack. It means the “innovation token” model is even more useful. If the product’s real risk is model quality, workflow fit, or distribution, then spending novelty on routine plumbing is expensive. For more posts on practical tech judgment, see the IT & AI archive.

    The sharper reading is this: boring technology buys room to be bold somewhere else. A startup may need a risky model workflow or a new interface pattern. It probably does not need five risky infrastructure choices at the same time.

    What Hacker News readers are arguing about

    The Hacker News discussion is old but still useful because it shows where the advice meets developer identity. Many readers agreed with the broad lesson: code and infrastructure carry a maintenance cost, and chasing trends can become resume padding disguised as architecture.

    The pushback was more interesting than a simple pro-boring consensus. Some commenters argued that code is also an asset, not only a liability, and that speculative learning is part of becoming a better engineer. Others pointed out that “boring” changes with time. Node.js and MongoDB were used as examples of novelty in the original essay, but by the 2021 discussion several readers argued that Node had become mainstream enough to count as boring in many teams.

    The practical split is really about context. A consultancy, database company, or developer platform may have a good reason to spend tokens on the core technology it sells. A payments startup or marketplace usually has less reason to invent its own operational substrate. The thread also returns to hiring: familiar stacks are easier to staff, review, debug, and hand off when the first expert leaves.

    Boring technology in practice

    A useful stack review can be blunt. List every major system that needs special knowledge: database, queue, runtime, deployment layer, auth, observability, AI orchestration, and data pipeline. Then ask which choices are essential to the company’s edge and which ones are merely interesting.

    For each nonstandard choice, write down who can operate it during an incident, how it fails under load, how the team tests it, what migration would cost, and whether the same user outcome could be reached with a familiar tool. If nobody can answer those questions, the team may be spending an innovation token without admitting it.
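
    The review questions above can be kept as a small structured checklist rather than a meeting note. This is a minimal sketch under stated assumptions: the field names, the `standard` flag, and the sample stack are invented for illustration and do not come from McKinley's essay.

```python
from dataclasses import dataclass

# The five questions from the review, as hypothetical field names.
QUESTIONS = (
    "incident_operator",     # who can operate it during an incident
    "failure_under_load",    # how it fails under load
    "test_approach",         # how the team tests it
    "migration_cost",        # what migrating away would cost
    "familiar_alternative",  # whether a familiar tool reaches the same outcome
)


@dataclass
class StackChoice:
    name: str
    standard: bool  # True if the team already runs this comfortably
    incident_operator: str = ""
    failure_under_load: str = ""
    test_approach: str = ""
    migration_cost: str = ""
    familiar_alternative: str = ""

    def open_questions(self):
        """Questions nobody has written an answer for yet."""
        return [q for q in QUESTIONS if not getattr(self, q).strip()]


def unadmitted_tokens(stack):
    """Nonstandard choices that still have unanswered review questions."""
    return {
        c.name: c.open_questions()
        for c in stack
        if not c.standard and c.open_questions()
    }


# Hypothetical stack for illustration only.
stack = [
    StackChoice("Postgres", standard=True),
    StackChoice("new vector store", standard=False, incident_operator="on-call lead"),
]

for name, gaps in unadmitted_tokens(stack).items():
    print(f"{name}: unanswered -> {', '.join(gaps)}")
```

    The point of the structure is the empty fields: anything nonstandard that shows up in the output is a choice the team is paying for without having priced it.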

    This is especially relevant for app builders and developer tool teams. Product discovery and marketplace rankings tend to reward visible features, but retention often comes from reliability. A tool that lets customers keep their boring stack while adding one valuable capability may be easier to adopt than a product that demands a full platform rethink.

    The practical read

    Use boring technology as a default, not a religion. If a new tool removes the main bottleneck in your business, test it seriously. If it only makes the architecture diagram look more current, leave it out.

    The best version of McKinley’s advice is not anti-innovation. It is anti-waste. Save the weirdness for the part of the product where weirdness actually compounds. Everywhere else, boring is often what lets the team keep shipping.

  • SQLite agentic code policy draws a hard line for AI patches

    SQLite agentic code policy draws a hard line for AI patches

    SQLite added a plain rule to its repository guidance: it does not accept agentic code as a contribution. The project still welcomes bug reports that include a reproducible test case, which makes this less of an anti-AI manifesto and more of a maintenance boundary for a public-domain database used almost everywhere.

    The short version

    • SQLite’s AGENTS.md says the project does not accept agentic code, even though maintainers may review concise proof-of-concept patches before reimplementing changes themselves.
    • The project separates code contributions from bug reports: AI-assisted reports are acceptable when they include a reproducible test case.
    • The policy is tied to public-domain requirements, long-lived C code, Fossil-based development, and the cost of reviewing patches the maintainers did not write.
    • For AI coding tools, the useful lesson is blunt: a good repro may travel farther than a generated patch.

    What happened

    SQLite now has an AGENTS.md file aimed at people pointing coding agents at the SQLite source tree. The file explains project basics, build commands, testing commands, repository conventions, and contribution rules.

    The sharp part is the contribution policy. SQLite says it does not accept pull requests without prior agreement or legal paperwork that places the contribution in the public domain. It also says, in a separate sentence, that SQLite does not accept agentic code. Maintainers may still review a short, well-written pull request as a proof of concept, but the human SQLite developers reimplement accepted ideas themselves.

    That distinction matters because SQLite is not run like a typical GitHub-first project. Its canonical repository is Fossil, not Git, and its public-domain status is part of the project’s identity. A generated patch is not only a review burden. It can also blur authorship and provenance in a codebase that treats those details seriously.

    Why this is worth watching

    Most open source projects will not copy SQLite word for word. Plenty of maintainers do accept pull requests, and many projects live inside GitHub’s normal review flow. Still, SQLite has given maintainers a clean pattern: reject AI-written code as merge material while accepting AI-assisted evidence when it helps a human reproduce the problem.

    That is a useful split. A patch asks maintainers to trust the author, the code path, the licensing story, the tests, and the future maintenance cost. A reproducible bug report asks them to verify a failure. Those are different jobs.

    The wider lesson for developer tools is that output format matters. If an AI coding assistant produces a patch with no small failing test, it may be creating work for the maintainer. If it produces a minimal case, commands to reproduce it, and enough context for a person to inspect the failure, it has a better chance of being useful.

    For more coverage of developer-tool policy and AI engineering practice, see the IT & AI archive.

    What Hacker News readers are arguing about

    The Hacker News thread around Simon Willison’s write-up is small, so there is not enough there to claim a broad community consensus. The useful point in the comments is a clarification: SQLite is not refusing every artifact touched by an agent. It is refusing agent-written code as codebase input, while still allowing possible fixes to appear as documentation and accepting reproducible bug reports.

    A related earlier discussion on the prototype AGENTS.md commit framed the policy as a reasonable compromise. The tone was less “AI is banned” and more “give agent users rules, then keep generated code out of the project unless a human maintainer owns the final implementation.” That reading fits the file itself.

    The argument that remains open is practical. If AI tools get better at producing tests, minimization steps, and failure cases, maintainers may welcome them as triage tools. If the tools mostly produce plausible patches, projects with strict ownership rules will keep pushing back.

    SQLite agentic code policy in practice

    For this project, agent-written code is the wrong deliverable. A reproducible test case is the right one.

    That should influence how developers use coding agents around mature open source infrastructure. Instead of asking an agent to “fix SQLite,” ask it to isolate the failing behavior, reduce the input, show the exact command that fails, and explain why the result conflicts with documented behavior. If a patch is generated along the way, treat it as a debugging note, not as something to submit.

    For coding-agent companies, this is also a product signal. The next useful feature may not be a bigger diff. It may be a maintainer-friendly report: environment, build command, failing test, expected result, actual result, and a short explanation a human can audit.
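
    One way to picture that report is as a structure that refuses to render until every field a maintainer needs is filled in. The sketch below is an illustration only: the class, its field names, and the placeholder values are hypothetical, and it does not reflect any format SQLite or a coding-agent vendor has published.

```python
from dataclasses import dataclass, fields


@dataclass
class ReproReport:
    environment: str
    build_command: str
    failing_test: str
    expected_result: str
    actual_result: str
    explanation: str

    def missing_fields(self):
        """Names of fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Plain-text report; refuses to render when any field is empty."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"incomplete report, missing: {', '.join(missing)}")
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


# Placeholder values; a real report would carry the actual commands and output.
report = ReproReport(
    environment="<OS, compiler, and project version>",
    build_command="<build command from the project's own docs>",
    failing_test="<smallest input that triggers the failure>",
    expected_result="<what the documentation says should happen>",
    actual_result="<what happened instead>",
    explanation="<two or three sentences a human can audit>",
)
print(report.render())
```

    The design choice worth copying is the hard failure on missing fields: an agent that cannot state the expected result has not finished the report, whatever else it generated.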

    The practical read

    If you maintain an open source project, SQLite’s policy is a good starting template even if you soften the wording. Say whether you accept AI-written patches. Say whether AI-assisted bug reports are allowed. Say what evidence makes a report useful. The policy does not need to be dramatic; it needs to reduce ambiguity before the first generated pull request lands.

    If you contribute to projects with AI help, submit less code and better evidence. A concise failing test and reproduction steps respect the maintainer’s time. A large generated patch shifts the risk to someone else.

  • Claude Code dynamic workflows raise the bar for agentic coding

    Claude Code dynamic workflows raise the bar for agentic coding

    Claude Code dynamic workflows are Anthropic’s new attempt to make AI coding agents handle work that usually breaks a single chat session: large migrations, broad bug hunts, code review passes, and security audits. The feature lets Claude Code create orchestration scripts, fan work out to tens or hundreds of subagents, and fold the results back into one coordinated answer.

    The short version

    • Anthropic says Claude Code can now split large coding tasks into parallel subagents, then check the results before combining them.
    • The headline case is Bun’s Zig-to-Rust port: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, and 11 days from first commit to merge.
    • The feature is available in research preview for Claude Code CLI, Desktop, the VS Code extension, the API, Amazon Bedrock, Vertex AI, and Microsoft Foundry.
    • The useful question is not whether agents can generate more code. It is whether teams can afford the tokens, trust the tests, and review the output without losing control.

    What happened

    Anthropic introduced dynamic workflows for Claude Code on May 28, 2026. The feature is built for tasks that have too much breadth for one agent pass: searching a service for related bugs, migrating many files, stress-testing a plan, or running several review angles before a team commits to a change.

    The mechanics matter. Claude Code plans from the prompt, breaks the work into subtasks, runs subagents in parallel, checks the outputs, and keeps iterating until the answers converge. Anthropic also says progress is saved during longer runs, so an interrupted job can resume instead of starting from zero.
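
    The plan, fan-out, check, and iterate loop can be sketched in a few lines. To be clear about what this is: a generic illustration of the pattern using Python's standard thread pool, with stub functions standing in for subagents. It is not Anthropic's implementation or API, every name in it is hypothetical, and it leaves out the progress saving and resume behavior the announcement describes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(subtasks, worker, check, max_rounds=3, max_workers=4):
    """Fan subtasks out to workers in parallel and retry those whose output fails the check.

    Returns (accepted results by task, tasks still unresolved after max_rounds).
    """
    results = {}
    pending = list(subtasks)
    for _ in range(max_rounds):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            outputs = list(pool.map(worker, pending))
        retry = []
        for task, output in zip(pending, outputs):
            if check(task, output):
                results[task] = output
            else:
                retry.append(task)
        pending = retry
    return results, pending


# Stub "subagent": task "b" only produces a valid answer on its second attempt.
attempts = {}


def stub_worker(task):
    attempts[task] = attempts.get(task, 0) + 1
    if task == "b" and attempts[task] < 2:
        return None
    return task.upper()


def stub_check(task, output):
    """Stand-in for a real check such as running the test suite."""
    return output == task.upper()


done, unresolved = run_workflow(["a", "b", "c"], stub_worker, stub_check)
print(done, unresolved)
```

    Two limits carry most of the weight here. `max_rounds` caps how long the loop can keep spending, and `check` is only as good as the ground truth behind it, which is the same caveat that applies to the real feature.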

    Availability is broad, but not identical across plans. Max and Team users, plus API users, get the feature on by default. Enterprise customers need an admin to enable it. Anthropic also warns that the feature can use substantially more tokens than a normal Claude Code session, which is probably the first thing a team should test before pointing it at a real migration.

    Why this is worth watching

    The Bun example is the reason this announcement is getting attention. Anthropic says Jarred Sumner used dynamic workflows to port Bun from Zig to Rust, with one workflow mapping Rust lifetimes for struct fields, another writing behavior-identical Rust files from Zig counterparts, and a fix loop driving builds and tests until they passed.

    That is an impressive story, but it is also a narrow one. Bun had an owner who knew the codebase deeply and a test suite strong enough to be a useful target. Many companies have neither. In those environments, faster agent output can create a larger review burden instead of a cleaner path to shipping.

    The more durable shift is that coding tools are moving from autocomplete toward orchestration. For more coverage of that shift, the IT & AI archive tracks similar developer-tool and AI infrastructure moves. Claude Code dynamic workflows fit that pattern: the product is less about a clever code suggestion and more about managing a temporary swarm of workers around a codebase.

    What Hacker News readers are arguing about

    The Hacker News discussion is skeptical in a useful way. Several commenters read the launch as a token-burn feature first and a productivity feature second. Their concern is straightforward: more agents, more reviewers, and longer runs can multiply usage before a team knows whether the result is correct.

    The strongest technical objection is about ground truth. Bun is a convenient proof point because a port can be checked against an existing behavior model and a large test suite. Most software work is messier. Product intent, hidden invariants, flaky tests, and review judgment are harder to encode than “make the tests pass.” A few commenters described agents drifting from the requested task or even damaging the test harness while still producing passing CI.

    The builder argument is not empty, though. Some commenters said more tokens can be worth it when they buy independent review passes, adversarial checks, and broader search across a codebase. Jarred Sumner also joined the thread to say dynamic workflows made Claude more effective at complex long-running tasks, describing the workflow as closer to a task-specific build system than a freeform chat.

    The thread lands in a practical middle: parallel agents may help when the task is wide, testable, and well-scoped. They look much weaker when the team cannot define success, interrupt the run cleanly, inspect decisions, or cap cost.

    Claude Code dynamic workflows in practice

    The safest mental model is a temporary build system for one difficult job. You give it a narrow target, enough checks to catch bad work, and a human-owned merge gate at the end.

    The practical read

    Treat Claude Code dynamic workflows as an orchestration tool, not a replacement for engineering judgment. The first good use case is not a vague feature build. It is a bounded job with a reliable check: a mechanical migration, dead-code discovery, broad static review, security candidate search, or a refactor guarded by tests.

    Teams should run one small pilot and measure four things before expanding it: token cost, changed-line volume, review time, and defect rate after human review. If those numbers are worse than a normal Claude Code session, the parallelism is noise. If they are better, the next question is governance: who can start long runs, which repositories are allowed, where logs live, and what must be reviewed before merge.

    For app and developer-tool builders, the product lesson is clear enough. Discovery surfaces for coding assistants will increasingly reward tools that explain control, auditability, and workflow repeatability. Raw generation speed is no longer the whole pitch.

  • The orchestration tax is the real limit on AI agents

    The orchestration tax is the real limit on AI agents

    The orchestration tax is Addy Osmani’s name for the cost developers pay when many AI agents produce work faster than one human can review it. The pitch for agentic coding often starts with parallelism. The harder question is what happens when every result still has to pass through one person’s judgment.

    The short version

    • The orchestration tax appears when launching AI agents is cheap but reviewing their output is still slow.
    • Osmani argues that the scarce resource is not agent runtime. It is the developer’s attention.
    • More agents can deepen the review queue instead of increasing merged, reliable work.
    • The useful operating rule is backpressure: scale agents to review capacity, not to whatever the UI allows.
    • This matters for builders of agent tools because workflow design may matter more than raw parallel execution.

    What happened

    Addy Osmani, a Google Cloud AI director and longtime developer advocate, published a short article on X titled “The Orchestration Tax.” It grew out of a Google I/O panel with Richard Seroter, Aja Hammerly, and Ciera Jaspan about where software engineering is going as AI agents become normal parts of the development workflow.

    His argument is blunt: starting more AI agents is easy, but there is still only one person reading the diffs, resolving conflicts, and deciding whether the work fits the system. That makes the human reviewer the serial resource in an otherwise concurrent system.

    Osmani uses two familiar software ideas to frame the problem. The first is Python’s Global Interpreter Lock: many threads can exist, but work that needs the lock still runs through one choke point. The second is Amdahl’s Law: parallel speedup is capped by the part of the process that cannot be parallelized. In agentic software development, he says, that serial part is judgment.
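
    Amdahl's Law makes the ceiling easy to compute: with a parallelizable fraction p and n workers, speedup is 1 / ((1 - p) + p / n). The sketch below applies it to agent work. The 60/40 split between agent-doable work and human review is an assumed figure for illustration, not a number from Osmani's article.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: overall speedup when only part of the work can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumption for illustration: 60% of a task can be handed to agents,
# 40% is human review and judgment that stays serial.
PARALLEL_FRACTION = 0.6

for agents in (1, 2, 5, 20, 1000):
    print(f"{agents:>4} agents -> {amdahl_speedup(PARALLEL_FRACTION, agents):.2f}x")
```

    Under that assumption the speedup can never reach 2.5x, however many agents run, because the serial 40% sets the limit at 1 / 0.4. Shrinking the review share moves the ceiling; adding agents does not.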

    Why this is worth watching

    The orchestration tax is useful because it moves the AI agent debate away from demos and toward throughput. A dashboard full of active agents can feel productive. That does not mean the team is shipping better code.

    The risk is cognitive debt. If developers accept agent output because forming an independent opinion has become too tiring, they lose the mental model of their own codebase. The failure may not show up on the same day. It shows up later, when production breaks and nobody can explain why the system works that way.

    Osmani’s practical advice is closer to systems design than personal discipline. Use backpressure. Keep the parallel agent count low enough that reviews stay real. Split work into two piles: isolated tasks that can run in the background, and complex tasks where the human judgment is the work. Batch reviews instead of constantly context-switching between half-finished threads.

    That framing is also relevant to the broader IT & AI archive, where the pattern keeps repeating: the biggest gains from AI tools often depend on the boring workflow around the model.

    What Hacker News readers are arguing about

    There is a Hacker News submission for the piece, but it had no meaningful discussion at the time this brief was prepared. That silence is still mildly interesting. The post is not a benchmark launch or a new model release, so it does not trigger the usual speed-and-capability fight.

    The useful debate to have is more operational: how many agents can one engineer review without lowering standards, and what evidence should an agent produce before a human spends attention on it? Tests, screenshots, small diffs, and clear handoff notes are not glamorous. They are what make agent work reviewable.

    A fair skeptical view is that “orchestration tax” may just be a new label for old engineering management problems: code review queues, merge conflicts, and context switching. That is partly true. The new part is the imbalance. AI agents make it much easier to create parallel work than to create parallel understanding.

    The practical read on orchestration tax

    If you run AI coding agents, treat review attention as a capacity limit. Start with one or two concurrent agents on contained tasks. Require each agent to produce evidence before you review the result: a passing test, a screenshot, a concise change summary, or a small diff that can be understood in one sitting.
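
    Backpressure of this kind can be expressed as a small gate: a new agent starts only when running work plus unreviewed work is below what the reviewer can absorb, and finished work is not queued without evidence. This is a minimal sketch with hypothetical names. It models the bookkeeping only and does not launch or talk to any real agent tool.

```python
from collections import deque


class ReviewGate:
    """Caps in-flight agent work at the reviewer's capacity."""

    def __init__(self, review_capacity):
        self.review_capacity = review_capacity
        self.running = set()
        self.awaiting_review = deque()

    def can_start(self):
        # Work counts against capacity until a human has actually reviewed it.
        return len(self.running) + len(self.awaiting_review) < self.review_capacity

    def start(self, task):
        """Try to start an agent on a task; returns False when the gate is closed."""
        if not self.can_start():
            return False
        self.running.add(task)
        return True

    def finish(self, task, evidence):
        """Move a finished task into the review queue; evidence is required."""
        if not evidence:
            raise ValueError("no evidence: attach a test result, screenshot, or summary")
        self.running.remove(task)
        self.awaiting_review.append((task, evidence))

    def review_next(self):
        """Reviewer takes the oldest finished task, freeing one slot."""
        return self.awaiting_review.popleft() if self.awaiting_review else None


gate = ReviewGate(review_capacity=2)
print(gate.start("fix-flaky-test"))   # slot 1
print(gate.start("rename-module"))    # slot 2
print(gate.start("add-endpoint"))     # gate closed
gate.finish("fix-flaky-test", evidence="test suite passing")
print(gate.start("add-endpoint"))     # still closed: finished work is unreviewed
gate.review_next()
print(gate.start("add-endpoint"))     # review freed a slot
```

    The deliberate choice is that finishing a task does not free a slot. Only a completed review does, which is what ties the fleet size to attention rather than to agent runtime.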

    Do not use agent count as the success metric. Use merged work that you still understand. If the queue grows faster than you can review it, reduce the fleet. The orchestration tax is paid either way; the choice is whether you pay it deliberately in workflow design or later through shallow reviews and stale system knowledge.

    For product teams building agent platforms, the lesson is awkward but valuable. The winning feature may not be “run 20 agents.” It may be better batching, clearer review packets, dependency boundaries, and defaults that protect the human reviewer from becoming the bottleneck nobody wants to measure.
