Tag: Startups

  • AI product building needs taste more than raw speed

    AI product building can now turn rough ideas into prototypes, screens, and working flows faster than most teams could a few years ago. In a June 2026 Figma essay, chief product officer Yuhki Yamashita argues that once making gets easier, the real advantage moves to choosing the right thing to make and shaping it with enough care that users can tell the difference. The point is especially relevant for teams using Figma Make, AI coding tools, or prompt-based prototyping to compress the path from idea to demo.

    The short version

    • Figma says speed is becoming table stakes as AI lowers the cost of turning product ideas into prototypes.
    • The harder job for product teams is choosing a direction before they spend weeks refining the wrong one.
    • Product teams should compare several concrete directions in parallel, not fall in love with the first plausible output.
    • Craft still matters because AI defaults can make products feel polished but interchangeable.

    What happened

    Figma published a June 2026 essay by chief product officer Yuhki Yamashita on what changes when AI lets more people build products. The article, titled “What Matters When Anyone Can Build,” frames the shift around a concrete product pressure: if many teams can generate screens, prototypes, and flows quickly, shipping speed alone becomes a weaker signal of product quality.

    The essay argues that builders face two traps. Newer teams can go deep on the first idea because AI makes that idea feel alive almost immediately. More experienced teams can stay too abstract, comparing strategy maps and wireframes without seeing how the end-user experience feels. Figma’s proposed middle ground is to go broad and deep at the same time: explore multiple directions and push each far enough to be experienced, not merely described.

    That framing fits Figma’s own product direction. The company has been leaning into AI-assisted prototyping through tools such as Figma Make, where teams can generate interactive versions of an idea and compare them side by side. The article is part product philosophy, part pitch for a workflow where humans and AI agents test options together before a team commits.

    Why AI product building is worth watching

    AI product building is worth watching because the bottleneck is moving from production to judgment. When a team can make five plausible prototypes instead of one static mockup, the question changes from “can we build this?” to “which version deserves the team’s attention?” That is a more useful question, but it is also easier to dodge when every generated result looks polished enough to keep.

    Figma’s useful warning is that AI tools can accelerate a team inside a bad starting point. Agents tend to be helpful and agreeable. They extend the initial prompt, fill in missing pieces, and make the current direction look more complete. That makes local improvement feel productive even when the team has not checked whether the starting idea is the right one.

    The better habit is parallel exploration. Product managers, designers, founders, and engineers can ask for distinct directions, make each one concrete, and then compare actual flows. Teams get a better conversation when they react to screens, states, copy, and friction instead of arguing over a vague concept board.

    What does AI product building change for teams?

    AI product building changes the product team’s job by making taste, prioritization, and review harder to outsource. A model can propose layout patterns, write interface copy, or generate a clickable flow, but it does not know which trade-off fits the customer, the market, or the company’s appetite for risk. Teams still have to decide what problem is worth solving and what level of finish the first release needs.

    For founders and small app teams, this is a practical point rather than a design slogan. AI can shorten the distance between idea and demo, which is useful for app discovery, MVP testing, and investor conversations. It can also make weak ideas look more credible than they are. A generated prototype should start a sharper review: which user problem is this solving, what did the team intentionally leave out, and where does the experience still feel generic?

    For larger product teams, the collaboration pattern may matter as much as the tooling. Figma describes teammates and agents reacting together to multiple options. That pushes AI work out of a private prompt box and into a shared review process, where a team can challenge the defaults before they harden into the product.

    What the discussion is missing

    There was no reliable Hacker News thread for this specific Figma essay at the time of writing. The missing debate is still easy to name: Figma’s argument is strong on product craft, but it leaves open how teams should measure whether AI-assisted exploration actually improves decisions.

    The hard questions are operational. How many directions should a team generate before comparison becomes theater? Who decides when a prototype is realistic enough to test? How does a team avoid rewarding the most visually convincing option when the best product choice may be less flashy? Those questions matter because AI tools can produce a lot of plausible work, and plausible work can crowd out slow, uncomfortable customer evidence.

    A good discussion would also separate craft from polish. Figma is right that products can become interchangeable when teams accept model defaults. But a high-gloss interface is not the same as a cared-for product. The real test is whether the team can explain the choices behind the flow, the words, the empty states, the constraints, and the things it decided not to build.

    The practical read

    Teams using AI prototyping tools should treat the first output as evidence, not as a draft to protect. A practical review process starts with competing directions, pushes each one into a testable flow, and then compares the options against a real user problem. The generated UI matters only after the team can explain why this direction deserves to exist.

    The best use of this Figma essay is as a checklist for product reviews. Before a team ships, it should be able to answer three questions: did we explore more than one direction, did we choose this direction for a reason we can defend, and did we refine the parts users will actually feel? If the answer is no, the team may have used AI to move faster without getting closer to a better product.

    Readers tracking AI tools, design systems, and product workflows can find more related coverage in the IT & AI archive. The short version: faster building raises the bar for choosing well. Teams that treat AI product building as a review discipline, rather than a shortcut, will have a better chance of making products that feel intentional rather than merely generated.

  • Product strategy questions: stop debating wide vs deep

    Product strategy questions can sound smart and still waste a room. Shreyas Doshi’s X article argues that “should we go wide or deep?” is often the wrong opening move, especially for an AI startup suddenly facing larger incumbents. The better question is smaller and harder: which customer, which pain, which feature, and which reason to buy?

    The short version

    • Doshi describes an AI startup founder whose team started debating whether to widen the product or deepen the current workflow after two large incumbents entered the space.
    • His advice is to reject the binary because it pulls teams into abstract language before they have named the customer bet.
    • The useful product strategy questions sit one level lower: what feature will resonate, who will buy because of it, and why will they stay?
    • For founders and PMs, the article is a reminder that frameworks do not rescue weak customer understanding.

    What happened

    Doshi published an X article titled “Get to the Core of the Thing” after advising a founder running an AI startup. The founder’s team was anxious because two established companies had moved into the same market. Their proposed frame was familiar: should the product expand its surface area, or should the team sharpen what it already had?

    Doshi’s answer was blunt. Drop the frame. In his view, a wide-versus-deep debate lets smart people sound strategic while avoiding the work that actually matters: naming the specific bet on a specific feature for a specific customer.

    That distinction matters because many product meetings drift upward. Teams start with a real market threat, then jump into platform versus point solution, CAC versus LTV, horizontal versus vertical, or whatever analogy sounds good that week. Those phrases can be useful later. They are dangerous when they arrive before the team has done the customer work.

    Why this is worth watching

    The article lands because AI product teams are living through exactly this kind of pressure. When a bigger company enters a category, a smaller team can feel pushed to look broader, more platform-like, or more defensible on a slide. That instinct is understandable. It can also blur the only question a customer cares about: does this product solve my problem better than the thing I already use?

    The piece is also useful for non-AI teams. “Wide or deep” is only one version of the trap. Founders can swap in “enterprise or SMB,” “workflow or infrastructure,” “self-serve or sales-led,” and still avoid the harder work. The language changes. The escape hatch is the same.

    A better meeting starts with product strategy questions that make the team prove what it knows. Which buyer felt the pain last week? What did they try before? Which feature would change the buying conversation? What can the team ship quickly enough to learn from real use?

    For more technology and AI briefs, the IT & AI archive tracks similar product and builder signals without turning every link into a trend forecast.

    What the discussion is missing

    There does not appear to be a Hacker News thread tied to this article. That is probably fine. Doshi’s post is less a news event than a product operating note, and the missing debate is the practical one inside teams: when is a framework helpful, and when is it camouflage?

    The useful objection is that teams still need high-level strategy. A startup cannot interview its way out of every positioning decision. The point is not to ban strategy language. It is to use it after the team can state the customer bet in plain language.

    The other open question is speed. Doshi says the team needs real differentiation and needs to build it quickly. That is the part many teams will agree with and still struggle to do. The test is whether the next roadmap meeting produces a feature bet someone can validate, or another hour of vocabulary.

    The practical read

    If your team is stuck in a wide-versus-deep debate, pause the labels and rewrite the agenda around product strategy questions.

    Ask who the customer is in a way that points to a real person or account, not a segment name. Ask what that customer is doing today instead of using your product. Ask which feature would change the purchase or retention decision. Ask whether your team can build enough of that feature to learn before the market moves again.

    If you cannot answer those questions, choosing “wide” or “deep” will not fix the product. It will only make the uncertainty sound organized. If you can answer them, the shape of the product usually becomes less mysterious. You go wider where the customer bet requires reach, and deeper where the buying reason requires depth.

    Product strategy questions to ask first

    Use these product strategy questions before the roadmap turns into a framing contest:

    • Which customer call, support ticket, renewal risk, or lost deal are we using as evidence?
    • Which feature would make that customer buy, stay, expand, or switch?
    • What do we believe competitors cannot copy quickly enough to erase the advantage?
    • What can we ship in the next cycle that will make the answer clearer?

    That is less glamorous than a strategy offsite. It is also harder to fake.

  • AI application layer survival depends on workflow depth

    The AI application layer is not dead, but the easy part of it looks dangerous. Joe Schmidt IV at a16z argues that startups building generic model-plus-connector products are walking straight toward OpenAI and Anthropic, while companies that own messy business workflows still have room to build.

    The short version

    • Horizontal AI tools for coding, writing, image creation, and simple connector workflows benefit directly from better frontier models.
    • The safer AI application layer opportunities sit in vertical workflows where approvals, audits, legacy systems, and domain rules matter.
    • a16z names four practical defenses: data loops, model routing, cost control, and governance.
    • The Hacker News thread was small, but the useful objection was sharp: if the answer is bespoke vertical stacks, the road to broad automation is messier than the hype suggests.

    What happened

    Schmidt frames the current AI startup anxiety as a map. The “Yellow Brick Road” is the path the labs are already walking: strong models, standard connectors such as Google Drive, Slack, Salesforce, Notion, and GitHub, plus an agent orchestration layer. Products in that lane improve when the model improves, so the model owner has better margins, distribution, and pricing power.

    The other side of the map is what he calls the rest of Oz. These are workflows where a model call is only one piece of the product. A sales agent, insurance underwriting tool, legal workflow, finance process, or healthcare operation may need role-specific sub-agents, deterministic software, approvals, audit trails, and integration with old systems that cannot be swapped out casually.

    The argument is also a warning to founders. If a startup is selling a smarter chat interface over the same connectors as everyone else, it may be selling a feature the labs can bundle. If it becomes the system where work is routed, checked, logged, and improved, the AI application layer has a better shot at becoming durable software.

    Why this is worth watching

    The useful part of the piece is its test for depth. A tool that sits on top of a customer system is easier to replace. A system that runs the work, captures the data, and handles governance is harder to pull out.

    AI application layer test for founders

    Schmidt points to four defenses. First, production usage can create data and learning loops that do not exist on the public web. Second, a vertical company can route tasks across multiple model vendors, open-source fine-tunes, and cheaper tiers instead of depending on one lab’s stack. Third, it can tune cost against the level of intelligence each sub-task needs. Fourth, it can become the control plane for permissions, audit logs, and compliance in a specific industry.
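    The second and third defenses, model routing and cost control, can be sketched in a few lines. This is a minimal illustration rather than anything from the a16z piece: the tier names, the 1-to-5 capability scale, the prices, and the underwriting sub-tasks are all hypothetical. The idea is only that each sub-task goes to the cheapest model that is strong enough for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical label, not a real product name
    capability: int       # made-up 1-5 scale of reasoning strength
    cost_per_call: float  # made-up cost in dollars


# Hypothetical catalog spanning several vendors and an in-house fine-tune.
CATALOG = [
    ModelTier("small-open-finetune", capability=2, cost_per_call=0.001),
    ModelTier("mid-vendor-a", capability=3, cost_per_call=0.010),
    ModelTier("frontier-vendor-b", capability=5, cost_per_call=0.080),
]


def route(required_capability, catalog=CATALOG):
    """Return the cheapest tier that meets the sub-task's capability need."""
    eligible = [m for m in catalog if m.capability >= required_capability]
    if not eligible:
        raise ValueError(f"no model meets capability {required_capability}")
    return min(eligible, key=lambda m: m.cost_per_call)


# Sub-tasks in a hypothetical underwriting workflow, each with its own need.
SUBTASKS = {"extract_fields": 2, "summarize_file": 3, "assess_risk": 5}

plan = {task: route(need).name for task, need in SUBTASKS.items()}
print(plan)
```

    In a real product the capability score would probably come from evals on the company's own production data, which is where the first defense (data loops) feeds the routing decision.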

    That is also where the claim gets less glamorous. Much of the defensibility sounds like ordinary software work: deployment, edge cases, data cleanup, customer-specific configuration, permissions, and support. For more coverage of this kind of software shift, the IT & AI archive tracks related product and infrastructure stories.

    What Hacker News readers are arguing about

    The Hacker News discussion was tiny, so it should not be treated as a market signal. Still, one comment captured the strongest skeptical read: if the advice is to build bespoke vertical AI stacks, that sounds less like an imminent general-intelligence takeover and more like another generation of custom enterprise software.

    The commenter also raised three practical blockers. Many business processes are fuzzy because they exist to absorb edge cases. Some of the most valuable domains have security or compliance limits that make third-party inference hard to adopt. And if companies need more programmers to rebuild workflows around AI, that complicates the simple story that agents will replace labor by themselves.

    That objection does not kill the a16z thesis. It makes it more grounded. The AI application layer may survive because the hard work is not only model intelligence. It is the boring, expensive work of turning a messy process into software a customer can trust.

    The practical read

    Founders can use this as a quick filter. Count the steps in the workflow. Count the systems touched. Ask who approves the output, what gets logged, and what breaks if the model is wrong. If the answer is mostly “the user can rerun the prompt,” the product is probably on the road where labs have the advantage.

    If the answer involves customer-specific rules, compliance, multiple handoffs, data rights, and measurable business outcomes, the product has a better chance. That does not make it easy. It means the moat is less about having a clever agent demo and more about owning the work surface where the customer actually operates.

    For app builders, the app store optimization (ASO) angle is similar: discovery will reward products that can explain a specific job and result, not another generic AI assistant claim. The AI application layer needs narrower promises and deeper execution.

  • AI harness design is becoming the real software moat

    Tomasz Tunguz argues that the next software fight is moving away from polished SaaS screens and toward the AI harness, the operating layer that turns an LLM into something closer to a dependable worker. His useful framing is simple: models are powerful, but production agents need context, tools, memory, sandboxes, logs, policy, and cost control before they can handle real work.

    The short version: AI harness

    • Tunguz describes seven parts of an AI harness: context and memory, tools and action, orchestration, state, sandboxed compute, observability, and cost-aware workflow design.
    • The argument is less about replacing SaaS overnight and more about where software products now create value: in the runtime around the model.
    • For builders, the hard part is no longer choosing a model alone. It is deciding what the agent can see, what it can do, when it stops, and who can audit it later.
    • The startup opening is domain depth. If everyone can rent similar models, the product edge shifts toward messy workflow knowledge and safe execution.

    What happened

    Tunguz published “Software After AI,” a short essay about the stack that sits around AI agents, on May 27, 2026. The piece uses the word “harness” deliberately. A raw model can answer questions, but a working product has to constrain that model, feed it the right business context, expose tools safely, resume work after failures, and leave an audit trail.

    The seven-part list is practical rather than futuristic. Context and memory cover retrieval, short-term task history, and the company-specific recipes people usually keep in their heads. Tools and action cover registries, argument validation, approvals, dispatch, and failure handling. Orchestration covers the think-act-observe loop. State and persistence cover checkpoints and artifacts. Sandbox and compute cover isolated workspaces and credentials outside the model. Observability and governance cover tracing, evals, guardrails, and human review. Cost and workflow optimization cover the decision of which steps should be deterministic, which model should run each step, and where knowledge should live.
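    To make the “tools and action” part concrete, here is a minimal sketch of a tool registry with argument validation, an approval gate, dispatch, failure handling, and an audit log. It is an illustration under stated assumptions, not Tunguz’s design: the Harness class, the tool names, and the status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    func: Callable[..., Any]
    required_args: set
    needs_approval: bool = False


@dataclass
class Harness:
    tools: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, name, func, required_args, needs_approval=False):
        self.tools[name] = Tool(func, set(required_args), needs_approval)

    def dispatch(self, name, args, approved=False):
        """Validate, gate, run, and log one tool call requested by the model."""
        entry = {"tool": name, "args": args}
        tool = self.tools.get(name)
        if tool is None:
            entry["status"] = "rejected: unknown tool"
        elif set(args) != tool.required_args:
            entry["status"] = "rejected: bad arguments"
        elif tool.needs_approval and not approved:
            entry["status"] = "pending approval"
        else:
            try:
                entry["result"] = tool.func(**args)
                entry["status"] = "ok"
            except Exception as exc:  # failure handling: record, do not crash
                entry["status"] = f"failed: {exc}"
        self.audit_log.append(entry)  # every attempt is logged, even rejections
        return entry


# Hypothetical tools: a read-only lookup and a sensitive action.
harness = Harness()
harness.register(
    "lookup_order",
    lambda order_id: {"order_id": order_id, "total": 40},
    ["order_id"],
)
harness.register(
    "refund",
    lambda order_id, amount: f"refunded {amount}",
    ["order_id", "amount"],
    needs_approval=True,
)

print(harness.dispatch("refund", {"order_id": "A1", "amount": 40})["status"])
print(harness.dispatch("refund", {"order_id": "A1", "amount": 40}, approved=True)["status"])
```

    The design choice worth noticing is that the model never calls a function directly. Every request passes through dispatch, so permissions and logging live in ordinary deterministic code rather than in the prompt.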

    Why this is worth watching

    The term AI harness is useful because it names the part of agent software that demos often hide. A demo can succeed once with a clever prompt. A product has to succeed repeatedly when the CRM record is stale, the tool call fails, the user asks for a risky change, or the model forgets what it was doing three steps ago.

    That is where the SaaS comparison gets interesting. Traditional SaaS products gave users a fixed interface over a database and a workflow. Agent products may hide more of the interface, but they cannot hide responsibility. If an agent refunds a customer, rewrites a contract, changes a cloud setting, or files a report, the company still needs permissions, logs, rollback paths, and a way to explain what happened.

    This is also a decent filter for AI product pitches. If a vendor talks only about the model, the demo, or a benchmark, the product may still be thin. The durable work is in the boring layer: retrieval quality, tool boundaries, state recovery, sandbox rules, evals, and unit economics. Readers who track AI infrastructure and developer tooling can find more coverage in the IT & AI archive.

    What the discussion is missing

    I could not find a dedicated Hacker News thread for this exact article. That absence is a little unfortunate, because the strongest debate would probably be among people building agents in production rather than people judging them from a launch video.

    The missing questions are the useful ones. How much of this AI harness should be a platform, and how much has to be custom per industry? Will MCP-style tool registries make agents safer, or will they mostly make unsafe access easier to wire up? Can evals catch the failures that matter in legal, medical, finance, or customer operations? And at what point does the harness become so complex that a deterministic workflow would have been cheaper and safer?

    Those are not objections to Tunguz’s framing. They are the next layer of the conversation. The essay says the harness is the new software battleground. The harder question is which parts of that battleground can be standardized.

    The practical read

    If you are building an agentic product, start with the AI harness before you polish the chat surface. Write down the tools the agent can call, the data it can read, the approvals it needs, the state it must preserve, and the failure cases it must recover from. Then decide which model belongs in each step.
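    The “state it must preserve” and “failure cases it must recover from” items are the easiest to postpone, so a small sketch may help. The example below assumes a task made of named steps and a JSON checkpoint file. The function name, the step names, and the simulated failure are all hypothetical, and a production harness would need more (locking, retries, idempotent side effects).

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, saving progress after each one.

    If the checkpoint file already exists, completed steps are skipped,
    so a crashed task resumes where it stopped instead of starting over.
    """
    path = Path(checkpoint_path)
    state = json.loads(path.read_text()) if path.exists() else {"done": {}}
    for name, func in steps:
        if name in state["done"]:
            continue  # finished in an earlier run
        state["done"][name] = func(state["done"])
        path.write_text(json.dumps(state))  # persist before moving on
    return state["done"]


calls = []  # records which steps actually executed, for illustration


def fetch(done):
    calls.append("fetch")
    return {"invoice_total": 120}


def flaky_summarize(done):
    calls.append("summarize")
    if calls.count("summarize") == 1:
        raise RuntimeError("simulated tool failure")
    return f"total is {done['fetch']['invoice_total']}"


STEPS = [("fetch", fetch), ("summarize", flaky_summarize)]
checkpoint = Path(tempfile.mkdtemp()) / "task.json"

try:
    run_with_checkpoints(STEPS, checkpoint)  # first run fails at step two
except RuntimeError:
    pass

result = run_with_checkpoints(STEPS, checkpoint)  # second run resumes
print(result)
print(calls)
```

    On the second run the fetch step is not repeated, because its result was already written to disk. That is the property to look for in a real harness: recovery should not re-run steps that already had side effects.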

    If you are buying AI software, ask a different set of questions. Do not stop at “Which model powers this?” Ask what context system it uses, how tool calls are logged, how sensitive actions are approved, how tasks resume after a crash, how evals run, and how costs are controlled as usage grows.

    And if you are a startup, the point is not to out-model the labs. You probably will not. The better bet is to know a workflow so well that your AI harness handles the annoying exceptions, handoffs, and audit needs that a general-purpose agent will miss.

  • Boring technology is a sharper engineering bet than it sounds

    Boring technology is not a plea for timid engineering. Dan McKinley’s 2015 essay argues that teams have a limited budget for novelty, and spending it on databases, queues, deployment plumbing, and service discovery can quietly steal attention from the product itself.

    The short version

    • McKinley’s core idea is the “innovation token”: every unfamiliar technology consumes attention, debugging time, hiring capacity, and operational patience.
    • “Boring” means well understood, not low quality. MySQL, Postgres, Python, cron, and similar tools are boring because their failure modes are easier to predict.
    • The advice is strongest for startups and small teams. A tool that looks optimal for one subsystem can make the whole company harder to operate.
    • New technology still has a place when it is central to the product or removes a real constraint. The bar should be higher than “the demo looked good.”

    What happened

    Dan McKinley published “Choose Boring Technology” in 2015, drawing on lessons from his time in technical leadership at Etsy. The essay has kept circulating because it gives engineers a simple way to talk about platform risk without turning every stack debate into taste warfare.

    The memorable frame is that each company gets only a few innovation tokens. Pick Node.js, MongoDB, a new service discovery system, or a homegrown database, and you have spent one. The exact examples have aged, which is part of the point. Some technologies that felt risky in 2015 are ordinary now. The useful question is not whether a named tool is permanently safe or unsafe. It is whether your team already understands the tool’s limits, failure modes, and maintenance cost.

    McKinley is not arguing that teams should freeze their stack forever. He is arguing for global optimization. A tool can be the best local answer for one feature and still be the wrong company-level choice once monitoring, testing, hiring, incident response, and handoff costs enter the picture.

    Why this is worth watching

    The essay reads differently in 2026 because AI infrastructure has made shiny-stack pressure worse. A team can now add a vector database, orchestration framework, eval harness, agent runtime, observability layer, and model gateway before it has proved that the product solves a real user problem.

    That does not mean teams should avoid the AI stack. It means the “innovation token” model is even more useful. If the product’s real risk is model quality, workflow fit, or distribution, then spending novelty on routine plumbing is expensive. For more posts on practical tech judgment, see the IT & AI archive.

    The sharper reading is this: boring technology buys room to be bold somewhere else. A startup may need a risky model workflow or a new interface pattern. It probably does not need five risky infrastructure choices at the same time.

    What Hacker News readers are arguing about

    The Hacker News discussion is old but still useful because it shows where the advice meets developer identity. Many readers agreed with the broad lesson: code and infrastructure carry a maintenance cost, and chasing trends can become resume padding disguised as architecture.

    The pushback was more interesting than a simple pro-boring consensus. Some commenters argued that code is also an asset, not only a liability, and that speculative learning is part of becoming a better engineer. Others pointed out that “boring” changes with time. Node.js and MongoDB were used as examples of novelty in the original essay, but by the 2021 discussion several readers argued that Node had become mainstream enough to count as boring in many teams.

    The practical split is really about context. A consultancy, database company, or developer platform may have a good reason to spend tokens on the core technology it sells. A payments startup or marketplace usually has less reason to invent its own operational substrate. The thread also returns to hiring: familiar stacks are easier to staff, review, debug, and hand off when the first expert leaves.

    Boring technology in practice

    A useful stack review can be blunt. List every major system that needs special knowledge: database, queue, runtime, deployment layer, auth, observability, AI orchestration, and data pipeline. Then ask which choices are essential to the company’s edge and which ones are merely interesting.

    For each nonstandard choice, write down who can operate it during an incident, how it fails under load, how the team tests it, what migration would cost, and whether the same user outcome could be reached with a familiar tool. If nobody can answer those questions, the team may be spending an innovation token without admitting it.
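    That review can be run as a small checklist. The sketch below is one hypothetical way to encode it, not something from McKinley’s essay: the field names, the example stack, and the idea of flagging any nonstandard choice with a blank answer are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# The five questions from the paragraph above, as hypothetical field names.
REVIEW_QUESTIONS = (
    "incident_operator",     # who can operate it during an incident
    "failure_under_load",    # how it fails under load
    "test_approach",         # how the team tests it
    "migration_cost",        # what migration would cost
    "familiar_alternative",  # whether a familiar tool reaches the same outcome
)


@dataclass
class StackChoice:
    name: str
    standard: bool  # True if the team already runs it comfortably
    answers: dict = field(default_factory=dict)

    def unanswered(self):
        return [q for q in REVIEW_QUESTIONS if not self.answers.get(q)]


def unadmitted_tokens(stack):
    """Nonstandard choices with at least one review question left blank."""
    return {
        c.name: c.unanswered()
        for c in stack
        if not c.standard and c.unanswered()
    }


# Hypothetical stack for illustration only.
stack = [
    StackChoice("postgres", standard=True),
    StackChoice(
        "vector-db",
        standard=False,
        answers={
            "incident_operator": "on-call lead",
            "test_approach": "staging replay",
            "familiar_alternative": "Postgres full-text search",
        },
    ),
    StackChoice(
        "agent-runtime",
        standard=False,
        answers={q: "documented" for q in REVIEW_QUESTIONS},
    ),
]

flagged = unadmitted_tokens(stack)
print(flagged)
```

    Here only the vector database is flagged, because nobody has written down how it fails under load or what leaving it would cost. The agent runtime is still a spent token, but it is one the team has at least accounted for.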

    This is especially relevant for app builders and developer tool teams. Product discovery and marketplace rankings tend to reward visible features, but retention often comes from reliability. A tool that lets customers keep their boring stack while adding one valuable capability may be easier to adopt than a product that demands a full platform rethink.

    The practical read

    Use boring technology as a default, not a religion. If a new tool removes the main bottleneck in your business, test it seriously. If it only makes the architecture diagram look more current, leave it out.

    The best version of McKinley’s advice is not anti-innovation. It is anti-waste. Save the weirdness for the part of the product where weirdness actually compounds. Everywhere else, boring is often what lets the team keep shipping.

  • AI productivity claims are running ahead of the work

    TechCrunch’s report on Aaron Levie’s warning about “AI psychosis” among CEOs lands because it names a familiar gap: executives see a strong demo, while teams still have to make the work correct, safe, and shippable. AI productivity claims can sound persuasive before that last-mile work is counted. The issue is not whether AI agents are useful. They are. The question is whether companies can tell the difference between a good prototype and a finished business process.

    The short version

    • Box CEO Aaron Levie argued that CEOs are especially vulnerable to overestimating AI because they sit far from the last mile of work.
    • Layoffs.fyi counted 115,430 tech layoffs across 152 companies in the first five months of 2026, close to the 124,636 total it tracked for all of 2025.
    • ClickUp CEO Zeb Evans said the company cut 22% of staff after deploying roughly 3,000 AI agents, a useful case study in how quickly the narrative is moving.
    • The hard part is measurement: more drafts, tickets, pull requests, or proposals do not automatically mean better output.
    • Hacker News readers mostly argued about two things: whether “psychosis” is a fair label, and whether executives understand the review work that AI creates.

    What happened

    The TechCrunch piece starts with Levie’s claim that CEOs are “uniquely prone to AI psychosis” because they are far enough away from frontline work to miss the remaining labor needed to turn AI output into value. That is the sharpest point in the article. A CEO can ask an agent to draft a contract, generate HTML, summarize a customer call, or produce a product mockup. Those outputs can look convincing in a meeting. They still need review, context, policy checks, security judgment, and someone willing to be accountable when the answer is wrong.

    The article also puts that argument next to a rough labor-market backdrop. Layoffs.fyi’s tracker shows 115,430 tech layoffs from 152 companies in the first five months of 2026. That does not prove AI caused the layoffs. It does show why the story is sensitive: AI is becoming part of the language companies use when they explain smaller teams, faster execution, and new operating models.
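    For scale, the Layoffs.fyi figures quoted in this brief can be put side by side with some quick arithmetic. The only inputs are the three numbers cited here. The monthly averages assume layoffs were spread evenly across each period, which is a simplification, and nothing in the calculation says anything about cause.

```python
layoffs_2026_jan_may = 115_430    # Layoffs.fyi count cited in this brief
layoffs_2025_full_year = 124_636  # Layoffs.fyi full-year total cited in this brief
companies_2026 = 152

share_of_2025 = layoffs_2026_jan_may / layoffs_2025_full_year
monthly_avg_2026 = layoffs_2026_jan_may / 5     # five months, assumed even
monthly_avg_2025 = layoffs_2025_full_year / 12  # twelve months, assumed even
per_company_2026 = layoffs_2026_jan_may / companies_2026

print(f"Jan-May 2026 as a share of all 2025: {share_of_2025:.1%}")
print(f"Average per month, 2026 so far: {monthly_avg_2026:,.0f}")
print(f"Average per month, 2025: {monthly_avg_2025:,.0f}")
print(f"Average per company, 2026 so far: {per_company_2026:,.0f}")
```

    Five months of 2026 already account for roughly 93% of the 2025 total, which works out to a monthly pace a little over twice last year’s.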

    ClickUp is the most concrete example in the report. CEO Zeb Evans said the company had deployed about 3,000 AI agents and reduced staff by 22%, while trying to build what he called a “100x org.” That framing is exactly why this debate matters for builders. If agents become part of the org chart, companies need a much better answer to a basic operating question: who reviews the agent’s work, and what happens when the agent is confidently wrong?

    Why this is worth watching for AI productivity claims

    The useful read is that AI adoption is moving faster than AI measurement. A team can count how many agent runs completed. It can count the number of documents, tickets, or pull requests generated. Those are activity metrics. They do not say much about whether the work reduced customer pain, lowered error rates, increased revenue per employee, or freed experts from low-value chores.

    That distinction matters because the research record is still mixed. California Management Review’s summary of AI productivity evidence warns against easy claims that AI adoption produces broad productivity gains by itself. An NBER paper on executives and AI productivity points to a gap between perceived gains and measured outcomes. MIT FutureTech’s labor-task research also suggests that many tasks remain harder to automate at human-level quality than the demo cycle implies.

    The bottleneck may simply move to management. Harvard Business Review has made a similar point: if AI increases the volume of output, managers can become the constraint because more work needs to be read, compared, approved, or rejected. Anyone who has reviewed AI-generated code or AI-written legal text knows the pattern. The first draft arrives faster. The expensive part is deciding whether it can be trusted.

    What Hacker News readers are arguing about

    The Hacker News thread around the TechCrunch article is active and messy in the usual useful way. A large part of the discussion focuses on the word “psychosis.” Some readers called it clickbait or a cheap use of medical language. Others defended it as a cultural shorthand for executives becoming detached from what AI can actually do. The split is worth noting because it mirrors the broader AI debate: people agree there is overconfidence, then fight over how harshly to name it.

    The more practical thread is about distance from the work. Several commenters argued that this is not new. Executives have long seen a toy example, assumed the hard part was solved, and pushed a rollout that frontline teams had to absorb. The AI-specific twist is that LLMs can flatter the user while producing a plausible artifact. A CEO who prompts a chatbot into a small front-end demo may come away feeling closer to engineering than they really are.

    There was also a strong operator objection: AI can create review debt. One commenter described a CEO who hit real walls around data architecture and deployment after experimenting with AI prototyping. That is the sane version of the story. The tool helped explore an idea, then exposed the need for human-designed infrastructure. Another repeated concern was failure rate. If a model gets 80% or 90% of text tasks right, the remaining errors can still be disastrous in legal, security, finance, support, or production engineering contexts.
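
    The failure-rate concern is easy to put in numbers. The sketch below is illustrative only: the function names are hypothetical, and it assumes each task fails independently with the same probability, which real workloads may not satisfy. Under that assumption, 90% accuracy over 1,000 tasks still leaves about 100 wrong outputs for someone to find, and a batch of just 10 tasks has roughly a 65% chance of containing at least one error (about 89% at 80% accuracy).

```python
def expected_errors(accuracy: float, tasks: int) -> float:
    """Expected number of wrong outputs when each task succeeds with probability `accuracy`."""
    return (1.0 - accuracy) * tasks


def chance_of_any_error(accuracy: float, tasks: int) -> float:
    """Probability that at least one of `tasks` outputs is wrong.

    Assumes errors are independent, which is a simplification.
    """
    return 1.0 - accuracy ** tasks


# Accuracy levels mentioned in the thread; task counts are made up for illustration.
for accuracy in (0.80, 0.90):
    print(
        f"accuracy={accuracy:.0%} "
        f"expected errors in 1000 tasks={expected_errors(accuracy, 1000):.0f} "
        f"chance of any error in 10 tasks={chance_of_any_error(accuracy, 10):.0%}"
    )
```

    The per-task accuracy sounds high, but the review burden scales with volume: the more an agent produces, the more wrong outputs are waiting somewhere in the pile.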

    The thread is not evidence, but it is a useful sentiment check. Builders are not rejecting AI agents outright. They are rejecting the jump from “this generated something impressive” to “this can replace the people who know where the traps are.”

    The practical read

    Companies should treat AI productivity claims like product claims. Define the workflow, the baseline, the quality bar, and the failure mode before tying the result to headcount. If an agent writes support replies, measure refund errors, escalation rates, customer satisfaction, and policy violations. If it writes code, measure review time, defect rate, rollback frequency, and maintenance cost. If it drafts contracts, measure legal review burden and clause-level risk.
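
    One way to make that concrete is to record the same outcome metrics before and after an agent is introduced, then check them against a quality bar set in advance. The following is a minimal sketch, not a standard method: the `WorkflowMetrics` fields, the function names, the threshold, and every number in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkflowMetrics:
    """Outcome metrics for one workflow over a fixed period (hypothetical fields)."""
    tasks: int              # units of work completed
    defects: int            # outputs later found to be wrong
    review_minutes: float   # total human review time spent


def defect_rate(m: WorkflowMetrics) -> float:
    """Share of completed tasks that turned out to be wrong."""
    return m.defects / m.tasks if m.tasks else 0.0


def review_minutes_per_task(m: WorkflowMetrics) -> float:
    """Average human review time per completed task."""
    return m.review_minutes / m.tasks if m.tasks else 0.0


def passes_quality_bar(baseline: WorkflowMetrics,
                       with_agent: WorkflowMetrics,
                       max_defect_rate: float) -> bool:
    """True only if the agent run meets the absolute bar and is no worse than the baseline."""
    rate = defect_rate(with_agent)
    return rate <= max_defect_rate and rate <= defect_rate(baseline)


# Made-up numbers: the agent run produces more output, but with more defects.
baseline = WorkflowMetrics(tasks=200, defects=6, review_minutes=600)
with_agent = WorkflowMetrics(tasks=500, defects=40, review_minutes=2500)

print("baseline defect rate:", defect_rate(baseline))
print("agent defect rate:", defect_rate(with_agent))
print("agent review minutes per task:", review_minutes_per_task(with_agent))
print("passes quality bar:", passes_quality_bar(baseline, with_agent, max_defect_rate=0.05))
```

    In this invented example the task count more than doubles, which would look like a win on an activity dashboard, yet the run fails the check because both the defect rate and the review time per task went up.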

    For AI agent startups and workplace apps, the pitch also needs to mature. “We deployed 3,000 agents” is a flashy number, but buyers will eventually ask which agents survived contact with real work. The products that win will probably be the boring ones that make review easier, preserve audit trails, route uncertain cases to humans, and prove that cycle time improved without hiding risk.

    For workers, the signal is more personal. The safer skill is not prompt fluency by itself. It is judgment over the last 20%: checking the output, knowing the domain constraints, spotting the quiet mistake, and deciding when automation should stop.

    Sources