Tag: Backend Engineering

  • Shopify MySQL inventory reservations: 5 lessons

    Shopify’s MySQL inventory reservations are a useful reminder that a database migration can be less about raw speed than about removing awkward failure modes. Shopify moved checkout-time inventory holds from Redis into MySQL so that reservations and the inventory ledger could live inside the same ACID transaction boundary. The interesting part is how much work it took around SKIP LOCKED, schema shape, isolation level, lock ordering, and connection visibility before the design held up at peak commerce traffic.

    The short version

    • Shopify’s old Redis reservation system handled concurrency, but the Redis reservation state and the MySQL inventory ledger could not be updated in one atomic step.
    • The MySQL design used one row per sellable unit, capped the available row pool at 1,000 per item/location pair, and relied on SKIP LOCKED to avoid waiting on rows another checkout had already taken.
    • The migration was not a blanket “MySQL beats Redis” claim. It worked because Shopify changed the data model, tuned transaction behavior, and instrumented the full checkout path.
    • The surprising bottleneck was connection hold time, not simply reservation query latency or database CPU.
    • Shopify says the system handled high-volume flash-sale traffic with writer CPU under 50% and reader CPU under 16% after cleanup and configuration changes.

    What happened

    Shopify published an engineering write-up explaining how it replaced a Redis-backed inventory reservation path with a MySQL design for checkout. The reservation step is the short hold that happens while a buyer is paying. If it is wrong, one buyer may purchase stock that no longer exists, or another buyer may be told an item is sold out when it is still available.

    The old Redis model used operations like DECR and INCR on quantity keys. That was fast enough for concurrency, but it split the reservation state from the MySQL inventory ledger. Once payment succeeded, Shopify had to update MySQL and clean up Redis without a single atomic transaction across both systems.

    The new design put reservations in MySQL. Instead of updating one quantity column for an item, Shopify represented sellable units as rows. A checkout that needs three units selects three available rows, skips any rows locked by other transactions, and moves the selected units inside the same database transaction. That is the core of the design.
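
    The row-per-unit claim can be sketched in a few lines of Python. This is a minimal illustration, not Shopify’s code: the table inventory_units, its columns, the 'available' and 'reserved' states, and the choice to "move" units with an UPDATE are all assumptions, and the cursor is any DB-API cursor that uses %s placeholders (as common MySQL drivers do).

```python
# Hypothetical schema: inventory_units(id, item_id, location_id, state).
# Names and states are illustrative assumptions, not Shopify's schema.
RESERVE_SQL = (
    "SELECT id FROM inventory_units "
    "WHERE item_id = %s AND location_id = %s AND state = 'available' "
    "LIMIT %s FOR UPDATE SKIP LOCKED"
)


def reserve_units(cursor, item_id, location_id, quantity):
    """Claim `quantity` unit rows inside the caller's open transaction.

    Returns the claimed row ids, or an empty list when fewer than
    `quantity` unlocked rows exist. On an empty result the caller should
    roll back so the partially locked rows are released.
    """
    cursor.execute(RESERVE_SQL, (item_id, location_id, quantity))
    ids = [row[0] for row in cursor.fetchall()]
    if len(ids) < quantity:
        return []
    placeholders = ", ".join(["%s"] * len(ids))
    # Assumption: "moving" a unit is modeled here as a state change.
    cursor.execute(
        f"UPDATE inventory_units SET state = 'reserved' WHERE id IN ({placeholders})",
        ids,
    )
    return ids
```

    Because locked rows are skipped rather than waited on, two checkouts for the same item tend to pick different rows instead of queueing behind each other.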

    Why this is worth watching

    The practical lesson is that SKIP LOCKED is not magic dust. It only helped because Shopify changed the shape of the data. A single hot row with a quantity column still creates contention. A pool of unit rows gives MySQL something useful to skip.

    Shopify also bounded the row pool. Keeping one row for every unit everywhere would explode for high-stock items, so the system caps available rows at 1,000 per item/location combination and uses a replenishment process to refill the pool from the ledger. That detail matters. It turns a clever locking trick into a design that can survive real catalog size.
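
    The arithmetic behind a bounded pool is small enough to write down. The sketch below assumes the replenishment pass knows two numbers: how many units the ledger still considers sellable (excluding units already held by reservations) and how many available rows are currently in the pool. The function name and inputs are hypothetical; only the 1,000-row cap comes from the write-up.

```python
POOL_CAP = 1_000  # cap on available rows per item/location, per the write-up


def rows_to_add(ledger_available, pool_rows, cap=POOL_CAP):
    """How many unit rows a replenishment pass should insert.

    ledger_available: sellable units according to the ledger (assumed input)
    pool_rows: available unit rows currently in the pool (assumed input)
    """
    target = min(ledger_available, cap)  # never exceed the cap or real stock
    return max(target - pool_rows, 0)    # never remove rows in this pass


# An item with 50,000 units in stock and 400 rows left gets topped up to 1,000.
print(rows_to_add(50_000, 400))  # 600
# A low-stock item is bounded by the ledger, not the cap.
print(rows_to_add(250, 100))     # 150
```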

    The engineering work continued below the schema. Shopify moved the relevant transactions to READ COMMITTED to avoid gap-lock behavior that blocked replenishment, fixed deadlocks by enforcing a consistent table lock order, and batched multi-line carts with UNION ALL to reduce round trips. For readers who follow backend infrastructure, the broader IT & AI archive is useful because this is the kind of systems story where the headline undersells the operational work.
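
    A multi-line cart batched with UNION ALL might look like the sketch below. It is an assumption-heavy illustration: the table and column names are hypothetical, sorting the cart lines is an extra row-level ordering habit rather than the table-order fix Shopify describes, and the parenthesized locking clause inside each UNION branch should be verified against the MySQL version in use.

```python
# Applies to the next transaction on the session (standard MySQL statement).
SET_ISOLATION = "SET TRANSACTION ISOLATION LEVEL READ COMMITTED"

# One branch per cart line; hypothetical table and column names.
BRANCH = (
    "(SELECT id, item_id, location_id FROM inventory_units "
    "WHERE item_id = %s AND location_id = %s AND state = 'available' "
    "LIMIT %s FOR UPDATE SKIP LOCKED)"
)


def build_batched_reserve(lines):
    """Build one statement that claims rows for every cart line.

    lines: list of (item_id, location_id, quantity) tuples.
    Returns (sql, params) for a DB-API cursor using %s placeholders.
    """
    if not lines:
        raise ValueError("cart has no lines")
    # Assumption: sort so every checkout touches item/location pairs
    # in the same order, which keeps lock acquisition predictable.
    ordered = sorted(lines)
    sql = " UNION ALL ".join([BRANCH] * len(ordered))
    params = [value for line in ordered for value in line]
    return sql, params


sql, params = build_batched_reserve([(7, 1, 2), (3, 1, 1)])
print(params)  # [3, 1, 1, 7, 1, 2]
```

    The point of the batch is one round trip for the whole cart instead of one per line, which also shortens the time the transaction holds its connection.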

    What Hacker News readers are arguing about

    The public Hacker News submissions I found were quiet: low score, no comments on the linked discussion. So there is no meaningful community argument to summarize from that thread.

    That silence is still telling in a small way. This is not a flashy framework launch or a new database benchmark. It is an operations-heavy post about transaction boundaries, lock behavior, and connection pools. The missing debate is the one backend teams should have internally: whether a separate coordination service is buying enough simplicity to justify the consistency and operating cost it adds.

    If a team reads the Shopify story as “replace Redis with MySQL,” it will copy the least important part. The useful question is narrower: can the source of truth, the reservation state, and the failure recovery path sit inside one transaction without making the checkout path a bad neighbor for every other database workload?

    The practical read

    Shopify’s write-up is worth reading before you add Redis, Kafka, or a custom lock service to a checkout path. The first check is not “which tool is faster?” It is “what state must change atomically, and where does that state live?”

    For builders, the migration suggests five concrete checks:

    • Model contention explicitly. If every buyer fights over the same row, the database choice will not save you.
    • Test the isolation level you actually need. Default settings can be wrong for a narrow high-throughput path.
    • Keep lock acquisition order boring and consistent.
    • Measure connection hold time by caller, not only query latency.
    • Roll out with shadow mode or dual writes when the old system is still the safer source of truth.
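
    The fourth check, measuring connection hold time by caller, is easy to prototype. The sketch below is a minimal illustration and not how Shopify instrumented its stack: the pool object and its acquire() and release() methods are hypothetical, and a real system would export the timings to a metrics backend instead of an in-memory dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# caller label -> list of hold durations in seconds
hold_seconds = defaultdict(list)


@contextmanager
def held_connection(pool, caller):
    """Check out a connection and record how long `caller` keeps it.

    `pool` is any object with acquire() and release(conn); both names
    are assumptions for this sketch.
    """
    conn = pool.acquire()
    start = time.perf_counter()
    try:
        yield conn
    finally:
        # Recorded even if the caller raises, so slow failure paths show up too.
        hold_seconds[caller].append(time.perf_counter() - start)
        pool.release(conn)
```

    A call site would wrap its work in something like `with held_connection(pool, "checkout.reserve") as conn:`. The hold time covers everything inside the block, including application work done between queries, which query latency alone does not show.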

    The app-builder angle is straightforward: checkout reliability affects conversion. For commerce apps, marketplaces, and inventory plugins, a reservation bug is not a backend detail. It can become a canceled order, a support ticket, or a merchant who stops trusting the platform.

  • Postgres workflows make durable execution feel boring

    Postgres workflows are getting a fresh look because DBOS argues that durable execution does not always need a separate orchestration service. The pitch is simple: store workflow state, step outputs, locks, and recovery checkpoints in PostgreSQL, then let application servers coordinate through the database they already operate.

    The short version

    • DBOS describes a durable execution model in which application servers poll a workflows table in Postgres, checkpoint each step, and recover crashed jobs from the last completed step.
    • The technical bet is that row locking, uniqueness constraints, indexes, SQL queries, and normal Postgres operations can replace a chunk of what teams buy from external orchestrators.
    • This is most attractive when the workflow is close to the application domain and the team already trusts Postgres in production.
    • The hard parts do not disappear. Payload size, hot tables, transaction retries, worker crashes, and retry semantics still need explicit design.
    • The broader developer-tool angle is practical: agent runs, video processing, document pipelines, and AI background jobs all need durable execution, but many teams do not want another distributed system first.

    What happened

    DBOS published a technical argument for Postgres workflows as a simpler durable execution architecture. In the conventional model, systems such as Temporal, Airflow, and AWS Step Functions coordinate workflow execution through a central orchestrator. A worker completes a step, reports the result to the orchestrator, and the orchestrator records the checkpoint before dispatching the next step.

    DBOS flips that arrangement. A client creates a workflow record in Postgres. Application servers dequeue work from the table, checkpoint step outputs directly to Postgres, and recover another server’s unfinished work if a process dies. The post points to locking clauses for safe worker competition, integrity constraints for detecting duplicate step writes, SQL for observability, and existing Postgres security and availability practices for operations.
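
    The checkpoint-and-recover idea can be sketched with a uniqueness constraint doing the duplicate detection. This is not DBOS’s schema or API: the step_outputs table and both function names are hypothetical, and sqlite3 stands in for Postgres only so the example runs without a server.

```python
import sqlite3

# sqlite3 is a stand-in for Postgres; the table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE step_outputs ("
    " workflow_id TEXT NOT NULL,"
    " step_index INTEGER NOT NULL,"
    " output TEXT NOT NULL,"
    " PRIMARY KEY (workflow_id, step_index))"
)


def checkpoint(workflow_id, step_index, output):
    """Record a step result; return False if the step was already recorded."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO step_outputs VALUES (?, ?, ?)",
                (workflow_id, step_index, output),
            )
        return True
    except sqlite3.IntegrityError:
        # Another worker (or an earlier attempt) already wrote this step.
        return False


def next_step(workflow_id):
    """Index of the first step with no checkpoint, used when recovering."""
    row = db.execute(
        "SELECT COALESCE(MAX(step_index) + 1, 0) FROM step_outputs"
        " WHERE workflow_id = ?",
        (workflow_id,),
    ).fetchone()
    return row[0]
```

    A recovering server would call next_step() for an unfinished workflow and resume from that index. A duplicate write from a worker that was presumed dead is rejected by the primary key instead of silently overwriting a result.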

    The article also claims that a single Postgres server can handle tens of thousands of workflows per second in the right setup, with distributed or sharded Postgres systems as later options. That number is less useful than the shape of the claim: durable execution is mostly about making progress durable, and a relational database is already built to make state durable.

    Why this is worth watching

    Postgres workflows are interesting because they move the orchestration question back into the data model. If each step result is a row with clear idempotency rules, the system becomes easier to inspect. A failed payment email, stuck file conversion, or half-finished AI agent run can be queried with SQL before anyone builds a custom dashboard.

    That is the best version of this idea. It does not say every team should replace Temporal tomorrow. It says many teams reach for a workflow platform before they have written down the actual state machine, retry boundary, and checkpoint model. Starting with Postgres can force those decisions into tables, indexes, and constraints. That can be refreshingly boring.

    There is also a product lesson here for developer-tool builders. The IT & AI archive keeps circling the same theme: teams want more reliability for background work, but they have little patience for heavy platforms unless the pain is already obvious. Postgres workflows fit that mood. They offer a path between ad hoc job queues and a full workflow stack.

    What Hacker News readers are arguing about

    The Hacker News discussion is useful because it separates the slogan from the operational details. Several engineers liked the general pattern, especially for queues built with SELECT FOR UPDATE SKIP LOCKED or advisory locks. The pro-Postgres camp mostly argued from experience: if Postgres is already in the stack, a workflow table can be cheaper and easier to reason about than another service.
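
    The queue pattern those commenters refer to usually looks something like the sketch below. The workflows table, its status, worker_id, and created_at columns, and the claim_next helper are all hypothetical; the cursor is assumed to be a DB-API cursor with %s placeholders, as in common Postgres drivers.

```python
# Hypothetical table: workflows(id, status, worker_id, created_at).
DEQUEUE_SQL = """
UPDATE workflows
   SET status = 'running', worker_id = %s
 WHERE id = (
        SELECT id
          FROM workflows
         WHERE status = 'pending'
         ORDER BY created_at
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id
"""


def claim_next(cursor, worker_id):
    """Claim one pending workflow for `worker_id`, or return None if none is free."""
    cursor.execute(DEQUEUE_SQL, (worker_id,))
    row = cursor.fetchone()
    return row[0] if row else None
```

    Workers that run this concurrently skip rows another worker has already locked, so they do not block each other or claim the same workflow twice.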

    The skepticism was more specific. One thread challenged the article’s mention of CockroachDB as a way to scale Postgres-like systems, with commenters pointing to compatibility gaps, missing operators, index limitations, and repeated serialization_failure retries in real systems. That is a reminder that “Postgres-compatible” is not the same as “Postgres with the same operational behavior.”

    Temporal also dominated part of the thread. Some commenters described large self-hosted Temporal deployments as expensive and infrastructure-heavy, while others pushed back that those workloads may be a poor fit or that Temporal Cloud pricing can look reasonable depending on event volume. The useful takeaway is not that Temporal is bad. It is that workflow engines have their own cost curve, and teams should compare that curve against the complexity they would add to Postgres.

    A smaller but important thread focused on payload size. People were wary of putting large documents or video artifacts directly in a queue or workflow table. The practical pattern is the old claim-check approach: store the large object elsewhere, then pass a reference through the workflow state. That applies whether the orchestrator is Postgres, Temporal, or a cloud queue.
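
    A claim check is simple enough to show end to end. In this sketch a dictionary stands in for object storage, and the reference format, field names, and function names are all made up for illustration.

```python
import hashlib
import json

# Stand-in for object storage; a real system would use a bucket or blob service.
blob_store = {}


def check_in(payload: bytes) -> str:
    """Store a large payload outside the workflow table and return its reference."""
    ref = "blob:" + hashlib.sha256(payload).hexdigest()
    blob_store[ref] = payload
    return ref


def step_state(payload: bytes) -> str:
    """Workflow state that carries only the reference and size, not the bytes."""
    return json.dumps({"payload_ref": check_in(payload), "size": len(payload)})


def check_out(state: str) -> bytes:
    """Resolve the reference back to the payload when a step needs it."""
    return blob_store[json.loads(state)["payload_ref"]]
```

    The workflow row stays small no matter how large the artifact is, and a content hash as the reference makes repeated check-ins of the same payload idempotent.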

    Where Postgres workflows fit

    Postgres workflows fit best when the workflow is part of your application, the steps can be made idempotent, and the team can model retries and checkpoints in SQL without turning the main database into a dumping ground.

    The practical read

    Use this pattern when the workflow is close to your product and your team already knows how to operate Postgres under load. This is a strong fit for internal job pipelines, AI agent tasks, document processing, notification chains, and service-local background work.

    Be more cautious when the workflow spans many teams, languages, approval states, and long-running human processes. A dedicated workflow system may earn its weight there, especially if it gives you mature tooling around versioning, visibility, timeouts, and operator workflows.

    The test is not ideological. Sketch one real workflow. Count the steps. Write down what each step stores, how it retries, what happens after a worker crash, and where large payloads live. If that design fits naturally into Postgres tables and constraints, DBOS’s argument deserves a serious look. If the model starts turning into a private orchestration platform, buy or adopt the platform instead.

    For app builders, the ASO (app store optimization) angle is indirect but real: background reliability is becoming part of product discovery. Users do not search app stores for “durable execution,” but they do notice when uploads, agent runs, and media processing quietly resume instead of failing.
