    SQLite durable workflows make a small-stack case for agent infrastructure

    SQLite durable workflows are a bet that many agent systems need reliable state more than they need a heavy orchestration platform on day one. Obelisk argues that a local SQLite database, backed up with Litestream to S3-compatible storage, can be enough for small durable execution systems where losing the newest local writes is acceptable.

    The short version

    • Obelisk’s argument is narrow but useful: keep workflow state close to the runtime, persist an execution log, and replay from history when work resumes.
    • Litestream adds portability by streaming SQLite changes to object storage, but the replication is asynchronous.
    • The pattern fits bursty AI agents, internal automation, prototypes, and tenant-isolated workloads better than large shared systems.
    • Postgres still makes more sense when teams need strong availability, shared writes, mature operations, or a durability model that cannot lose recent local writes.

    SQLite durable workflows in one sentence

    SQLite durable workflows turn a database file into the recovery point for a run, while Litestream makes that file easier to back up and move.

    What happened

    Obelisk published a short piece arguing that SQLite can be enough for a large class of durable workflow systems. The post responds to DBOS’s recent “Postgres is all you need for durable execution” framing and pushes the same idea toward an even smaller database: if the durable part is workflow state, the compute can be disposable.

    The design is simple. An Obelisk server writes workflow progress to SQLite. Workflows can replay from persisted history, and failed activities can be retried. Litestream then streams SQLite changes to S3-compatible object storage for backup, migration, and inspection.
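
    A minimal sketch of that log-and-replay idea, using Python's built-in sqlite3 module. This is not Obelisk's implementation or API: the step_log table, the run_step helper, and the toy workflow are hypothetical names invented for illustration. The sketch assumes each step returns a string and that a step recorded in the log never needs to run again.

```python
import sqlite3
from typing import Callable


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the execution log. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS step_log ("
        " run_id TEXT NOT NULL, step TEXT NOT NULL, result TEXT NOT NULL,"
        " PRIMARY KEY (run_id, step))"
    )
    return db


def run_step(db: sqlite3.Connection, run_id: str, step: str,
             activity: Callable[[], str], max_attempts: int = 3) -> str:
    """Return the logged result if the step already ran, else run it with retries."""
    row = db.execute(
        "SELECT result FROM step_log WHERE run_id = ? AND step = ?",
        (run_id, step),
    ).fetchone()
    if row is not None:
        return row[0]  # replay: reuse persisted history instead of re-executing

    last_error = None
    for _ in range(max_attempts):
        try:
            result = activity()
        except Exception as exc:  # a failed activity is retried
            last_error = exc
            continue
        with db:  # commit the result before the workflow moves on
            db.execute("INSERT INTO step_log VALUES (?, ?, ?)",
                       (run_id, step, result))
        return result
    raise RuntimeError(f"step {step!r} failed after {max_attempts} attempts") from last_error


# Toy workflow: the first fetch attempt fails, the retry succeeds.
calls = {"fetch": 0, "summarize": 0}


def flaky_fetch() -> str:
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise ConnectionError("transient failure")
    return "raw data"


def workflow(db: sqlite3.Connection, run_id: str) -> str:
    data = run_step(db, run_id, "fetch", flaky_fetch)

    def summarize() -> str:
        calls["summarize"] += 1
        return data.upper()

    return run_step(db, run_id, "summarize", summarize)


db = open_log()
first = workflow(db, "run-1")   # executes both steps, retrying fetch once
second = workflow(db, "run-1")  # same run id: both steps come from the log
```

    Running the workflow a second time with the same run id stands in for a restart: neither activity executes again, because both results are already in the log.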

    That last step comes with a caveat. The article is not claiming that SQLite plus Litestream gives you the same behavior as a highly available shared database. Litestream replication is asynchronous, so a restore can miss the newest writes if the local volume disappears before those writes are copied.
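
    The gap between the local file and its replica can be illustrated with a small simulation. This is only a sketch under loud assumptions: it does not use Litestream at all. Python's sqlite3 Connection.backup call stands in for "whatever had been copied at the moment the volume was lost", and the step_log table and step names are hypothetical.

```python
import sqlite3

# The "local volume": the database the workflow server writes to.
primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE step_log (seq INTEGER PRIMARY KEY, step TEXT NOT NULL)"
)


def record(step: str) -> None:
    """Commit one workflow step locally."""
    with primary:
        primary.execute("INSERT INTO step_log (step) VALUES (?)", (step,))


def rows(db: sqlite3.Connection) -> list[tuple[int, str]]:
    return db.execute("SELECT seq, step FROM step_log ORDER BY seq").fetchall()


record("plan")
record("fetch")

# Stand-in for asynchronous replication: a copy taken at one point in time.
# (Assumption: this is a plain snapshot, not how Litestream ships changes.)
replica = sqlite3.connect(":memory:")
primary.backup(replica)

# Writes that commit locally after the last copy.
record("summarize")
record("send_email")

# If the local volume vanished now, a restore would only see the replica.
replicated = {seq for seq, _ in rows(replica)}
lost = [step for seq, step in rows(primary) if seq not in replicated]
```

    Here lost contains the two steps committed after the snapshot. Those are the writes a restore would miss, which is the window the caveat above describes.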

    Why this is worth watching

    SQLite durable workflows are interesting because they match how a lot of agent infrastructure is being built right now: small workers, short spikes of activity, many experiments, and state that is easier to understand when it belongs to one agent or one tenant.

    For that shape, a database file is not a toy. It is a debugging artifact. You can copy it, inspect it locally, replay a run, or move one tenant without dragging a central system into every step. That is different from saying SQLite should replace Postgres everywhere. It is closer to saying that some workflows are naturally partitioned, and those partitions can be operational units.

    The pattern also lines up with a cost question that keeps showing up in developer tools. Before a team adds Temporal, Step Functions, a Postgres-backed workflow engine, or a full control plane, it can ask a smaller question: can the state model survive restarts with SQLite and object storage?

    What Hacker News readers are arguing about

    The Hacker News discussion is useful because it pushes back on the word “durable.” The strongest skeptical camp argues that once Litestream’s asynchronous replication is part of the story, the system may be durable enough for experiments but not durable in the stricter production sense. Several commenters called out the risk of losing the most recent local writes, and one reported replacing Litestream in production over concerns about upgrades and disk usage.

    The builder camp is more sympathetic. A few commenters said they already use SQLite-backed task state for agents or pipelines because it keeps iteration simple. One pattern that came up: ask an agent to plan a DAG, store each task in SQLite, and rerun only the steps that changed. Another practical argument was token cost. Agents can query a row instead of rereading a pile of Markdown or logs.
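
    One way the "rerun only the steps that changed" pattern could look, sketched with Python's sqlite3 and hashlib. None of this comes from the thread: the tasks table, the run_dag helper, and the fingerprint scheme (a hash of a task's spec plus its upstream fingerprints) are assumptions made for illustration, and the task functions are placeholders for real agent or tool calls.

```python
import hashlib
import sqlite3
from typing import Callable

# A task is (dependencies, spec text, function over dependency outputs).
Task = tuple[list[str], str, Callable[[list[str]], str]]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tasks ("
    " name TEXT PRIMARY KEY, fingerprint TEXT NOT NULL, output TEXT NOT NULL)"
)


def run_dag(db: sqlite3.Connection, dag: dict[str, Task]) -> list[str]:
    """Run tasks in dict order (assumed topological); return the names executed."""
    executed: list[str] = []
    fingerprints: dict[str, str] = {}
    outputs: dict[str, str] = {}
    for name, (deps, spec, fn) in dag.items():
        # A task changes if its own spec or anything upstream of it changes.
        digest = hashlib.sha256(spec.encode())
        for dep in deps:
            digest.update(fingerprints[dep].encode())
        fingerprint = digest.hexdigest()
        fingerprints[name] = fingerprint

        row = db.execute(
            "SELECT fingerprint, output FROM tasks WHERE name = ?", (name,)
        ).fetchone()
        if row is not None and row[0] == fingerprint:
            outputs[name] = row[1]  # unchanged: reuse the stored output
            continue

        outputs[name] = fn([outputs[dep] for dep in deps])
        with db:
            db.execute(
                "INSERT OR REPLACE INTO tasks VALUES (?, ?, ?)",
                (name, fingerprint, outputs[name]),
            )
        executed.append(name)
    return executed


def make_dag(summary_prompt: str) -> dict[str, Task]:
    return {
        "fetch": ([], "fetch v1", lambda _: " raw text "),
        "clean": (["fetch"], "clean v1", lambda ins: ins[0].strip()),
        "summarize": (["clean"], summary_prompt,
                      lambda ins: f"{summary_prompt}: {ins[0]}"),
        "report": (["summarize"], "report v1", lambda ins: f"[{ins[0]}]"),
    }


first = run_dag(db, make_dag("short summary"))   # everything runs
second = run_dag(db, make_dag("short summary"))  # nothing changed
third = run_dag(db, make_dag("long summary"))    # only summarize and downstream
```

    Changing the summarize spec reruns summarize and report but leaves fetch and clean untouched. Reading a stored row instead of recomputing or rereading logs is also the shape of the token-cost argument.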

    There was also a familiar SQLite-versus-Postgres fight. Critics argued that SQLite is the wrong tool for concurrent production systems. Supporters answered that many workloads do not need multiple writers across machines, and that strongly partitioned state changes the tradeoff. The thread is not evidence that the architecture is safe. It is a good map of where teams will disagree: recent-write loss, concurrency, operator comfort, and whether a workflow engine is worth the overhead.

    The practical read

    Use SQLite durable workflows when the workflow state is small, naturally partitioned, and valuable to inspect. That describes a lot of AI agent workloads: tool calls, step logs, inputs, outputs, retries, and run history for one tenant or one worker.

    Do not use this pattern as a blanket replacement for Postgres or Temporal. If multiple services need to coordinate writes, if the newest write must survive a node loss, or if operations already depend on database-level replication and failover, a network database or dedicated workflow engine is the safer default.

    A good test is plain: if you can explain exactly which writes may be lost before Litestream catches up, and the product can tolerate that, SQLite plus object storage may keep the stack pleasantly small. If that sentence makes you nervous, it probably should.

    Sources