Building durable workflows on Postgres

336 points • by KraftyOne • yesterday at 6:41 PM • 138 comments

Comments

This feels like the sort of architecture that starts clean and then gradually grows most of the things a workflow-native system already has. I've seen systems like this, seen companies that are built out of this idea, and built small systems like this over time.

Once you need retries, backoff, timeouts, cancellation, versioning, visibility, task routing, rate limits, leases, heartbeats, stuck-worker detection, replay/debugging semantics, workflow migration, fanout/fanin, long timers, audit trails, and operator tooling, the “just use a database” story becomes “build a poor copy of a workflow engine plus a bunch of workers” pretty quickly.

That may still be a good tradeoff for many applications, especially if Postgres is already the core operational dependency. But the comparison shouldn’t be “database vs overcomplicated orchestrator.” It’s more like “what complexity do you want to own, and what do you want to buy / offload to a professional system?”

llimllib • yesterday at 7:36 PM

Armin Ronacher's `absurd` is an implementation of durable workflows for postgres:

https://lucumr.pocoo.org/2025/11/3/absurd-workflows/

https://github.com/earendil-works/absurd

https://earendil-works.github.io/absurd/

I've not used it, but it's worth comparing to other options

saxenaabhi • yesterday at 9:00 PM

As someone who uses dbos.dev, restate.dev, and cf workflows, here is a snippet from our Agents.md:

  Restate.dev:
    for payment integrations on northflank, since it's faster than cf workflows, independent of cf and its downtime, self-hostable, and free of vendor lock-in.
  Cloudflare workflows:
    for non-critical stuff like csv/pdf report generation, since it's very cheap.
  DBOS.dev:
    for workflows that need atomic messaging tied to a postgres db transaction for 100% reliability/durability (for example, populating a materialized row or sending out a critical email/push to a merchant).

DBOS and Restate are similar on the surface, but Restate requires a central "orchestrator", which has pros and cons but makes it easy to build with serverless workers on cf/vercel.

It also has VirtualObject which is a nice vendor-lock-in-free OSS alternative to CF's single threaded DurableObject.

Where DBOS absolutely shines is

1) Atomic messaging in the same db tx as your business logic via dbos.enqueue_workflow! This is often the most brittle part of any solution, and doing it atomically and durably in the same tx that ran your business logic drastically reduces complexity.

2) Since DBOS stores workflow state in the db, it should be easy to build a dashboard for observability with metabase/looker (I wish restate exposed its rocksdb instance so it could be hooked up to metabase).

throwaw12 • yesterday at 6:57 PM

Curious to hear the experience of people using DBOS and Temporal.

I have used Temporal in the past and it works really well. My only problem with it was the limits on request payload and event sizes, which created some inconveniences for us when building solutions. It also enforces good engineering practices, but sometimes you don't want to write special logic so that, if your CSV file is larger than 2 MB, you upload it to S3, pass a link, and then download it in the workflow.

What is your experience with DBOS? How does it compare to Temporal in terms of operational complexity, feature parity, and anything else?

timwis • today at 11:46 AM

Rails has several database-backed job backends, but the convention is always to make jobs do one thing, and ideally be very short-lived. This makes building workflows a bit contrived: we end up enqueuing the second job on the last line of the first one, enqueuing the third one on the last line of the second one, etc. The job backend treats these as independent jobs rather than showing them as a connected workflow, and you have to read through a bunch of job classes to wrap your head around the workflow at even a high level.

Rails recently introduced a 'continuable' concept, allowing you to checkpoint and resume steps within a job, but it still feels like the convention is to keep jobs with a single responsibility, so it feels odd to use them for true workflows.

Has anyone else experienced this or found a solution to it?

opiniateddev • yesterday at 7:22 PM

Conductor OSS does this quite well https://docs.conductor-oss.org/devguide/ai/index.html

https://github.com/agentspan-ai/agentspan, which is essentially an agentic SDK layer for Conductor, can take any of your LangGraph, OpenAI, Vercel, or ADK agents, make it durable, and add orchestration with no code changes.

stuartaxelowen • yesterday at 8:01 PM

My dream is, instead of separating data storage, state machines, valid state constraints, and the logic that transitions between valid states, we can actually unify these into some kernel of app state. Honestly, Postgres already has a lot of these capabilities, but I don’t see an obvious story on the app or product level, providing provably correct sets of states that apps can transition between, and which they can automatically expose to clients in informative ways (this user can like this post, but not edit). It looks colored Petri net shaped to me, but I don’t yet see a simple app state paradigm in the same way that the database has obvious successful boundaries.

vrm • yesterday at 6:59 PM

Since DBOS doesn't support Rust, we implemented a very minimal Rust version of this at https://github.com/tensorzero/durable. It has been quite stable and extensible but of course you need to be very careful with the SQL implementations. Hope this is interesting to readers here.

pragma_x • yesterday at 8:26 PM

I completely get the concept and agree - this is a great way to build this kind of durability in a workflow system.

That said, my gamer-brain wants to call this "Save-scumming at scale." Which is to say, a lot of people already know that this approach works, but maybe they haven't made the connection to abstract CS stuff.

Another strategy that can be used to build robustness is to build your workflow out of idempotent operations. That can be useful for situations where the workflow state is too large to back up. Instead, you just run the job from the top and it's a bunch of no-ops until you start making progress again.
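
A minimal sketch of that "run it from the top" strategy, assuming a hypothetical order workflow whose state lives in a plain dict standing in for durable storage. Each step inspects the current state and returns without doing anything if its effect is already present, so no separate checkpoint log is needed.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical durable state: order_id -> {"status": ...}
World = Dict[str, Dict[str, str]]


def ensure_created(world: World, order_id: str) -> bool:
    if order_id in world:
        return False  # already done: no-op on replay
    world[order_id] = {"status": "created"}
    return True


def ensure_paid(world: World, order_id: str) -> bool:
    if world[order_id]["status"] != "created":
        return False  # already paid (or further along)
    world[order_id]["status"] = "paid"
    return True


def ensure_shipped(world: World, order_id: str) -> bool:
    if world[order_id]["status"] != "paid":
        return False  # already shipped
    world[order_id]["status"] = "shipped"
    return True


STEPS: List[Callable[[World, str], bool]] = [ensure_created, ensure_paid, ensure_shipped]


def run_from_top(world: World, order_id: str,
                 crash_after: Optional[str] = None) -> List[str]:
    """Run every step in order; return the names of steps that changed state."""
    changed = []
    for step in STEPS:
        if step(world, order_id):
            changed.append(step.__name__)
        if step.__name__ == crash_after:
            raise RuntimeError(f"simulated crash after {step.__name__}")
    return changed
```

After a crash following `ensure_paid`, calling `run_from_top` again skips the first two steps and only performs `ensure_shipped`; a third call does nothing at all.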

sgt • yesterday at 6:51 PM

Continuously amazed by what you can do with few tools, as long as Postgres is a part of your toolkit.

I recently developed a distributed queue and it works really great - benchmarks great too, with no race conditions or conflicts. I used SKIP LOCKED so that workers can compete safely.

You can also have multiple workers across nodes avoid conflicts by using session-wide mutexes, i.e. pg advisory locks.
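
A rough sketch of both techniques, assuming a hypothetical `jobs` table with `id`, `status`, and `payload` columns and a psycopg-style DB-API connection (the `%s` placeholder style is an assumption about the driver). `FOR UPDATE SKIP LOCKED` and `pg_try_advisory_lock` are standard Postgres features; the table layout and function names are illustrative only.

```python
# Claim one pending job. Rows locked by other workers are skipped instead of
# waited on, so competing workers never block on or double-claim the same row.
CLAIM_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload
"""


def claim_next_job(conn):
    """Return (id, payload) for a newly claimed job, or None if nothing is pending."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


def try_acquire_session_lock(conn, lock_key: int) -> bool:
    """Try to take a session-level advisory lock without blocking.

    The lock survives the commit below and is released when the session ends
    or when pg_advisory_unlock(lock_key) is called.
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_key,))
        acquired = bool(cur.fetchone()[0])
        conn.commit()
        return acquired
    finally:
        cur.close()
```

A worker would call `claim_next_job` in a loop and mark the row finished when done. The advisory lock suits singleton duties such as a scheduler that should run on only one node at a time.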

rossjudson • today at 3:05 AM

This is an excellent pattern; do as much as you can in the database.

External Spanner provides change streams. Internal Spanner is different, mostly because of the extreme scaling requirements in some cases (and a healthy dose of "because it already works" mixed with "arbitrary change streams are scary").

Internal Spanner allows any transaction to write queue entries, where queues are (more or less) tables with some special time awareness. You can schedule delivery. Entries get pushed from queues to a handler which can also do writes to the DB within the dequeue transaction. And all of the same scaling is there.

banditelol • today at 2:42 AM

For an interesting alternative to Postgres as a queue (actually more like a Kafka log), I like what pgque does: https://github.com/NikolayS/pgque. Rather than using SELECT ... FOR UPDATE and similar semantics, it uses snapshots and table truncation to reduce bloat. I haven't used it for my day job, but the different approach is refreshing to see and interesting for a different set of system trade-offs.

rkeene2 • yesterday at 11:10 PM

I have an implementation I use that has multiple drivers (PostgreSQL, Firestore, SQLite3, just a file, Redis, or an in-memory store) written in TypeScript and it's been working well for my low-scale needs. The interfaces could support interfacing with a dedicated queuing system if you needed to migrate over time.

It supports pipelines, batched pipelines, and basic runners, as well as idempotent keys (including batching them). It also lets you "partition" a queue into multiple sub-queues so that you can easily segregate your jobs within your application without a lot of setup on the outside. For example, you create a root queue talking to PostgreSQL and pass it around to subsystems that then each create their own sub-queue off that to enqueue entries into and their own workers that dequeue them.

It's only used internally right now but I've been thinking about creating a separate package (with documentation) with it for others to use as well. Any feedback or pull requests would be appreciated!

[0] https://github.com/KeetaNetwork/anchor/blob/main/src/lib/que...

[1] https://github.com/KeetaNetwork/anchor/blob/main/src/lib/que...

aryehof • today at 5:23 AM

My fear is that durable workflows are increasingly being seen as required for everything, because we need to solve the distributed transaction problem in a micro-services world.

It calls into question the initial wisdom of creating lots of little independent distributed apps without regard to the interaction between them. Let’s build ever more necessary plumbing and schemes just to enable their interaction.

I am arguing that durable workflows should be a last resort for boundaries you must cross, not a default pattern for every business process.

buremba • yesterday at 7:39 PM

All you need is Postgres until you scale into TBs of data. We use PostgreSQL as a durable workflow engine, vector search, time-series data, BM25 search, OLTP/OLAP engine, and a queue. It's basically the only dependency we have for https://lobu.ai

The main benefit is centralizing all the data in one place so we don't need to worry about copying data between multiple systems. Once something becomes the bottleneck, you can eventually migrate to a purpose-specific tool to scale out. To be honest, LISTEN/NOTIFY in my opinion is the most fragile part of PG, but it's fine as a start until you scale out.

jgraettinger1 • yesterday at 9:44 PM

At Estuary, we have an in-house Rust crate [1] for building scale-out durable actors / FSMs in Postgres. It powers all async activity in our control plane -- slews of fine-grain scheduled actions, complex change propagation through data-flow topologies, reliable alert and email delivery, and more -- at hundreds to thousands of state transitions per second (today). It's been a wonderful pattern to build on, and is all of three source files.

Here's an example computing a Fibonacci sequence (very inefficiently, with lots of spawned sub-tasks and message passing) [2]

[1] https://github.com/estuary/flow/tree/master/crates/automatio...

[2] https://github.com/estuary/flow/blob/master/crates/automatio...

senderista • yesterday at 6:51 PM

Citing CockroachDB as an example of scaling Postgres made me spit out coffee. Was this LLM-written?

grahac • yesterday at 7:54 PM

Isn't this just Oban from Elixir? :)

switchbak • yesterday at 6:58 PM

Having inherited a few of these - you tend to home-grow an ad-hoc version of many of the existing OSS tools, but with fewer of the patterns baked in.

Not sure where the NIH ends and where you're actually better off with a supported orchestration approach. I suppose if you expect your program to be around a while (or need advanced features), maybe think about using something a bit more battle tested?

nzoschke • yesterday at 9:59 PM

I do love Postgres and DBOS.

I also recently started experimenting with https://github.com/earendil-works/absurd which is also Postgres and even simpler than DBOS. Their comparison is a great read:

https://earendil-works.github.io/absurd/comparison/

But for operational reasons I've started using sqlite for durable workflows instead. Porting the database concepts from either DBOS or absurd PG to SQLite is remarkably easy these days. A small polling loop instead of notify/listen feels fine for smaller workloads.
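
For illustration, a small polling worker along those lines might look like the sketch below. It assumes a hypothetical `jobs` table and uses only the standard-library sqlite3 module; it is not taken from DBOS or absurd. `BEGIN IMMEDIATE` takes the write lock up front so that two workers sharing the same database file cannot claim the same row.

```python
import sqlite3
import time
from typing import Callable, Optional, Tuple


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # transaction boundaries below are exactly the ones written out by hand.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def claim_one(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Atomically move the oldest pending job to 'running' and return it."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def poll(conn: sqlite3.Connection, handle: Callable[[str], None],
         interval_s: float = 0.5, max_idle_polls: Optional[int] = None) -> int:
    """Process jobs until max_idle_polls consecutive empty polls (forever if None)."""
    done = 0
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        job = claim_one(conn)
        if job is None:
            idle += 1
            time.sleep(interval_s)  # nothing to do: wait instead of LISTEN/NOTIFY
            continue
        idle = 0
        handle(job[1])
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
        done += 1
    return done
```

This sketch leaves out recovery of jobs stuck in 'running' after a worker crash; a real implementation would need a lease or timeout for that.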

pirsquare • yesterday at 7:09 PM

I feel it's way too hand-wavy on consistency and correctness. That's my opinion as someone who's implemented marketing workflows that break all the time (with tons of painful lessons).

A strong correctness guarantee is something that should not be undermined. It is even more important than availability.

The examples on the website are simple but heavily understate the importance of correctness. Anyone who implements similar pseudo-code directly will eventually suffer from data correctness issues after crashes.

  @DBOS.workflow()
  def checkout_workflow(items: Items):
      order = create_order()
      reserve_inventory(order, items)
      payment_status = process_payment(order, items)

      if payment_status == 'paid':
          fulfill_order(order)
      else:
          undo_reserve_inventory(order, items)
          cancel_order(order)

munk-a • yesterday at 7:29 PM

We have a durable queue built into postgres to handle some complex notification-ish logic. It's worked excellently, and while there are services various cloud providers would love to sell us to do that, it's extremely cheap to run.

For that particular usage, the volume we process and business criticality make it a good choice for inventing here - but for other durable processes we just use off the shelf tools since the cost of maintenance would quickly outstrip the value.

Postgres is a great tool to use and far more powerful than most people give it credit for - but there's always the balance of in-house maintenance vs. paying rent for someone else's solution.

halamadrid • yesterday at 8:24 PM

We work on a disk-log-based architecture for workflows at Unmeshed (https://unmeshed.io/), which helps it scale at a fraction of the cost of traditional workflow systems that are based on expensive databases.

Postgres is not cheap to run in the cloud at scale. We went for the cheapest infra, which is basically the disk storage.

hedora • yesterday at 10:47 PM

I want to dig into this "free" workflow_error.sql. I'll assume 1024 byte workflow job descriptors, and the article's steady state of 10,000 jobs per second.

Possibility one: There is one index on the table, and it is the created_at TS. This query has to scan 10,000 jobs/sec * 60 seconds * 60 minutes * 24 hours * 31 days * 1024 bytes / job = 25,543 GB.

A KV store would scan exactly that much.

Possibility two: The primary key is refined to (state, timestamp). Assume a 1% failure rate. Now, we "only" scan and return 255 GB. A key value store would scan exactly that much. (This is probably the right physical design).

Possibility three: The primary key is (timestamp), and there's a secondary index on state. I guess we do an index join, where one side of the join is 25,543 GB, and the other side is one unsorted bucket with 255GB * number of months the system has been in operation in it.

A KV store wouldn't let you express that.

Now, what other ad hoc queries are we supposed to efficiently support over a one month lookback? Also, what does PG do if you tell it to scan 25TB at the same time as it's inserting 10MB/sec at 10K TPS? How is vacuuming configured?
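
The back-of-the-envelope numbers above can be reproduced with the short calculation below, using the stated assumptions (1024-byte descriptors, 10,000 jobs per second, a 31-day month, a 1% failure rate). The variable names are illustrative. Note that the 25,543 figure comes out in binary gigabytes (GiB); the same volume is roughly 27.4 TB in decimal units.

```python
JOBS_PER_SECOND = 10_000
JOB_BYTES = 1024
SECONDS_PER_MONTH = 60 * 60 * 24 * 31   # 31-day month
FAILURE_RATE = 0.01

GIB = 1024 ** 3
TB = 1000 ** 4

jobs_per_month = JOBS_PER_SECOND * SECONDS_PER_MONTH
bytes_per_month = jobs_per_month * JOB_BYTES

full_scan_gib = bytes_per_month / GIB              # possibility one: scan everything
failed_scan_gib = full_scan_gib * FAILURE_RATE     # possibility two: only failed jobs
full_scan_tb = bytes_per_month / TB
ingest_mb_per_s = JOBS_PER_SECOND * JOB_BYTES / 1_000_000

print(f"jobs per month:      {jobs_per_month:,}")              # 26,784,000,000
print(f"full scan:           {full_scan_gib:,.0f} GiB")        # 25,543 GiB
print(f"full scan (decimal): {full_scan_tb:.1f} TB")           # 27.4 TB
print(f"failed-only scan:    {failed_scan_gib:,.0f} GiB")      # 255 GiB
print(f"ingest rate:         {ingest_mb_per_s:.2f} MB/s")      # 10.24 MB/s
```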

magicseth • yesterday at 7:25 PM

Convex has a workpool component that gives you the ability to compose big complicated flows in an understandable way, and gives you realtime updates on the status of various pieces: https://www.convex.dev/components/workflow

farsa • yesterday at 11:34 PM

Making the workflow engine of DBOS depend on the paid component (Conductor) for scaling and recovery makes it a no-go. River also has "traps" like not supporting DLQ, which is a paid feature.

thesmart • yesterday at 11:29 PM

So we're back to distributed queues on PostgreSQL circa 2006...

iwwff • yesterday at 9:35 PM

Every time, I am surprised to see the promise of cheap durability without any mention of the cost of running durable Postgres, which might not be easy or cheap.

elliot07 • yesterday at 7:35 PM

How does this compare to Hatchet?

hbarka • yesterday at 7:09 PM

How do you incorporate secrets in this kind of implementation? Stored in db?

Thaxll • today at 1:20 AM

All those solutions based on PG are missing the point: you need a good SDK so that devs can create those workflows without re-inventing the wheel: errors, retries, observability, idempotency, etc.

PG is just a detail of implementation, you need a good library to build reliable flows.

llmslave • yesterday at 7:15 PM

Temporal is an insane piece of software; I'm always surprised people don't know about it. You could replace almost your whole AWS stack with Temporal.

rafael-lua • yesterday at 8:24 PM

The "everything can be done in Postgres" crowd is crazy. It is like a religion at this point.

cpursley • yesterday at 7:03 PM

PgFlow is pretty awesome for DAG workflows - it's built on pgmq (which does the heavy lifting, making it backend agnostic).

Typescript: https://www.pgflow.dev

Elixir: https://github.com/agoodway/pgflow/blob/main/docs/COMPARISON...

epolanski • yesterday at 11:12 PM

I don't get how any of the points made in this blog post would not work if you replaced postgres with MySQL or cosmosdb.

In any case, there can be more to durable workflows than just saving the current step, and not all intermediate steps are serializable, so I don't see where the postgres magic is that more mature solutions don't have.

OutOfHere • yesterday at 7:44 PM

I am not convinced that using special software for "durable workflows" is necessary. If one has a stateful message queue or job task queue, e.g. RabbitMQ or Celery, one can use it. Regardless, many jobs can be made idempotent. The most that you ought to residually need is a column in an existing table of your own database which keeps track of what remains to be done.

Given the above, it would seem that durable workflow software is pushed forward by those who have a surplus of VC money to spend. As for the vendors, there is no shortage of people trying to sell you things that you don't need.

doginasuit • yesterday at 9:24 PM

I have only used two databases, SQLite and Postgres, depending on if the database needs to be bundled with the application. They both feel like magic. Even though I am not a religious person, I recognize the value in acknowledging a higher power.
