The usual path for an engineer is to take a complex, slow system and reengineer it into something simple, fast, and wrong. But as far as I can tell from the description in the blog, this one actually works at scale! It feels like a free lunch and I'm wondering what the tradeoff is.
the tradeoff is in the failure boundary. CAS on object storage gets you atomic single-object writes, but if you need to update two objects (e.g. dequeue + update a processing log), you're back to application-level coordination. works great while your queue fits in one file's worth of CAS semantics; starts hurting when you need multi-object atomicity.
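to make the boundary concrete, here's a toy sketch of what single-object CAS buys you: a fake in-memory store with an If-Match-style conditional put (all names made up, not the blog's actual API). a CAS retry loop makes one object's updates linearizable under concurrency, but nothing here helps you update two objects atomically.

```python
import threading

class ObjectStore:
    """Toy object store with S3-style conditional writes.

    put_if_match() succeeds only if the caller's expected version
    matches the stored version, mimicking an If-Match/ETag precondition.
    """
    def __init__(self):
        self._lock = threading.Lock()          # stands in for the store's internal atomicity
        self._objects = {}                      # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_match(self, key, expected_version, value):
        with self._lock:
            current_version, _ = self._objects.get(key, (0, None))
            if current_version != expected_version:
                return False                    # precondition failed: someone wrote first
            self._objects[key] = (current_version + 1, value)
            return True

def cas_update(store, key, fn):
    """Atomically apply fn to a single object via a CAS retry loop."""
    while True:
        version, value = store.get(key)
        if store.put_if_match(key, version, fn(value)):
            return

# single-object updates are safe under concurrency: no lost increments
store = ObjectStore()

def worker():
    for _ in range(1000):
        cas_update(store, "counter", lambda v: (v or 0) + 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.get("counter")[1])  # 4000
```

note there's no `put_if_match` spanning two keys: a dequeue that must also append to a processing log needs its own coordination protocol on top, which is exactly where the simplicity runs out.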
Write amplification >9000 mostly
It seems like an approach that trades off scale and performance for operational simplicity. They say they only have 1GB of records and can use a single committer to handle all requests. Failover happens by missing a compare-and-set, so there's probably a second or so of latency for a new leader to take over?
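That failover latency can be sketched with a toy lease object updated via CAS (this is my guess at the mechanism, not the blog's actual code): the committer holds a lease by periodically CAS-ing its id in, and a standby only steals the lease after the expiry passes, so the system is leaderless for roughly one lease interval.

```python
class LeaseStore:
    """Toy single-object lease updated with compare-and-set."""
    def __init__(self):
        self.version = 0
        self.lease = None                 # (holder_id, expires_at)

    def cas(self, expected_version, holder, expires_at):
        if self.version != expected_version:
            return False                  # lost the race to another writer
        self.version += 1
        self.lease = (holder, expires_at)
        return True

def try_acquire(store, node, now, lease_ttl=1.0):
    """Become (or stay) leader; respect a live lease held by someone else."""
    version, lease = store.version, store.lease
    if lease is not None:
        holder, expires_at = lease
        if holder != node and now < expires_at:
            return False                  # a live lease blocks takeover
    return store.cas(version, node, now + lease_ttl)

store = LeaseStore()
assert try_acquire(store, "A", now=0.0)       # A becomes leader, lease to t=1.0
assert not try_acquire(store, "B", now=0.5)   # lease still live: B must wait
# A crashes and stops renewing; B wins only once the lease has expired,
# so the queue sits leaderless for up to one lease interval.
assert try_acquire(store, "B", now=1.5)
print(store.lease)  # ('B', 2.5)
```

With a ~1s lease you'd expect failover latency on that order, which matches the "second of latency to become leader" guess above.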
This is not to say it's a bad system, but it's very precisely tailored to their needs. The original Kafka implementation, for instance, was also very simple and targeted. As you bolt on more use cases and features, you lose that simplicity in trying to become all things to all people.