It seems like this is an approach that trades off scale and performance for operational simplicity. They say they only have 1 GB of records and can use a single committer to handle all requests. Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?
This is not to say it's a bad system, but it's very precisely tailored to their needs. The original Kafka implementation, for instance, was also very simple and targeted. As you bolt on more use cases and features, you lose that simplicity in trying to become all things to all people.
> Failover happens by missing a compare-and-set so there's probably a second of latency to become leader?
Conceptually that makes sense. How complicated is it to implement this failover logic safely? If two processes are competing for CAS wins, is there not a risk that both will conclude they're non-leaders and terminate themselves?
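To make the question concrete, here's a minimal sketch of how I picture the race. This is not their implementation: `FakeObjectStore`, `Lease`, and `try_become_leader` are hypothetical names, and the store is an in-memory stand-in for an object store with conditional writes. Lease expiry and heartbeats are left out entirely.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lease:
    holder: str  # name of the process claiming leadership
    epoch: int   # incremented on every successful claim


class FakeObjectStore:
    """In-memory stand-in for one object with conditional writes (assumption)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[Lease] = None
        self._version = 0

    def read(self) -> Tuple[Optional[Lease], int]:
        with self._lock:
            return self._value, self._version

    def compare_and_set(self, expected_version: int, new_value: Lease) -> bool:
        # Succeeds only if nobody has written since expected_version was read.
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def try_become_leader(store: FakeObjectStore, name: str) -> bool:
    """One claim attempt: read the current lease, then CAS a new one over it."""
    seen, version = store.read()
    next_epoch = (seen.epoch if seen else 0) + 1
    return store.compare_and_set(version, Lease(name, next_epoch))


store = FakeObjectStore()

# Simulate the race: both processes read the same version before either writes.
_, version_a = store.read()
_, version_b = store.read()
a_won = store.compare_and_set(version_a, Lease("a", 1))
b_won = store.compare_and_set(version_b, Lease("b", 1))
```

In this sketch, two writers racing at the same version get exactly one winner, so "both lose" can't come out of that race by itself. A process can still lose to a third writer, though, so presumably the loser should re-read and decide what to do next rather than terminate outright. Whether the real system works like this is exactly what I'm asking.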
(author here)
> It seems like this is an approach that trades off scale and performance for operational simplicity.
Yes, this is exactly it. Given that turbopuffer itself is built on the idea of object storage + stateless cache, we're all very comfortable dealing with it operationally. This design is enough for our needs and is much easier to be on call for than an entirely new dependency would have been.