That's how we got 99.99% at Netflix. And it cost a lot of money. But a canary implies that something may go wrong and you have to roll back. The canary is still production traffic, so some transactions would fail, which isn't allowed for this kind of workload.
I imagine you'd have to use shadow execution, where you roll out a full second copy, run every transaction through both, and compare the results. Then, only after a certain time, you switch traffic to the new infra and tear down the old.
But you would need a ton of extra hardware (more than double) and a lot of ways to keep data in sync. And of course if you put an LLM or other non-deterministic system in there, that's a whole other can of worms.
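To make the shadow idea concrete, here's a rough sketch in Python. Everything in it is made up for illustration (`ShadowRunner`, `primary`, `shadow`, `safe_to_cut_over` are hypothetical names, not anything from a real system), and it assumes deterministic handlers whose results can be compared with `==`. It skips the hard parts mentioned above, like keeping data in sync and paying for the second copy.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ShadowRunner:
    """Runs each transaction through both stacks; only the primary result is served."""
    primary: Callable[[dict], Any]   # current production handler (hypothetical)
    shadow: Callable[[dict], Any]    # candidate handler under test (hypothetical)
    total: int = 0
    mismatches: list = field(default_factory=list)

    def handle(self, txn: dict) -> Any:
        # The caller always gets the primary result, so a shadow bug
        # can never fail a real transaction.
        result = self.primary(txn)
        self.total += 1
        try:
            # Pass a copy so the shadow cannot mutate the caller's input.
            shadow_result = self.shadow(dict(txn))
        except Exception as exc:
            # A shadow crash is recorded as a mismatch, not raised.
            self.mismatches.append((txn, result, repr(exc)))
            return result
        if shadow_result != result:
            self.mismatches.append((txn, result, shadow_result))
        return result

    def safe_to_cut_over(self, min_samples: int) -> bool:
        # Assumed cutover rule: enough traffic observed and zero divergence.
        return self.total >= min_samples and not self.mismatches


def old_notional(txn: dict) -> int:
    return txn["qty"] * txn["price_cents"]


def new_notional(txn: dict) -> int:
    return txn["price_cents"] * txn["qty"]


runner = ShadowRunner(primary=old_notional, shadow=new_notional)
for qty in range(1, 6):
    runner.handle({"qty": qty, "price_cents": 1250})
```

In this toy run the two handlers agree on all five transactions, so `runner.safe_to_cut_over(5)` would allow the switch. A real cutover rule would presumably be time-based and much stricter, and anything non-deterministic would need a fuzzier comparison than `==`.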
Like I said, a fun problem to solve. :)
Folks who keep the lights on 24/7, aka SREs, are superheroes who wear capes. Thank you for your service.
I couldn’t do it. I like infra and all, but it’s just not my cup of tea. Def true that from a trading POV the trade must be executed. It must settle. It must work. Otherwise capital flight will be huge.