This is a good talk. Really gets into the details of how things differ from the classical SaaS or consumer product.
I've been doing reliability for most of my career, and have always been able to hide behind, "We're not a bank, if we lose a few requests it doesn't matter". They can't do that. :)
One advantage that they have is that the market closes, so they can do maintenance that takes the whole system down, but when you're running a global consumer product, it's a lot harder to do that without pushback.
So for most of us, our stress is around zero-downtime maintenance, and theirs is around never dropping a request when the system is live.
Not sure what the practical difference is (24/7 vs ~10/5) except for the convenience when planning data migrations if you have regularly planned downtime.
For most code changes, being turned off at night isn't much of an advantage, as the new code will need to go live at some point and that point is where the risk is. For 24/7 systems you simply need a copy of your production environment to test on, a.k.a. staging.
The main thing about 24/7 is needing follow-the-sun SRE and/or out-of-hours on-call.
There’s a move now towards 24/7 trading. I guess we’ll see how the rigors of the trading environment mesh with zero downtime. I’m sure the rollout will be slow and steady.
Yeah, I work on systems with reliability requirements like this at a large bank.
There are multiple layers of controls and manual interventions which, while absolutely painful, slow, expensive and shitstorm-conjuring, are ultimately the final authority on some failures.
E.g., in payments, every single settlement or clearing anomaly is looked at by a real human, and rectified/rebooked manually.
So, yeah, the stakes can be really high when you have a couple billion in memory on your server, but it's just a system.
And it will fail, and we plan for it to do so.