Can someone that's worked at one of these big companies honestly explain how it happens that when these guys are down, it's never for like 10-15 mins ... it's always 1-2+ hours? Do they not have mechanisms in place to revert their migrations and deployments? What goes on behind the scenes during these "outages"?
Quick fixes have a tendency to break other stuff and just make matters worse. Better to leave it offline a little longer, fix the actual root issue, and make sure it comes back online cleanly. If the issue were just a quirk in a recent deployment, it could probably be reverted easily on the endpoints where it was just deployed (I'm sure they're using staggered roll-outs). Outages this long are probably not caused by a recent release.
You will run into thundering-herd/hotspotting/cold-cache issues when you have to restart. There's generally no easy way to switch these sorts of systems on and off, especially a relatively new system that isn't battle-hardened.
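To illustrate the thundering-herd part: when a service comes back up, every client that was failing tends to retry at the same instant and knock it right back over. The standard mitigation is exponential backoff with jitter. This is a minimal sketch (the `call_service` callable and retry limits are hypothetical, not from any specific company's stack):

```python
import random
import time

def backoff_with_jitter(attempt, base=0.5, cap=60.0):
    """'Full jitter' backoff: sleep a random amount up to an exponentially
    growing cap, so restarted clients don't all retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(call_service, max_attempts=5):
    """Hypothetical client-side retry loop around some call_service()."""
    for attempt in range(max_attempts):
        try:
            return call_service()
        except ConnectionError:
            # Each client sleeps a different random duration, spreading
            # the retry load out instead of hammering the recovering service.
            time.sleep(backoff_with_jitter(attempt))
    raise RuntimeError("service still unavailable after retries")
```

Without the jitter, the sleeps are deterministic and the herd just arrives in synchronized waves instead of one big wave.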
I got nothing for the GitHub outages this year though, that seems like incompetence.
Well when the coding agents go down who are they supposed to ask what the problem is?
They should probably buy subscriptions to those Chinese agents.
Part of it is observability bias: longer, more widespread outages are more likely to draw significant attention. This doesn't mean there aren't also shorter, smaller-scope outages; we're just much less likely to hear about them.
For example, if there's a problem that gets caught at the 1% stage of a staged rollout, we're probably not going to find ourselves discussing it on HN.
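For context on what a "1% stage" means mechanically: staged rollouts typically bucket users deterministically, so only a fixed slice sees the new code path and stays in that slice as the percentage ramps up. A minimal sketch of one common approach, stable hash bucketing (the function name and per-feature salt are illustrative, not any vendor's actual API):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically map a user to a bucket in [0, 100) using a stable
    hash salted by the feature name, then compare against the rollout
    percentage. The same users stay enrolled as the percentage grows."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < percent
```

At `percent=1.0`, roughly 1 in 100 users hits the new code path, so a bug caught at that stage never touches most of the user base and rarely makes enough noise to reach a front page.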