May 2024 UniSuper incident: https://cloud.google.com/blog/products/infrastructure/detail...
https://www.unisuper.com.au/about-us/media-centre/2024/a-joi...
A joint statement from UniSuper CEO Peter Chun and Google Cloud CEO Thomas Kurian
8 May 2024
UniSuper and Google Cloud understand the disruption to services experienced by members has been extremely frustrating and disappointing. We extend our sincere apologies to all members.
While supporting UniSuper to bring its systems back online, Google Cloud has been conducting a root cause analysis.
Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events, where an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.
This is described as an isolated, “one-of-a-kind occurrence” that has never before occurred with any Google Cloud client globally. This should not have happened. Google Cloud has identified the sequence of events and taken measures to ensure it does not happen again.
Why did the outage last so long?
UniSuper had duplication across two geographies as protection against outages and data loss. However, the deletion of the Private Cloud subscription triggered deletion across both geographies.
Restoring the Private Cloud required significant coordination and effort between UniSuper and Google Cloud, including recovery of hundreds of virtual machines, databases, and applications.
The instant cascading worldwide deletion upon closing or deleting a subscription sounds like a recipe for disaster. Why not mark it for deletion and delete say... a day or a week later?
I wrote about the UniSuper issue at the time: https://danielcompton.net/google-cloud-unisuper. It was a pretty nasty bug where their VMWare environment was created with a one-year expiry date, but was one "resource" from the perspective of Google Cloud.