Hacker News

dlenski, today at 3:15 AM (7 replies)

The idea that AWS's services are fully regionalized or isolated has always been a myth.

All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.

And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.


Replies

Roark66, today at 9:42 AM

Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.

But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you behind things like Terraform) suddenly lose their luster.

myroon5, today at 11:13 AM

> outside of China

[Nitpick] There are a few more AWS partitions like GovCloud:

https://jasonbutz.info/2023/07/aws-partitions/

master_crab, today at 11:53 AM

IAM isn’t even really the most painful dependency. Route53 is. Its control plane only runs out of us-east-1 (use1).

Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.

zaphirplane, today at 9:02 AM

Services outside of us-east-1 don’t call us-east-1 for the IAM data plane though, right?

sidewndr46, today at 3:19 AM

Isn't this kind of circular dependency what led to extended downtime a while back?

stephenr, today at 5:07 AM

> And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

When you dogfood your own Rube Goldberg machine.

jmsgwd, today at 11:39 AM

> The idea that AWS's services are fully regionalized or isolated has always been a myth.

This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]

The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.

Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).
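
To make that concrete, here is a rough sketch of one way a workload might avoid depending on a control plane at runtime: keep the last known-good result and keep serving it when a refresh fails. This is only an illustration under my own assumptions, not code from the AWS whitepaper. The class name `LastKnownGoodConfig` and the `fetch` callable are hypothetical stand-ins for whatever control plane call a workload would otherwise make.

```python
from typing import Callable


class LastKnownGoodConfig:
    """Hypothetical helper: serve the last known-good config if a refresh fails."""

    def __init__(self, fetch: Callable[[], dict], initial: dict) -> None:
        # `fetch` stands in for a control plane call (an assumption for this sketch).
        self._fetch = fetch
        # Seed with a value captured at deploy time so startup needs no control plane.
        self._last_good = dict(initial)

    def refresh(self) -> bool:
        """Try to update from the control plane; report whether it succeeded."""
        try:
            self._last_good = dict(self._fetch())
            return True
        except Exception:
            # Control plane impaired: keep operating on the existing state.
            return False

    def get(self) -> dict:
        """Return a copy of the current config without calling any remote API."""
        return dict(self._last_good)
```

The point of the design is that `get()` never touches the remote API, so the request path keeps working in its current state even while `refresh()` is failing.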

> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.

If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.
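
As a minimal sketch of what "generating temporary credentials within the region" can look like, assuming Python with the third-party `boto3` SDK installed: the client is pinned to a regional STS endpoint rather than the legacy global one. The helper names are my own, the endpoint format `https://sts.<region>.amazonaws.com` is assumed to apply to the standard `aws` partition only (other partitions use different domains), and the role ARN you pass in must refer to a role that already exists.

```python
def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for the standard `aws` partition."""
    return f"https://sts.{region}.amazonaws.com"


def assume_role_regionally(role_arn: str, session_name: str, region: str) -> dict:
    """Request temporary credentials for an existing role via the regional endpoint.

    Hypothetical helper for illustration; requires boto3 and valid base credentials.
    """
    import boto3  # third-party; imported here so the URL helper works without it

    sts = boto3.client(
        "sts",
        region_name=region,
        endpoint_url=regional_sts_endpoint(region),
    )
    response = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    # Contains AccessKeyId, SecretAccessKey, SessionToken, and Expiration.
    return response["Credentials"]
```

Calling `assume_role_regionally(...)` with the workload's own region keeps the token request inside that region, which is the data plane path described above. Creating the role in the first place is still a control plane operation.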

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."

[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."

[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."