All three hyperscalers have vulnerabilities in their control planes: they're either a single point of failure (AWS with us-east-1) or global, meaning a faulty release can take the whole thing down. And take AZ resilience to mean that existing compute will keep working as before, but allocation of new resources can fail in multi-AZ or even multi-region ways.
It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
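For the static-allocation point, here's a rough sketch of what "never relies on auto scaling" can look like on AWS, using boto3 (the group name, peak numbers, and slack factor here are made up for illustration): size the group for peak plus headroom, then pin min/max/desired to the same value so nothing ever waits on the control plane to allocate capacity.

```python
# Sketch only: pin an Auto Scaling group so it never depends on scaling
# actions during a control plane outage. Group name and numbers are
# hypothetical; size them from your own peak-load measurements.
import math
import boto3

PEAK_INSTANCES = 12   # instances needed at observed peak load
SLACK_FACTOR = 1.5    # headroom so bursts never require new allocation

static_size = math.ceil(PEAK_INSTANCES * SLACK_FACTOR)

autoscaling = boto3.client("autoscaling")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-service-asg",  # hypothetical group name
    MinSize=static_size,
    MaxSize=static_size,
    DesiredCapacity=static_size,
)
```

Pinning min == max == desired means scaling activities can't change the fleet size, so if allocation breaks you're left exactly where you were, with the slack already provisioned.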
> It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
In a way. It means that you can usually get new capacity, but the transition windows where a service gets resized (or mutated in general) have to be minimised and carefully controlled by ops.
Yeah, I remember one maybe four years ago? Existing workloads were fine, but I had to go and tell my marketing department not to do anything until it was sorted because auto-scaling was busted.