
Failing over to another AWS region is actually pretty difficult for stateful services, especially if you can't access data in the primary region at all. Most teams probably don't have the bandwidth to solve this problem given the number of outages you see in a year (one or two). Also, this is a problem many teams would be solving at once, so most teams probably just wait and see what leadership has to say about it, and, well, nothing ends up getting done.


A one-day outage in December can be crippling for retail.

I don't doubt that many functions are difficult to fail over, but a bare-bones minimum seems straightforward. For example, evidence of delivery is append-only and only needs to be globally consistent later, after a dispute.
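
To make "append-only" concrete, here's a minimal sketch of what I mean (the local file and the upload hook are hypothetical, not anything Amazon actually runs): record delivery evidence locally, then replay it once the primary region is reachable again.

    import json, time, uuid
    from pathlib import Path

    # Hypothetical local, append-only log of delivery evidence.
    # Each record is self-contained; global consistency is only
    # needed later, e.g. when a dispute comes in.
    LOG = Path("delivery_evidence.ndjson")

    def record_delivery(package_id: str, photo_ref: str) -> str:
        event = {
            "event_id": str(uuid.uuid4()),   # unique id makes replay idempotent
            "package_id": package_id,
            "photo_ref": photo_ref,
            "recorded_at": time.time(),
        }
        # Append locally even if the primary region is unreachable.
        with LOG.open("a") as f:
            f.write(json.dumps(event) + "\n")
        return event["event_id"]

    def replay_to_primary(upload) -> int:
        """Once the region is back, push every buffered event; duplicates
        are harmless because event_id is unique."""
        count = 0
        with LOG.open() as f:
            for line in f:
                upload(json.loads(line))
                count += 1
        return count

The only invariants you need for the later dispute are that events are never mutated and that replay is idempotent.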


I totally understand why Amazon halted everything. Sure, one could deliver shipments off-line and sort them out afterwards manually. And at a lower scale Amazon might have tried it (at least it would have been on the table when I worked there a couple of years ago).

But what then? You'd have a complete loss of traceability of shipments and operations, and once you regain it, a chunk of shipments isn't where it's supposed to be anymore. Now you have not one potential root cause for this, the outage, which could be resolved by retriggering those shipments (not a loss, as you didn't deliver anything to customers), but two: the outage and the off-line deliveries. If it were just one FC, sure, that would be doable. If the whole network in an entire region goes down, there's no way to handle that. It is much easier and safer to just stop operations until the outage is resolved, re-route orders to other regions in the meantime, and then work through the backlog. Amazon's ops are good at that, specifically because they have almost complete transparency on their material flows. Going off-line would have jeopardized that transparency, making a quick recovery after the outage all the harder.


Any idea if Amazon has insurance to cover this type of event?


I can't speak for Amazon. Generally, though, I don't see how one could insure against this. I know that e.g. Allianz offers policies against IT outages. Even then, what is the actual damage? Probably the delivery drivers paid without delivering, plus salaries and potential overtime to work through the backlog. Depending on the conditions a company the size of Amazon would get, maybe it's not worth it.


Databases, the stateful part of services, have matured to support multi-region deployments. This isn't new either. Nobody's saying that it's easy to have multi-region redundancy for stateful services. It's just something you need to have to prevent nasty single-region outages from affecting your service. This is an excellent example where it would have been better to have degraded performance (higher latency) instead of complete unavailability and an interruption of business.
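
As a rough sketch of what "degraded but available" could look like on the read path (the endpoints are made up, and a real setup still has to deal with replication lag and write conflicts):

    import urllib.request, urllib.error

    # Hypothetical per-region read endpoints for the same dataset.
    ENDPOINTS = [
        "https://api.us-east-1.example.internal/orders",   # primary, lowest latency
        "https://api.us-west-2.example.internal/orders",   # replica, higher latency
        "https://api.eu-west-1.example.internal/orders",   # replica, highest latency
    ]

    def read_order(order_id: str, timeout: float = 2.0) -> bytes:
        """Try the primary first, then the replicas. The caller gets a slower
        (and possibly slightly stale) answer instead of an error."""
        last_err = None
        for base in ENDPOINTS:
            try:
                with urllib.request.urlopen(f"{base}/{order_id}", timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError) as err:
                last_err = err   # region unreachable, try the next one
        raise RuntimeError("all regions unavailable") from last_err

Writes are the genuinely hard part, but for a lot of read traffic this kind of fallback is enough to stay up at higher latency.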


Tell that to the management of a medium-sized company when you show them the bill for something that, according to AWS, has a 0.01% chance of happening...

Not every workload is of the micro size.

Going multi-region on our databases would cost around 15 mil per year, for a company that makes 50 mil...


I don't think it's correct to call Amazon a "medium company", and last time I checked they make more than 50 mil.


I don't think the parent's "our company" is Amazon, because Amazon does indeed make more than $50m, but I can attest to cloud provider multi region replication being extremely expensive for us (also not Amazon), if only due to data transfer costs


Can you elaborate? E.g. Postgres replication is pretty straightforward and not a new technology. I'm outside the AWS ecosystem, and with just dedicated boxes, having some DC burn down is manageable. How do magic clouds make that hard?
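
For what it's worth, here's roughly what I have in mind by "straightforward" (the DSNs are placeholders): check streaming-replication lag on the primary, and fall back to the standby for reads when the primary is unreachable.

    import psycopg2

    PRIMARY = "host=db-primary dbname=shop user=app"   # placeholder DSNs
    STANDBY = "host=db-standby dbname=shop user=app"

    def replication_lag_bytes() -> int:
        """On the primary, pg_stat_replication shows how far each standby
        lags behind in WAL bytes."""
        with psycopg2.connect(PRIMARY) as conn, conn.cursor() as cur:
            cur.execute("""
                SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
                FROM pg_stat_replication
            """)
            return int(cur.fetchone()[0])

    def read_anywhere(sql: str, params=()):
        """Prefer the primary; if it's unreachable, serve a possibly stale
        read from the standby rather than failing outright."""
        for dsn in (PRIMARY, STANDBY):
            try:
                with psycopg2.connect(dsn, connect_timeout=2) as conn, conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            except psycopg2.OperationalError:
                continue
        raise RuntimeError("neither primary nor standby reachable")

This says nothing about promoting the standby for writes, which I assume is where the hard part actually starts.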


"postgres replication" would probably be the least of their worries. It's not about "magic clouds". These are services that are handling millions of requests per second and there's a lot going on where they have to maintain consistency and fail predictably. Having some services go down in one region but being back up in another still serving requests and committing transactions is unpredictable and could create a lot of inconsistencies that would be very difficult to resolve later especially when you have customer facing services like this where someone's package could be lost resulting in bad reviews and other things you don't want to deal with. People here mostly have never even imagined the level of workloads they're handling and are throwing around "easy" solutions like replication or multi-region availability. For something of this scale, it's just not that simple. It would also be incredibly expensive to do this when you could simply shut operations down for a brief period of time. Not like something of this scale has happened that often


OK, but this is a programmer-to-manager explanation. I get how computers work. I know simple things can get very complex at scale. I just thought that the huge extra you pay for cloud services is mostly for battle-tested solutions to these scale problems.

As far as I understand it, if your design is sane, the database/storage handles fallback and recovery.

Or maybe in other words: you need to make your service handle a single machine going down without any problem, cloud or not. And there seem to be two options: it's either your machine, or it's part of a service which AWS provides to you. In the second case it's on AWS to handle that, and in the first case, shouldn't AWS make it so that the DC is just a parameter for you, and they handle all the virtual networking and other magic?

To be super clear: I'm not arguing, just trying to learn. I would love some specific examples of what makes the problem hard, because all these stories make me stay away from the cloud, which in theory is a solution well worth paying extra for in a bigger-company context.


State ("the database") is hard in distributed systems. Was the package picked up or is it still sitting on a shelf waiting? If your distributed system is partitioned, different queries may give different answers and your warehouse workers are going to be running around looking for boxes that aren't there.

Even if you create a system that is eventually consistent when availability is restored (a difficult problem all by itself, and probably needs a lot of application layer logic), it may not be worth the trouble. Warehouse workers interact with the "perfect" state of the real world, and if the computers don't have access to that, they aren't very useful.
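
A toy illustration of the "different answers" problem (the shelf state is entirely made up, just to make it concrete):

    # Two copies of the same shelf state; a partition means one replica
    # never saw the pick-up event.
    shelf_us_east = {"pkg-123": "picked_up"}   # the write landed here
    shelf_us_west = {"pkg-123": "on_shelf"}    # replication never arrived

    def where_is(package_id: str, replica: dict) -> str:
        return replica.get(package_id, "unknown")

    print(where_is("pkg-123", shelf_us_east))  # "picked_up"
    print(where_is("pkg-123", shelf_us_west))  # "on_shelf": a worker goes looking for a box that isn't there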


The proper fallback is costly, both hardware-wise and in development effort. It can be cheaper to just skip it and tolerate occasional service unavailability.


A correctly designed infra and app will have zero issues with hard failover.

It has been my experience, however, that as more people use the cloud, all that "ease of use" both adds layers of complexity and further abstracts the backend away.

Thus, by outsourcing sysadmin tasks to AWS, no in-house expertise exists. People don't know how to handle correct failover unless the platform 100% does it all for them.


Whether it's manageable depends entirely on the scenario and the needs. Maybe you can afford five minutes of data loss, but can Amazon? It's also quite possible that they have an enormous volume of data, which complicates everything.

And there's probably more than just a database: maybe a message queue for asynchronous processing, object storage for photos, etc.

It's certainly not an insurmountable problem, but maybe they consider the failure rate so low (it is; us-east-1 going down is a once-in-a-few-years event) that the complexity of multi-region isn't worth it.



