I think the reasoning matters as much as the answer, and you had to make at leas...

I think the reasoning matters as much as the answer, and you had to make at least a couple strange turns to get the "right answer" that retries don't solve the problem:

* the 3rd-party component offering only 90% success—I've never actually seen a system that bad. 99.9% success SLA is kind of the minimum, and in practice any system that has acceptable mean and/or 99%/99.9% latency for a critical auth path also has >=99.99% success in good conditions (even if they don't promise refunds based on that).

* the whole "really reliable retry handler" thing—as mentioned in my first comment, I don't understand what you were getting at here.

I would go a whole other way with this section—more realistic, much shorter. Let's say you want to offer 99.999% success within 1 second, and the third-party component offers 99.9% success per try. Then two tries gives you 99.9999% success if the failures are all uncorrelated but retries do not help at all when the third-party system is down for minutes or hours at a time. [1] Thus, you need to involve an alternative that is believed to be independent of the faulty system—and the primary tool AWS gives you for that is regional independence. This sets up the talk about regional failover much more quickly and with less head-scratching. I probably would have made it through the whole article yesterday even in my feverish state.

[1] unless this request can be done asynchronously, arbitrarily later, in which case the whole chain of thought afterward goes a different way.