I've been happily running a service that's non-critical, only to discover when w...

jefftk · on July 20, 2023

This was famously a problem for Google's distributed lock service, Chubby. They handled it by deliberately causing outages, to flush out ways it might have started to bear loads it wasn't designed for: https://sre.google/sre-book/service-level-objectives/#xref_r...

throwawaymobule · on July 22, 2023

I'm a fan of the 'chaos monkey' (Netflix software) approach to this.

Can't expect your platform to be reliable if it just breaks at random.