Hacker News new | past | comments | ask | show | jobs | submit login

I've been happily running a service that's non-critical, only to discover when we have an outage (that should be a non-event) that another team has started relying on it for something business critical.



This was famously a problem for Google's distributed lock service, Chubby. They handled it by intentionally having outages to flush out ways it might have started to bear loads it wasn't designed for: https://sre.google/sre-book/service-level-objectives/#xref_r...


I'm a fan of the 'chaos monkey' (Netflix software) approach of this.

Can't expect your platform to be reliable, if it just breaks at random.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: