This is a bit odd coming from the company of chaos engineering - has Chaos Monkey been abandoned at Netflix?
I have long advocated randomly restarting things with different thresholds, partly for reasons like this* and partly to ensure people are not complacent wrt architecture choices. The resistance, which you can see elsewhere here, is huge, but at scale restarts will happen regardless of how clever you try to be. (A lesson from the Erlang people that is often overlooked.)
* Many moons ago I worked on a video player which had a low-level resource leak in some decoder dependency. Luckily the leak was attached to the process, so it was a simple matter of cycling the process every 5 minutes and seamlessly attaching a new one. That just kept going for months on end; the dependency vendor did eventually fix the leak, but only many years later.
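For anyone curious what that kind of cycling looks like, here is a rough sketch, not the original player code. The `Recycler` name, the `start`/`stop` callables, and the 300-second default are made up for illustration; in a real setup `start` might launch the decoder with `subprocess.Popen` and `stop` might call `terminate()` on it. The replacement is started before the old one is stopped, so callers never see a gap.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Recycler(Generic[T]):
    """Hands out a worker and swaps in a fresh one once it gets too old.

    Hypothetical helper: `start` creates a worker, `stop` disposes of one.
    """

    def __init__(
        self,
        start: Callable[[], T],
        stop: Callable[[T], None],
        max_age_s: float = 300.0,  # assumed: 5 minutes, as in the anecdote
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._start = start
        self._stop = stop
        self._max_age_s = max_age_s
        self._clock = clock
        self._current = start()
        self._born = clock()
        self.generation = 1

    def get(self) -> T:
        """Return the current worker, cycling it first if it has expired."""
        if self._clock() - self._born >= self._max_age_s:
            replacement = self._start()  # bring up the new worker first
            old, self._current = self._current, replacement
            self._born = self._clock()
            self.generation += 1
            self._stop(old)  # only now retire the leaky one
        return self._current
```

The clock is injectable only so the swap can be exercised without waiting five minutes. Note this only helps when the leak really is tied to the worker, as it was in that case.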
In cases like this won't Chaos Monkey actually hide the problem, since it's basically doing exactly the same as their mitigation strategy - randomly restarting services?
Right. The point of the question is why not ramp up the monkey? They seem to imply it isn’t there now, which wouldn’t surprise me with the cultural shifts that have occurred in the tech world.