Interesting read. The fix seems straightforward, but I'd have a few more questions if I were trying to do something similar.
Is software deployed regularly on this cluster? Does that deployment happen faster than the rate at which they were losing CPUs? Why not just periodically force a deployment, given it's a repeated process that probably already happens frequently?
What happened to the clients trying to connect to the stuck instances? Did they just get stuck or time out? Would it have been better to have more targeted terminations/full terminations instead?
The answer to basically all your questions is: it doesn’t matter. They did their best to stabilize in a short amount of time, and it worked. That’s what mattered.