(I help maintain SmartStack) I think it's really interesting that "what we've al...

jacquesm · on Nov 1, 2016

> We fixed that with a heartbeat and a watchdog, so not too bad.

I disagree. That's a band-aid solution, good for a short time while you figure out the root cause and solve it for real.

jolynch · on Nov 1, 2016

I respectfully disagree. I'm all for root cause analysis and taking the time to fix things upstream, but I also think that it's easy to say that and hard to actually do it.

Yelp doesn't make more money and our infra isn't particularly more maintainable when I invest a few weeks debugging Ruby interpreter/library bugs, especially not when there are thousands of other higher priority bugs I could be determining the root cause of and fixing.

For context, we spent a few days trying to get a reproducible test case for a proper report upstream, but the issue was so infrequent and hard to reproduce that we made the call not to pursue it further and just mitigate it. I do believe that mitigating rather than root-causing is sometimes the right engineering tradeoff.
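
For anyone unfamiliar with the pattern, here is a rough sketch of what a heartbeat plus watchdog mitigation can look like. This is not the actual SmartStack code; the `Heartbeat` and `Watchdog` classes, the `timeout_s` parameter, and the `on_stall` callback are all hypothetical names made up for illustration. The idea is that the worker loop calls `beat()` every time it makes progress, and a separate checker calls `check()` periodically and reacts if the last beat is too old.

```python
import time


class Heartbeat:
    """Records the last time the worker loop proved it was making progress."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the logic can be exercised without sleeping.
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Called by the worker loop on every successful iteration."""
        self._last_beat = self._clock()

    def age(self):
        """Seconds elapsed since the last beat."""
        return self._clock() - self._last_beat


class Watchdog:
    """Fires on_stall when the heartbeat is older than timeout_s (hypothetical API)."""

    def __init__(self, heartbeat, timeout_s, on_stall):
        self.heartbeat = heartbeat
        self.timeout_s = timeout_s
        self.on_stall = on_stall

    def check(self):
        """Returns True and invokes on_stall if the worker looks hung."""
        if self.heartbeat.age() > self.timeout_s:
            self.on_stall()
            return True
        return False
```

In a real deployment `on_stall` would presumably do something like exit the process so a supervisor restarts it, and `check()` would run from a thread or process that cannot get stuck in the same way as the worker. Both of those are assumptions on my part rather than a description of any particular setup.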

jacquesm · on Nov 1, 2016

A bug like that is something you want to squash, because the cause might have other unintended consequences that you are currently unaware of. Assuming that there are no other consequences is the error, and the only way to make sure there are none is to identify the cause. This sort of sweeping things under the carpet is what comes back to bite you a long time afterwards, either with corrupted data or some other consequence.

Now, given the context, it doesn't matter whether or not the company or the product dies, so I can see where you're coming from, but in any serious enterprise that would not be tolerated. And when your code base already has 'thousands of other higher priority bugs' it's a lost cause, point taken. But at some level you have to wonder whether you have 'thousands of higher priority bugs' because there is such a cavalier attitude to fixing them in the first place.

jolynch · on Nov 1, 2016

> in any serious enterprise that would not be tolerated

I think that's a bit of a No True Scotsman fallacy. We use a lot of software we didn't write, and a lot of it has bugs. The languages that we write code in have bugs (e.g. Python has a fun bug where the interpreter hangs forever on import [1]; we've hit this in production many times). Software we write has bugs and scalability issues as well. We try really hard to squash them. We have design reviews, code reviews, and strive to have good unit, integration, and acceptance tests. There are still bugs.

I'm glad that there are some pieces of software that are written to such a high standard that bugs are extremely rare (I think that HAProxy is a great example of such a project), but I know of very few in the real world.

[1] https://bugs.python.org/issue14903