30 years ago or so I worked at a tiny networking company where several coworkers came from a small company (call it C) that made AppleTalk routers. They recounted being puzzled that their competitor (company S) had a reputation for having a rock-solid product, but when they got it into the lab they found their competitor's product crashed maybe 10 times more often than their own.
It turned out that the competing device could reboot faster than the end-to-end connection timeout in the higher-level protocol, so in practice failures were invisible. Their router, on the other hand, took long enough to reboot that your print job or file server copy would fail. It was as simple as that, and in practice the other product was rock-solid and theirs wasn't.
(This is a fairly accurate summary of what I was told, but there's a chance my coworkers were totally wrong. The conclusion still stands, I think - fast restarts can save your ass.)
This is along the lines of how one of my favorite wireless telecom products worked.
Each running process had a backup on another blade in the chassis, and all internal state was replicated. The process was written in a crash-only fashion: if anything unexpected happened, it would just minicore and exit.
One day I noticed we'd had over a hundred thousand crashes in the previous 24 hours, but no one complained; we just sent the minicores over to the devs and got the bugs fixed. In theory the users who triggered the crashes would be affected, since their devices might glitch and need to re-associate with the network, but the crashes caused no widespread impact.
To this day I'm a fan of crash-only software as a philosophy, even though I haven't had the opportunity to implement it in the software I work on.
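For anyone curious what that pattern looks like, here's a minimal, purely illustrative sketch (not the actual product code; the `worker`/`supervisor` names and the minicore file format are made up): the worker handles only what it expects, dumps a small state file and exits on anything else, and a separate supervisor loop restarts it immediately.

```python
import json
import os
import subprocess
import sys
import time


def worker() -> None:
    """Handle requests; on anything unexpected, dump a minicore and exit fast."""
    state = {"handled": 0}
    try:
        for line in sys.stdin:
            request = json.loads(line)      # any malformed input raises
            state["handled"] += 1
            print(f"ok {request['id']}", flush=True)
    except Exception as exc:                # crash-only: don't try to recover in-process
        # "Minicore": persist just enough state to debug the crash offline.
        with open(f"minicore-{os.getpid()}.json", "w") as f:
            json.dump({"state": state, "error": repr(exc)}, f)
        sys.exit(1)                         # die quickly; the supervisor restarts us


def supervisor() -> None:
    """Restart the worker forever; fast restarts keep outages below protocol timeouts."""
    while True:
        started = time.monotonic()
        proc = subprocess.run([sys.executable, __file__, "worker"])
        if proc.returncode == 0:            # clean exit (e.g. end of input): stop restarting
            break
        print(f"worker crashed (code {proc.returncode}) after "
              f"{time.monotonic() - started:.2f}s; restarting", file=sys.stderr)


if __name__ == "__main__":
    worker() if "worker" in sys.argv[1:] else supervisor()
```

The real system replicated state to a backup process on another blade rather than writing a file, but the core idea is the same: the recovery path is the only error-handling path, so it gets exercised constantly.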
Clearly, but maybe the thing that makes your product crash less is also what makes it take longer to reboot.
Also, the story isn't that they couldn't, just that they were measuring the actual failure rate rather than the effective failure rate, because the device could recover before the failure caused any real issues.