30 years ago or so I worked at a tiny networking company where several coworkers came from a small company (call it C) that made AppleTalk routers. They recounted being puzzled that their competitor (company S) had a reputation for having a rock-solid product, but when they got it into the lab they found their competitor's product crashed maybe 10 times more often than their own.
It turned out that the competing device could reboot faster than the end-to-end connection timeout in the higher-level protocol, so in practice failures were invisible. Their router, on the other hand, took long enough to reboot that your print job or file server copy would fail. It was as simple as that, and in practice the other product was rock-solid and theirs wasn't.
(This is a fairly accurate summary of what I was told, but there's a chance my coworkers were totally wrong. The conclusion still stands, I think - fast restarts can save your ass.)
This is along the lines of how one of my favorite wireless telecom products worked.
Each running process had a backup on another blade in the chassis, and all internal state was replicated. The process was written in a crash-only fashion: if anything unexpected happened, it would just minicore and exit.
One day I noticed we'd had over a hundred thousand crashes in the previous 24 hours, but no one complained; we just sent the minicores over to the devs and got the bugs fixed. In theory the users who triggered the crashes would be affected, since their devices might glitch and need to re-associate with the network, but the crashes caused no widespread impact.
To this day I'm a fan of crash-only software as a philosophy, even though I haven't had the opportunity to implement it in the software I work on.
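For anyone curious what that pattern looks like, here's a minimal, purely illustrative sketch (not the actual product code; the `worker`/`supervisor` names and the minicore file format are made up): the worker handles only what it expects, dumps a small state file and exits on anything else, and a separate supervisor loop restarts it immediately.

```python
import json
import os
import subprocess
import sys
import time


def worker() -> None:
    """Handle requests; on anything unexpected, dump a minicore and exit fast."""
    state = {"handled": 0}
    try:
        for line in sys.stdin:
            request = json.loads(line)      # any malformed input raises
            state["handled"] += 1
            print(f"ok {request['id']}", flush=True)
    except Exception as exc:                # crash-only: don't try to recover in-process
        # "Minicore": persist just enough state to debug the crash offline.
        with open(f"minicore-{os.getpid()}.json", "w") as f:
            json.dump({"state": state, "error": repr(exc)}, f)
        sys.exit(1)                         # die quickly; the supervisor restarts us


def supervisor() -> None:
    """Restart the worker forever; fast restarts keep outages below protocol timeouts."""
    while True:
        started = time.monotonic()
        proc = subprocess.run([sys.executable, __file__, "worker"])
        if proc.returncode == 0:            # clean exit (e.g. end of input): stop restarting
            break
        print(f"worker crashed (code {proc.returncode}) after "
              f"{time.monotonic() - started:.2f}s; restarting", file=sys.stderr)


if __name__ == "__main__":
    worker() if "worker" in sys.argv[1:] else supervisor()
```

The real system replicated state to a backup process on another blade rather than writing a file, but the core idea is the same: the recovery path is the only error-handling path, so it gets exercised constantly.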
Clearly, but maybe the thing that makes your product crash less is also what makes it take longer to reboot.
Also, the story isn't that they couldn't, just that they were measuring the actual failure rate rather than the effective failure rate, because the device could recover before the failure caused any real issues.