I've been super interested in this field since finding out about it from the `sled` simulation guide [0] (which outlines how FoundationDB does what they do).
Currently bringing a similar kind of testing into our workplace by writing our services to run on top of `madsim` [1]. This lets us keep writing async/await-style services in tokio, but then (in tests) swap in a deterministic executor that patches all sources of non-determinism (including dependencies that call out to the OS). It's pretty seamless.
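To illustrate the core trick (this is a toy sketch, not madsim's actual API): if all time and randomness flow from a virtual clock and a seeded PRNG, then every run with the same seed replays the exact same event interleaving, so any failure you find is replayable.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy deterministic simulator: virtual time plus a seeded PRNG means a run
// is fully determined by its seed. Real frameworks like madsim go much
// further (patching syscalls, sockets, etc.); this only shows the idea.
struct Sim {
    clock: u64,                             // virtual time, no wall clock
    rng: u64,                               // seeded LCG state
    queue: BinaryHeap<Reverse<(u64, u32)>>, // (deliver_at, message_id), min-heap
}

impl Sim {
    fn new(seed: u64) -> Self {
        Sim { clock: 0, rng: seed, queue: BinaryHeap::new() }
    }

    // Deterministic LCG: the only source of "randomness" in the simulation.
    fn rand_delay(&mut self) -> u64 {
        self.rng = self.rng.wrapping_mul(6364136223846793005).wrapping_add(1);
        (self.rng >> 33) % 100
    }

    // "Send" a message: it will be delivered after a pseudo-random delay.
    fn send(&mut self, id: u32) {
        let at = self.clock + self.rand_delay();
        self.queue.push(Reverse((at, id)));
    }

    // Drain events in virtual-time order, recording the delivery trace.
    fn run(&mut self) -> Vec<(u64, u32)> {
        let mut trace = Vec::new();
        while let Some(Reverse((at, id))) = self.queue.pop() {
            self.clock = at; // jump virtual time forward instantly
            trace.push((at, id));
        }
        trace
    }
}

fn main() {
    let run = |seed| {
        let mut s = Sim::new(seed);
        for id in 0..5 {
            s.send(id);
        }
        s.run()
    };
    // Same seed => identical interleaving; a different seed reorders events.
    assert_eq!(run(42), run(42));
    println!("deterministic replay ok");
}
```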
The author of this article isn't joking when they say that the startup cost of this effort is monumental. Dealing with every possible source of non-determinism, re-writing services to be testable/sans-IO [2], etc. takes a lot of engineering effort.
Once the system is in place though, it's hard to describe just how confident you feel in your code. Combined with tools like quickcheck [3], you can test hundreds of thousands of subtle failure cases in I/O, event ordering, timeouts, dropped packets, filesystem failures, etc.
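The quickcheck idea can even be hand-rolled in a few lines (this is not quickcheck's actual API, just the shape of it): a seeded generator produces thousands of inputs, and an invariant is checked on every one, returning the counterexample if the property ever breaks.

```rust
// Minimal property-testing loop, illustrating what quickcheck automates:
// generate many pseudo-random inputs from a seed, check an invariant on each.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Property under test: reversing a vector twice returns the original.
fn prop_double_reverse(input: &[u64]) -> bool {
    let mut v = input.to_vec();
    v.reverse();
    v.reverse();
    v == input
}

/// Run `cases` random trials; on failure, return the counterexample input.
fn check_property(seed: u64, cases: usize) -> Result<(), Vec<u64>> {
    let mut state = seed;
    for _ in 0..cases {
        let len = (lcg(&mut state) % 16) as usize;
        let input: Vec<u64> = (0..len).map(|_| lcg(&mut state)).collect();
        if !prop_double_reverse(&input) {
            return Err(input); // shrinkable counterexample in a real framework
        }
    }
    Ok(())
}

fn main() {
    assert!(check_property(0xDEAD_BEEF, 100_000).is_ok());
    println!("100000 cases passed");
}
```

Real quickcheck adds the crucial extra step of *shrinking* a failing input down to a minimal counterexample, which this sketch omits.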
This kind of testing is an incredibly powerful tool to have in your toolbelt, if you have the patience and fortitude to invest in it.
As for Antithesis itself, it looks very very cool. Bringing the deterministic testing down the stack to below the OS is awesome. Should make it possible to test entire systems without wiring up a harness manually every time. Can’t wait to try it out!
> you can test hundreds of thousands of subtle failure cases in I/O, event ordering, timeouts, dropped packets, filesystem failures, etc.
As cool as all this is, I can't help but wonder how often the culture of micro-services and distributed computing is ill advised. So much complexity I've seen in such systems boils down to what calling a "function" becomes: it's async, depends on the OS, executes at some point or never, and always returns a bunch of strings that need to be parsed to re-enter the static type system, which comes with its own set of failure modes. This makes the seemingly simple task of abstracting logic into a named component, aka a function, extremely complex. You don't need to test for any of the subtle failures you mentioned if you leave the logic inside the same process and just call a function. I know monoliths aren't always a good idea or fit; at the same time I'm highly skeptical whether the current prevalence of service-based software architectures is justified and pays off.
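Spelled out in code (hypothetical types, no real RPC framework involved), the difference in failure surface looks roughly like this:

```rust
// What "calling a function" becomes once it crosses a process boundary.
// The enum variants are the extra failure modes an in-process call never has.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum RpcError {
    Timeout,       // the call ran "at some point or never"
    Disconnected,  // the OS/network dropped the connection mid-call
    Parse(String), // the reply string failed to re-enter the type system
}

// A plain function: its only failure modes are the ones in its signature.
fn local_add(a: i32, b: i32) -> i32 {
    a + b
}

// "Server" side: parses a request string, returns a reply string.
fn serve_add(request: &str) -> String {
    let nums: Vec<i32> = request
        .split_whitespace()
        .filter_map(|s| s.parse().ok())
        .collect();
    nums.iter().sum::<i32>().to_string()
}

// "Client" side: the same logic behind a wire. Here only the parsing
// failure mode is modeled; a real call could also time out or disconnect.
fn remote_add(a: i32, b: i32) -> Result<i32, RpcError> {
    let reply = serve_add(&format!("{} {}", a, b));
    reply.parse::<i32>().map_err(|e| RpcError::Parse(e.to_string()))
}

fn main() {
    assert_eq!(local_add(2, 3), 5);
    assert_eq!(remote_add(2, 3), Ok(5));
    println!("both paths agree; only one has an error type");
}
```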
> I can't stop but wonder how often the culture of micro-services and distributed computing is ill advised.
You can't get away from distributed computing, unless you get away from computing. A modern computer isn't a single unit, it's a system of computers talking to each other. Even if you go back a long time, you'll find many computers or proto-computers talking to each other, just with much stricter timings, since those computers were less flexible.
If you save a file to a disk, you're really asking the OS (somehow) to send a message to the computer on the storage device, asking it to store your data, and it will respond with success or failure and it might also write the data. (Sometimes it will tell your OS "success" and then proceed to throw the data away, which is always fun.)
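A concrete version of that conversation, using only the standard library: a successful `write` only means the OS accepted the bytes into its cache, while `File::sync_all` (fsync on Unix) asks the OS to push them to the device and wait for its acknowledgment.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// write_all returning Ok means the OS took the bytes, not that the storage
// device has them. sync_all forces the data (and metadata) to the device
// and blocks until the device acknowledges -- modulo hardware that lies.
fn durable_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(data)?; // success here != data on disk
    f.sync_all()?;      // fsync: now the device has (claimed to) store it
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("demo_durable.txt");
    durable_write(&path, b"hello")?;
    println!("written and fsynced");
    Ok(())
}
```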
That said, keeping things together where it makes sense is definitely a good thing.
I see your point. Even multithreading can be seen as a form of distributed programming. At the same time, in my experience these parts can often be isolated. You trust your DB to handle such issues, and I'm very happy we are getting a new era of DBs like TigerBeetle, FoundationDB and sled that are designed to survive Jepsen. But how many teams are building DBs? That point is a bit ironic, given I'm currently building an in-memory DB at work. But it's a completely different level of complexity. And your example of writing a file, that too is a somewhat solved problem: use ZFS. I'd argue there are many situations where the fault-tolerant distributed requirements can be served by existing abstractions.
TigerBeetle is actually another customer of ours. You might ask why, given that they have their own, very sophisticated simulation testing. The answer is that they're so fanatical about correctness, they wanted a "red team" for their own fault simulator, in case a bug in their tests might hide a bug in their database!
I gotta say, that is some next-level commitment to writing a good database.
Sure! I mentioned a few orthogonal concepts that go well together, and each of the following examples has a different combination that they employ:
- the company that developed Madsim (RisingWave) [0] [1] tries hardest to eliminate non-determinism, with the broadest scope (stubbing out syscalls, etc.)
- sled [2] itself has an interesting combo of deterministic tests combined with quickcheck+failpoints test case auto-discovery
- Dropbox [3] uses a similar approach but they talk about it a bit more abstractly.
Sans-IO is better documented in the Python world [4], but str0m [5] and quinn-proto [6] are the best examples in Rust I’m aware of. Note that sans-IO is orthogonal to deterministic test frameworks, but it composes well with them.
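A minimal sketch of the sans-IO shape (a toy line codec, nothing like str0m's or quinn-proto's actual APIs): the protocol logic is a pure state machine that consumes bytes and yields complete frames, while the caller owns every socket and timer, so a deterministic test harness can drive it byte by byte.

```rust
// Sans-IO in miniature: this codec never touches a socket or the clock.
// In production the caller feeds it bytes read from a real socket; in tests
// a simulator feeds it arbitrary chunkings, orderings, and partial reads.
struct LineCodec {
    buf: Vec<u8>,
}

impl LineCodec {
    fn new() -> Self {
        LineCodec { buf: Vec::new() }
    }

    // Feed whatever bytes arrived, in whatever chunks they arrived in,
    // and drain any complete newline-terminated frames.
    fn feed(&mut self, bytes: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(bytes);
        let mut lines = Vec::new();
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = self.buf.drain(..=pos).collect();
            lines.push(String::from_utf8_lossy(&line[..pos]).into_owned());
        }
        lines
    }
}

fn main() {
    let mut codec = LineCodec::new();
    // Bytes can arrive split across arbitrary boundaries; the state
    // machine doesn't care, which is exactly what makes it easy to test.
    assert_eq!(codec.feed(b"hel"), Vec::<String>::new());
    assert_eq!(codec.feed(b"lo\nwor"), vec!["hello".to_string()]);
    assert_eq!(codec.feed(b"ld\n"), vec!["world".to_string()]);
    println!("sans-IO codec ok");
}
```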
With the disclaimer that anything I comment on this site is my opinion alone and does not reflect the company I work at: I do work at a Rust shop that has utilized these techniques on some projects.
TigerBeetle is an amazing example and I’ve looked at it before! They are really the best example of this approach outside of FoundationDB I think.
[0]: https://sled.rs/simulation.html
[1]: https://github.com/madsim-rs/madsim?tab=readme-ov-file#madsi...
[2]: https://sans-io.readthedocs.io/
[3]: https://github.com/BurntSushi/quickcheck?tab=readme-ov-file#...