I've been super interested in this field since finding out about it from the `sled` simulation guide [0] (which outlines how FoundationDB does what they do).
Currently bringing a similar kind of testing into our workplace by writing our services to run on top of `madsim` [1]. This lets us keep writing async/await-style services in tokio, but then (in tests) swap in a deterministic executor that patches all sources of non-determinism (including dependencies that call out to the OS). It's pretty seamless.
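To illustrate the core trick (this is a toy sketch, not madsim's actual API): if all time and randomness flow from a virtual clock and a seeded PRNG, then every run with the same seed replays the exact same event interleaving, so any failure you find is replayable.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy deterministic simulator: virtual time plus a seeded PRNG means a run
// is fully determined by its seed. Real frameworks like madsim go much
// further (patching syscalls, sockets, etc.); this only shows the idea.
struct Sim {
    clock: u64,                             // virtual time, no wall clock
    rng: u64,                               // seeded LCG state
    queue: BinaryHeap<Reverse<(u64, u32)>>, // (deliver_at, message_id), min-heap
}

impl Sim {
    fn new(seed: u64) -> Self {
        Sim { clock: 0, rng: seed, queue: BinaryHeap::new() }
    }

    // Deterministic LCG: the only source of "randomness" in the simulation.
    fn rand_delay(&mut self) -> u64 {
        self.rng = self.rng.wrapping_mul(6364136223846793005).wrapping_add(1);
        (self.rng >> 33) % 100
    }

    // "Send" a message: it will be delivered after a pseudo-random delay.
    fn send(&mut self, id: u32) {
        let at = self.clock + self.rand_delay();
        self.queue.push(Reverse((at, id)));
    }

    // Drain events in virtual-time order, recording the delivery trace.
    fn run(&mut self) -> Vec<(u64, u32)> {
        let mut trace = Vec::new();
        while let Some(Reverse((at, id))) = self.queue.pop() {
            self.clock = at; // jump virtual time forward instantly
            trace.push((at, id));
        }
        trace
    }
}

fn main() {
    let run = |seed| {
        let mut s = Sim::new(seed);
        for id in 0..5 {
            s.send(id);
        }
        s.run()
    };
    // Same seed => identical interleaving; a different seed reorders events.
    assert_eq!(run(42), run(42));
    println!("deterministic replay ok");
}
```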
The author of this article isn't joking when they say that the startup cost of this effort is monumental. Dealing with every possible source of non-determinism, re-writing services to be testable/sans-IO [2], etc. takes a lot of engineering effort.
Once the system is in place though, it's hard to describe just how confident you feel in your code. Combined with tools like quickcheck [3], you can test hundreds of thousands of subtle failure cases in I/O, event ordering, timeouts, dropped packets, filesystem failures, etc.
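The quickcheck idea can even be hand-rolled in a few lines (this is not quickcheck's actual API, just the shape of it): a seeded generator produces thousands of inputs, and an invariant is checked on every one, returning the counterexample if the property ever breaks.

```rust
// Minimal property-testing loop, illustrating what quickcheck automates:
// generate many pseudo-random inputs from a seed, check an invariant on each.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Property under test: reversing a vector twice returns the original.
fn prop_double_reverse(input: &[u64]) -> bool {
    let mut v = input.to_vec();
    v.reverse();
    v.reverse();
    v == input
}

/// Run `cases` random trials; on failure, return the counterexample input.
fn check_property(seed: u64, cases: usize) -> Result<(), Vec<u64>> {
    let mut state = seed;
    for _ in 0..cases {
        let len = (lcg(&mut state) % 16) as usize;
        let input: Vec<u64> = (0..len).map(|_| lcg(&mut state)).collect();
        if !prop_double_reverse(&input) {
            return Err(input); // shrinkable counterexample in a real framework
        }
    }
    Ok(())
}

fn main() {
    assert!(check_property(0xDEAD_BEEF, 100_000).is_ok());
    println!("100000 cases passed");
}
```

Real quickcheck adds the crucial extra step of *shrinking* a failing input down to a minimal counterexample, which this sketch omits.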
This kind of testing is an incredibly powerful tool to have in your toolbelt, if you have the patience and fortitude to invest in it.
As for Antithesis itself, it looks very very cool. Bringing the deterministic testing down the stack to below the OS is awesome. Should make it possible to test entire systems without wiring up a harness manually every time. Can’t wait to try it out!
> you can test hundreds of thousands of subtle failure cases in I/O, event ordering, timeouts, dropped packets, filesystem failures, etc.
As cool as all this is, I can't help but wonder how often the culture of micro-services and distributed computing is ill advised. So much complexity I've seen in such systems boils down to what calling a "function" becomes: it's async, depends on the OS, executes at some point or never, and always returns a bunch of strings that need to be parsed to re-enter the static type system, which comes with its own set of failure modes. This makes the seemingly simple task of abstracting logic into a named component, aka a function, extremely complex. You don't need to test for any of the subtle failures you mentioned if you leave the logic inside the same process and just call a function. I know monoliths aren't always a good idea or fit; at the same time I'm highly skeptical whether the current prevalence of service-based software architectures is justified and pays off.
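Spelled out in code (hypothetical types, no real RPC framework involved), the difference in failure surface looks roughly like this:

```rust
// What "calling a function" becomes once it crosses a process boundary.
// The enum variants are the extra failure modes an in-process call never has.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum RpcError {
    Timeout,       // the call ran "at some point or never"
    Disconnected,  // the OS/network dropped the connection mid-call
    Parse(String), // the reply string failed to re-enter the type system
}

// A plain function: its only failure modes are the ones in its signature.
fn local_add(a: i32, b: i32) -> i32 {
    a + b
}

// "Server" side: parses a request string, returns a reply string.
fn serve_add(request: &str) -> String {
    let nums: Vec<i32> = request
        .split_whitespace()
        .filter_map(|s| s.parse().ok())
        .collect();
    nums.iter().sum::<i32>().to_string()
}

// "Client" side: the same logic behind a wire. Here only the parsing
// failure mode is modeled; a real call could also time out or disconnect.
fn remote_add(a: i32, b: i32) -> Result<i32, RpcError> {
    let reply = serve_add(&format!("{} {}", a, b));
    reply.parse::<i32>().map_err(|e| RpcError::Parse(e.to_string()))
}

fn main() {
    assert_eq!(local_add(2, 3), 5);
    assert_eq!(remote_add(2, 3), Ok(5));
    println!("both paths agree; only one has an error type");
}
```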
> I can't stop but wonder how often the culture of micro-services and distributed computing is ill advised.
You can't get away from distributed computing, unless you get away from computing. A modern computer isn't a single unit, it's a system of computers talking to each other. Even if you go back a long time, you'll find many computers or proto-computers talking to each other, just with much stricter timings, since those computers were less flexible.
If you save a file to a disk, you're really asking the OS (somehow) to send a message to the computer on the storage device, asking it to store your data, and it will respond with success or failure and it might also write the data. (Sometimes it will tell your OS "success" and then proceed to throw the data away, which is always fun.)
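A concrete version of that conversation, using only the standard library: a successful `write` only means the OS accepted the bytes into its cache, while `File::sync_all` (fsync on Unix) asks the OS to push them to the device and wait for its acknowledgment.

```rust
use std::fs::File;
use std::io::Write;
use std::path::Path;

// write_all returning Ok means the OS took the bytes, not that the storage
// device has them. sync_all forces the data (and metadata) to the device
// and blocks until the device acknowledges -- modulo hardware that lies.
fn durable_write(path: &Path, data: &[u8]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(data)?; // success here != data on disk
    f.sync_all()?;      // fsync: now the device has (claimed to) store it
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("demo_durable.txt");
    durable_write(&path, b"hello")?;
    println!("written and fsynced");
    Ok(())
}
```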
That said, keeping things together where it makes sense is definitely a good thing.
I see your point. Even multithreading can be seen as a form of distributed programming. At the same time, in my experience these parts can often be isolated. You trust your DB to handle such issues, and I'm very happy we are getting a new era of DBs like TigerBeetle, FoundationDB and sled that are designed to survive Jepsen. But how many teams are building DBs? That point is a bit ironic, given I'm currently building an in-memory DB at work. But it's a completely different level of complexity. And your example of writing a file, that too is a somewhat solved problem: use ZFS. I'd argue there are many situations where the fault-tolerant distributed requirements can be served by existing abstractions.
TigerBeetle is actually another customer of ours. You might ask why, given that they have their own, very sophisticated simulation testing. The answer is that they're so fanatical about correctness, they wanted a "red team" for their own fault simulator, in case a bug in their tests might hide a bug in their database!
I gotta say, that is some next-level commitment to writing a good database.
Sure! I mentioned a few orthogonal concepts that go well together, and each of the following examples has a different combination that they employ:
- the company that developed Madsim (RisingWave) [0] [1] tries hardest to eliminate non-determinism, with the broadest scope (stubbing out syscalls, etc.)
- sled [2] itself has an interesting combo of deterministic tests combined with quickcheck+failpoints test case auto-discovery
- Dropbox [3] uses a similar approach but they talk about it a bit more abstractly.
Sans-IO is better documented in the Python world [4], but str0m [5] and quinn-proto [6] are the best examples in Rust I’m aware of. Note that sans-IO is orthogonal to deterministic test frameworks, but it composes well with them.
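A minimal sketch of the sans-IO shape (a toy line codec, nothing like str0m's or quinn-proto's actual APIs): the protocol logic is a pure state machine that consumes bytes and yields complete frames, while the caller owns every socket and timer, so a deterministic test harness can drive it byte by byte.

```rust
// Sans-IO in miniature: this codec never touches a socket or the clock.
// In production the caller feeds it bytes read from a real socket; in tests
// a simulator feeds it arbitrary chunkings, orderings, and partial reads.
struct LineCodec {
    buf: Vec<u8>,
}

impl LineCodec {
    fn new() -> Self {
        LineCodec { buf: Vec::new() }
    }

    // Feed whatever bytes arrived, in whatever chunks they arrived in,
    // and drain any complete newline-terminated frames.
    fn feed(&mut self, bytes: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(bytes);
        let mut lines = Vec::new();
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            let line: Vec<u8> = self.buf.drain(..=pos).collect();
            lines.push(String::from_utf8_lossy(&line[..pos]).into_owned());
        }
        lines
    }
}

fn main() {
    let mut codec = LineCodec::new();
    // Bytes can arrive split across arbitrary boundaries; the state
    // machine doesn't care, which is exactly what makes it easy to test.
    assert_eq!(codec.feed(b"hel"), Vec::<String>::new());
    assert_eq!(codec.feed(b"lo\nwor"), vec!["hello".to_string()]);
    assert_eq!(codec.feed(b"ld\n"), vec!["world".to_string()]);
    println!("sans-IO codec ok");
}
```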
With the disclaimer that anything I comment on this site is my opinion alone and does not reflect the company I work at: I do work at a Rust shop that has utilized these techniques on some projects.
TigerBeetle is an amazing example and I’ve looked at it before! They are really the best example of this approach outside of FoundationDB I think.
[0]: https://sled.rs/simulation.html
[1]: https://github.com/madsim-rs/madsim?tab=readme-ov-file#madsi...
[2]: https://sans-io.readthedocs.io/
[3]: https://github.com/BurntSushi/quickcheck?tab=readme-ov-file#...