Performance: Adventures in Thread-per-Core Async with Redpanda and Seastar

jandrewrogers · on Dec 3, 2023

If you have never designed a thread-per-core architecture and want some gritty inside color on what designing these systems is like, this is the article for you. I’ve designed thread-per-core architectures for 15 years now and this captures the zeitgeist perfectly. As for why you might want to care about “thread-per-core architectures”, the performance metrics are qualitatively better than any other software architecture you can use, it isn’t even close. These architectures were first developed in the supercomputing/HPC world but the principles apply to most software.

An issue is that there are few libraries that are designed or optimized for thread-per-core architectures. Storage management is a reusable concept in these architectures but there is a lack of competent and scalable libraries that do this optimally. For someone coming into the space, you have to learn how to design e.g. high-performance storage allocators that can keep up with a JBOD of NVMe. I have mature libraries like this but they are tightly coupled to the rest of the software I design.

When I first designed these architectures I sharded up resources across cores but there are many cases where this requires giving up a lot of performance. The problem of shedding load and resources across cores is really interesting. An implication is that some structures should be shared globally but you really want the performance to be as close to contention-free as possible or you defeat the purpose of thread-per-core. This is possible and there are design heuristics that approximate it but it isn’t discussed much.

As mentioned in the article, no matter how you build these things your software will require a sophisticated understanding of lifetimes. Sometimes the builtin tools of the language will help with this, other times they can’t and you will have to build your own tools.

I have never used Seastar. Not because I am not familiar with it but because it makes tradeoffs that are not appropriate for the workloads I tend to support. It is a good choice for many workloads. The only thing this indicates is that we are still in the early days of these types of architectures. Even if you don’t care or need to improve your architectural efficiency, these types of architectures significantly reduce the carbon/cost footprint of software systems.

samsquire · on Dec 3, 2023

I'm a hobbyist in this space. I have never used Seastar either, But I'm trying to design an "evented assembly" for efficiency in a thread-per-core architecture.

I really want the next generation NodeJS equivalent to take advantage of all these learnings and learnings from Erlang. I think nodejs showed how awesome event loops and async can be and simple.

The TFA (The Friendly Article) makes it clear there are sharp edges with coroutines and lambdas in C++:

- something stored in a temporary between a yield barrier will be destroyed by the time the coroutine is resumed

- you have to make sure you store things inside the coroutine promise object

- lambda capturing rule complexity associated with copying and moving semantics

I am just learning but I've written a multithreaded barrier in C and Java that is mutex free. In 6×2 thread pairs it can communicate between threads in 42 nanoseconds.

I have the beginnings of a JIT compiler. But there's so much work to do.

My barrier resembles a "phaser" pattern in a loop and bulk synchronous parallel.

FridgeSeal · on Dec 3, 2023

> want some gritty inside color on what designing these systems is like,

I’ve only dabbled with them on hobby projects, but I absolutely love the idea. I’ve been on the hunt for more content about this, but I have trouble finding much, I suspect a lot of content is a bit “if you know you know” sort of insider knowledge stuff, so if you know of any other good content lurking out there I’d be so keen to know.

> this is the article for you. I’ve designed thread-per-core architectures for 15 years now

How’d you get into this/what do you do for work? As I said, I’ve dabbled with it on personal projects, but at least on everything I do for work it’s not always a good fit, or it is, but it’s too weird for my teammates.

bitcharmer · on Dec 3, 2023

> I suspect a lot of content is a bit “if you know you know” sort of insider knowledge stuff

It is, the kind of knowledge that takes decades to accumulate you won't see on people's blog posts. This is why ultra low-latency and high-performance systems engineers can ask more for their services.

> How’d you get into this/what do you do for work

I'm not the OP but for HFT shops thread-per-core design is the only right way. Prop trading, big market makers etc, we all go to great lengths to isolate the workloads as much as possible.

mgerdts · on Dec 3, 2023

You may also want to look at SPDK and DPDK.

kosolam · on Dec 3, 2023

Hello! Please tell me what kind of stuff you are doing that requires this and what are the different tradeoffs that seastar makes? I used seastar to build a software switch for SDNs some years ago…

ddorian43 · on Dec 3, 2023

What do you think of the thread-pipeline architecture that claims to be better than thread-per-core https://docs.rondb.com/design_thread_pipeline/ ? (creator of NDB Cluster)

infinite8s · on Dec 5, 2023

What sort of tradeoffs do you need to deal with that make Seastar inappropriate for your workloads?

girvo · on Dec 3, 2023

Funnily enough, this has a fair amount of overlap with the embedded work I do these days: our architecture is thread-per-core (well, adjacent, as the vendor SDK we build on has two other threads internally we can 't get rid of), data locality and latency matters, and so on. Looking forward to going through this in detail, even if my 368KB of RAM and SPI flash are quite a bit slower than your gigabytes and GB/s of NVMe flash!

Specifically our device/firmware is very networking/IO and runtime-configurability focused, which makes for some fun challenges. Squeezing the most out of the microcontroller and peripherals has been important, and keeping data/code locality has helped lots.

ilaksh · on Dec 3, 2023

Almost all of this seems to be the same or similar to distributing computation over the internet. At least the main concepts like "sending work to your data". And I think they are talking about CPUs, but that main idea also could pertain to GPUs.

I wonder if there is a level inside of modern CPU cores where cache is shared by different parallel pipelines or sub-caches for each.

It's sort of fractal. Moving the data around is the bottleneck, so the most efficient way is to arrange the data so that data movement is minimized, and that applies at different zoom levels.

I suspect that there is a lot of research in memory-optimized computing that are possibly different enough from typical approaches to be considered new paradigms.

I wonder if this could intersect with something like Mixture of Experts in LLMs. I don't actually know how that works, but maybe it could allow for grouping the related expert weight data closer together to speed things up.

Also, think about the complexity of real neurons when compared to something like those in an MLP. My impression is that real neurons pack in (locally) much more information and compute per unit.

I wonder if you could use some type of model to predict what data you will need for text generation (aside from the KV cache of course), and then have a fast cache for each task.

Is there a large open LLM or LMM trained using something like https://github.com/lucidrains/st-moe-pytorch ?

smallstepforman · on Dec 3, 2023

Not once did I see the author mention “actors”, an entire generation of skilled engineers tries to reinvent them (with conflicting naming) poorly, this is a failing of our industry’s academia and trade press.

Actors are asynchronous objects with a message queue, designed to run on a thread pool. You can pin them to a core, or allow migration via work stealing (to spread the workload). Did you know that in modern many core systems, a hot CPU core will halt to cool down? Work stealing will offload non-pinned actors, balancing the load.

There is one stumbling issue with Actors which is “sync” work, a transaction which requires one actor to update anothers state before continuing. This can be resolved by “locking” an actor, but this mechanism is consceptually “dirty”, but solves real world problems.

jandrewrogers · on Dec 3, 2023

> or allow migration via work stealing (to spread the workload)

FWIW, work stealing is often not recommended in thread-per-core architectures because it introduces quite a bit of unnecessary thread contention. You are correct that load balancing is a central problem in these architectures but it is usually achieved by shedding data since that can be done with minimal locking and inter-thread coordination. This moves the problem to figuring out what data to shed but this has satisfactory inexpensive solutions in many system designs.

u320 · on Dec 3, 2023

The people who invented TPC were not unaware of the actor model. They just found it unsuitable for their needs.

rat9988 · on Dec 3, 2023

I'm pretty surprised you think the author should have bothered talking about actors. Why ?

yaantc · on Dec 3, 2023

Because the article is on developing a "shared nothing" system where sharing is replaced by communications. Actor systems are also about shared nothing and using messages based communications, and they came first historically.

Now to be fair, a lot of actor systems were about correctness and not squeezing the most performance of a platform. Not sharing data means no explicit locking with mutexes (though one can still have deadlocks, as in several actors stuck waiting for each others) and something simpler to analyze: the system is made of communicating state machines. Actors were also used in high performance system like telecom switches (high performance, for their time), but here too correctness was probably the main concern. At the time actors were first used accessing even far memory was not as costly as today, so the cost of synchronization wasn't as bad. But it was already tricky to get right.

Still, actors are perfectly on topic when discussing a shared nothing, message based architecture. They're worth mentioning IMHO, to give some historical perspective. And then one can combine both approaches: a pinned thread per core can dispatch its messages to a set of cooperating actors for example.

rat9988 · on Dec 3, 2023

Thread per architecture is a topic in and by itself in computer science. I feel like you are talking about it as if the author invented it to solve its needs.

yaantc · on Dec 4, 2023

Oh I'm well aware the author didn't invent any of this: I'm currently working on a "VPP lite" [1] implementation for an embedded packet processing application and as part of my "SoTA" review learned about TPC history (in networking at least).

I was just trying to answer your question on why agents could be relevant here, nothing more.

[1] https://fd.io/docs/vpp/master

rat9988 · on Dec 6, 2023

Thank you for your answer, and interesting repo by the way, hope your implementation of the lite version goes well :)

samsquire · on Dec 3, 2023

In my multithreaded ringbuffer and barrier I lowered latency from 10,000s-100,000 of nanoseconds to under 100 by alignment to 128 bytes to stop false sharing and pinning threads to an EVEN numbered core. I think hyperthreading interferes with things.

There is a core to core visualiser here.

https://github.com/andportnoy/core-to-core-latency

packetlost · on Dec 3, 2023

"Hyperthreading" (or whatever AMD's equivalent is), to my understanding, works by having multiple instruction streams share a pipeline in a superscalar processor. So if you have 2 processes running on the same core that are dependent on each other, you stall your pipeline more because more instructions are dependent on each other.

That being said, because hyperthreaded workloads share a pipeline and a cache, there might be benefit for memory-constrained applications to pinning pairs of processes to the 2 logical cores on the same physical core if it's highly queue-like and you can process the data in similar numbers of instructions for codestream.

bitcharmer · on Dec 3, 2023

Why? Just disable HT in bios or run with nosmt