There’s some potentially misleading information here. Background: I’ve spent the last 20+ years writing low-latency realtime audio applications, technically cross-platform but focused on Linux.
If you care about low latency on any general-purpose OS, you need to use a realtime scheduling policy. The default scheduling on these OSes is intended to maximise some combination of bandwidth and/or fairness. Low latency requires ditching both of those in favor of limiting the maximum scheduling delay of a thread that is otherwise ready to run.
Measuring how long synchronization primitives take without SCHED_FIFO is illustrative, but only of why, if you care about scheduling latency, you need SCHED_FIFO. There are several alternative schedulers for Linux – none of them remove the need for SCHED_FIFO if latency is important.
It is absolutely not the case that using SCHED_FIFO automatically starves non-SCHED_FIFO threads. Scheduling policy is set per-thread, and SCHED_FIFO will only cause issues if the threads that use it really do “burn the CPU” (e.g. by using spinlocks). If you combine SCHED_FIFO with spinlocks you need to be absolutely certain that the locks have low contention and/or are held for extremely short periods (preferably just a few instructions). If you use mutexes (which ultimately devolve to futexes at the kernel level), the kernel will take care of you a little better, unless your SCHED_FIFO thread doesn’t block – if it doesn’t do that, that’s entirely on you. Blocking means making some sort of system call that will cause the scheduler to put the thread to sleep – could be a wait on a futex, waiting for data, or an explicit sleep.
In particular, this: “This was known for a while simply because audio can stutter on Linux when all cores are busy (which doesn’t happen on Windows)” is NOT true. Linux actually has better audio performance than Windows or macOS, but only if the app developer understands a few basic principles. One of them is using SCHED_FIFO appropriately.
Pro-audio/music creation scheduling is MUCH more demanding than video, and a bit more demanding than games. Linux handles this stuff just fine – it just gives you enough rope to shoot yourself in the foot if you don’t fully understand what you’re doing.
For the same reason, PulseAudio by default uses rtkit to escalate its privileges, set its scheduling policy to SCHED_FIFO, and then drop the privilege. Many people don't understand this and see it as another example of Lennart's software taking over the world (especially on a distribution like Arch that didn't already have rtkit configured for you), but it's essential for good audio latency. (The audio player needs SCHED_FIFO as well; nothing can help if the player itself is locking up.)
> SCHEDULING
> high-priority= Renice the daemon after startup to become a high-priority process. This a good idea if you experience drop-outs during playback. However, this is a certain security issue, since it works when called SUID root only, or RLIMIT_NICE is used. root is dropped immediately after gaining the nice level on startup, thus it is presumably safe. See pulseaudio(1) for more information.
> Takes a boolean argument, defaults to yes. The --high-priority command line option takes precedence.
> realtime-scheduling= Try to acquire SCHED_FIFO scheduling for the IO threads. The same security concerns as mentioned above apply. However, if PA enters an endless loop, realtime scheduling causes a system lockup. Thus, realtime scheduling should only be enabled on trusted machines for now. Please note that only the IO threads of PulseAudio are made real-time. The controlling thread is left a normally scheduled thread. Thus enabling the high-priority option is orthogonal. See pulseaudio(1) for more information.
> Takes a boolean argument, defaults to yes. The --realtime command line option takes precedence.
On modern Linux systems, it is impossible for SCHED_FIFO to cause system lockups. A configurable percentage of CPU cycles is reserved by the (default) scheduler for non-SCHED_FIFO threads, so even a SCHED_FIFO thread that runs "while (true){}" will not be allowed to use 100% of a core's cycles.
RLIMIT_NICE is of zero use with realtime audio.
This documentation is based on (1) the state of Linux circa 2008 or so and (2) a complete misunderstanding of the role of nice values for scheduling (hint: nice is only useful for altering the behavior of non-RT threads, which means threads that do not have hard or semi-hard timing deadlines).
It is a notable problem that there is so much out of date information about the use of nice with audio applications still circulating online. It was never true, and it's even less true today.
Glad to see someone with industrial experience on realtime processing to respond.
> a complete misunderstanding of the role of nice values for scheduling (hint; it's only useful for altering the behavior of non-RT threads, which means threads that do not have hard or semi-hard timing deadlines).
I suspect you're misreading the quoted documentation, I believe the quoted documentation is fine. To recite, "only the IO threads of PulseAudio are made real-time. The controlling thread is left a normally scheduled thread. Thus enabling the high-priority option is orthogonal (to SCHED_FIFO)".
Nice is applied to the controller responsible for other tasks, SCHED_FIFO is applied to the I/O threads, independently.
> On modern Linux systems, it is impossible for SCHED_FIFO to cause system lockups. [...] so even a SCHED_FIFO thread that runs "while (true){}" will not be allowed to use 100% of a core's cycles.
Really? The rtkit documentation mentioned,
> If processes that have real-time scheduling privileges enter a busy loop they can freeze the entire the system. To make sure such run-away processes cannot do this RLIMIT_RTTIME has been introduced. Being a per-process limit it is however easily cirumvented by combining a fork bomb with a busy loop.
> RealtimeKit hands out RT scheduling to specific threads that ask for it -- but only to those and due to SCHED_RESET_ON_FORK it can be sure that this won't 'leak'.
> In contrast to RLIMIT_RTPRIO, the RealtimeKit logic makes sure that only a certain number of threads can be made realtime, per user, per process and per time interval.
So "while (true){}" won't lock your system up, but "while (true) { fork(); }" certainly will, so the rationale of only using SCHED_FIFO on trusted processes is reasonable.
(This is from RH, but the kernel params are generic)
Again, RTKit is a convenience wrapper, nothing more or less. It runs with sufficient privilege to ask for SCHED_FIFO, which by default is not permitted. The more conventional approach is to create a user group and use PAM or equivalent to set the RT priority limit for that group, thus allowing any user in the group to run any app that can then ask for SCHED_FIFO with a priority up to the limit. RTKit was created by Lennart to avoid this type of sysconfig (in favor of a different type of sysconfig).
You're mistaken about the historical ordering of things too. SCHED_FIFO was available long before PulseAudio or RTKit was created.
Yes, you could shuffle the text around as you suggest, but that's far from a "trivial" edit. It's basically equivalent to saying "yeah, the old doc was basically totally wrong, here's a correct version" :)
Pro tip: Hacker News has a "new comment cool-off timer": it waits a minute or two before allowing a comment to be replied to, which gives the poster a chance to edit the comment. Just relax when you can't see a reply button.
I see. "The default values for the Real-time throttling mechanism define that 95% of the CPU time can be used by real-time tasks. The remaining 5% will be devoted to non-realtime tasks (tasks running under SCHED_OTHER and similar scheduling policies). It is important to note that if a single real-time task occupies that 95% CPU time slot, the remaining real-time tasks on that CPU will not run. The remaining 5% of CPU time is used only by non-realtime tasks".
So a fork bomb is as safe as an empty loop (at least from a scheduling perspective), as long as a strict limit is set.
I have noted it down.
> Yes, you could shuffle the text around as you suggest, but that's far from a "trivial" edit. It's basically equivalent to saying "yeah, the old doc was basically totally wrong, here's a correct version" :)
Okay, the main documentation for "man pulseaudio" was a bit better,
> To minimize the risk of drop-outs during playback it is recommended to run PulseAudio with real-time scheduling if the underlying platform supports it. This decouples the scheduling latency of the PulseAudio daemon from the system load a
You see, here it mentions real-time scheduling, not the high-priority scheduling from "man pulse-daemon.conf". So I guess the explanation of "high-priority=" was more or less untouched since 2008, while the code kept moving on.
> You're mistaken about the historical ordering of things too. SCHED_FIFO was available long before PulseAudio or RTKit was created.
I meant that the renicing mechanism was in PulseAudio before RLIMIT_RTPRIO or the rtkit framework. I didn't say a word about SCHED_FIFO, which was indeed POSIX, but less useful without the later improvements in scheduling policies.
Thanks for your patience and answers, all my questions have been explained.
No, it's not still true, and rtkit is just a convenience item that has nothing to do with core kernel policy.
The use of nice to try to adjust scheduling of audio apps (the entire application or specific threads) is completely off-base. If you use SCHED_OTHER, you're making a declaration that timing is not important. You can bump the nice value of a SCHED_OTHER thread, and that will change its behavior among the pool of the other SCHED_OTHER threads, but that's almost never useful in an audio app.
Now, ionice is another (and more complex) story ...
The point of rtkit is to create a convenient framework that enforces SCHED_RESET_ON_FORK and rate-throttling in a single place. So if I'm interpreting your comment correctly, running "while (true) { fork(); }", even without the restriction of SCHED_RESET_ON_FORK, will not lock the system up on a modern system? Is that correct?
> The use of nice to try to adjust scheduling of audio apps (the entire application or specific threads) is completely off-base.
I think the renicing mechanism of "high-priority=" was historical, introduced to PulseAudio long before system resource limits like RLIMIT_RTPRIO or the rtkit framework were put into place. As another document mentioned,
> To minimize the risk of drop-outs during playback it is recommended to run PulseAudio with real-time schedulin [...] Alternatively, if the risk of locking up the machine is considered too big to enable real-time scheduling, high-priority scheduling can be enabled [...]
So while renicing is still here, PulseAudio prefers SCHED_FIFO nowadays, and the renice is only a fallback.
So, can you point out what exactly is wrong in the first documentation I quoted in my first comment? Certainly, the paragraph "high-priority= Renice the daemon after startup to become a high-priority process. This a good idea if you experience drop-outs during playback" is outdated, but if one removes the word "Renice" and moves this paragraph from "high-priority=" to "realtime-scheduling=", it's still correct.
Note that by default kernel.sched_rt_runtime_us limits RT usage to 95% of the whole system, not an individual core.
If you have 1 RT process using 100% of a core on a multicore system it won't trigger the limiting, which can cause all sorts of problems (like network interrupts not getting serviced if your process happened to hog the core used by network interrupts).
This can be worked around by writing NO_RT_RUNTIME_SHARE to /sys/kernel/debug/sched_features, which then makes the 95% limit apply on each logical core, not the system as a whole.
Off topic, but thanks to you and the JACK team! I think it's one of the most underrated pieces of software around. I have used it many times over the years and I'm constantly amazed at how good it is, even compared to industry standard proprietary stuff.
Also, a yielded thread has absolutely no reason to be woken up with low latency, and this has nothing to do with the supposed scheduler quality. I'm not sure a kernel blindly prioritizing otherwise CPU-bound threads just because they yielded would give good results in the general case, especially outside of microbenchmarks. If you want low wake-up latency, you need a proper design, not mere reliance on magic that does not exist.
Plus, highly contended critical sections should really, really be avoided. So I'm not sure the author's benchmark even makes any sense: if all of your workload is serialized, use a single thread.
Windows automatically boosts the prio of threads playing audio, basically treats them as realtime in its prio hierarchy. What I expect GP is talking about is minimum latency achievable, i.e. that with focused tuning, you can outperform Windows.
Regular users of not so tuned apps probably experience this differently because Windows tries to do the tuning automatically and Linux doesn't.
The Windows scheduler has other issues, like random prio boosting to escape from prio inversion.
Right, I'm aware of the issues on Windows. I was asking about XNU-based platforms which are known for having very low latency and excellent realtime audio perf out of the box.
(Granted, the out of box behavior and the achievable latency are two separate issues)
> Right, I'm aware of the issues on Windows. I was asking about XNU-based platforms which are known for having very low latency and excellent realtime audio perf out of the box.
I have never seen a mac box achieving reliable lower latency than Linux or Windows on the same hardware. Macs are nice for the plug'n'play experience - even though more often than not you still want to install your soundcard's driver.
I'm not sure that posting 9 year old benchmarks is particularly useful, particularly when those results have been heavily debated.
If we're talking pure latency then UA, Focusrite, Apogee and Avid have lower RTL for the same buffer size in macOS. RME is identical on both platforms and I'm unsure about antelope.
Regarding loads before under-runs that's a very tough thing to test, as it varies between products. GENERALLY: Comparing Digital Performer between platforms heavily favours macOS. Comparing Pro Tools favours macOS (currently). Comparing Reaper favours windows slightly. Cubase favours Windows... etc...
There's also load type that factors in, as with macs you get a specific build with minimal leeway. Windows builds can be planned for a given build (highly serial, highly parallel, RAM hungry, Latency performant etc...). It's not too hard to build a test suite that favours macOS everywhere on the same machine, likewise for Windows.
But to the original topic, latency, macOS is generally superior in this regard when looking at RTL specs for the same device.
> Blocking means making some sort of system call that will cause the scheduler to put the thread to sleep – could be a wait on a futex, waiting for data, or an explicit sleep.
Speaking of which, do you know of a tool that would let you ensure that e.g. your RT threads won't do bad things, short of putting a breakpoint on every syscall-like thing? I wonder if at that point the audio community shouldn't look into developing some clang-based static analyzer to enforce that.
There have been several such tools over the years - I regret that off the top of my head I don't recall the names of any of them.
Clang can be used for this but AFAIK you need to annotate functions/methods to indicate whether or not they are intended to be RT-safe. This gets complex if you have code that can be used in an RT context and in a non-RT context. The annotation part would be relatively simple in, say, a small audio player. In a DAW, it would be a substantial and complex task.
I've done things before where you shim in a library (using LD_PRELOAD) between the executable and the C standard library, and blacklist function calls which are potentially blocking, (abort if they get called in the wrong context).
It's usually sufficient to do this in testing, and then run the software as-is in production.
The claim that "the Linux scheduler is bad" seems wrong.
First of all, a program that calls sched_yield() for synchronization is simply broken, since doing so can cause 100% CPU usage (if the other thread is stopped at an unfortunate place), so the only correct method among the ones benchmarked is std::mutex.
When using std::mutex (the only correct approach), Linux and Windows both seem to have similar average performance, but Windows has 10x latency, so Linux is better.
sched_yield() isn't being used for synchronization, it's to yield the calling thread before its timeslice is exhausted because it's potentially preventing the lock-holder thread from getting scheduled to run and releasing it.
Spinlocks assume the critical section is fast and non-blocking, so when the spinning lasts longer than expected, throwing in a yield can help by handing some CPU time to something else.
The lock-holder might be getting scheduled on the same CPU and waiting for this thread to yield, or there might just be excessive threads runnable right now and burning cycles pointlessly spinning isn't helping get the lock-holder to run and release the lock any sooner.
In a situation where you have exactly as many threads as there are CPUs, and they've been assigned their own respective CPUs, and there's nothing else running on the system, I wouldn't expect the yield to help. But that ideal isn't how things generally look in practice.
If a spin lock ends up in a call to sched_yield, it means that it is already outside its intended use case (uncontended or nearly uncontended), so it is a bit pointless to optimize this path or expect any sane scheduling behaviour. If contention is probable, it is better to fall back on a futex after a few[1] busy spins.
This is what pthread_mutex does on glibc by default (and pretty much every decent OS-provided mutex does the same).
[1] How much a 'few' is, of course, is often application-dependent.
Sure, but sched_yield is strictly worse than using a futex (except that you'll need to check for waiters on an unlock which makes it slightly more expensive).
Ok, so you're arguing that userspace should use mutexes and not spinlocks.
Which I agree with, most of the time userspace spinlocks don't fit well.
But TFA is clearly comparing the two, and observing a variety of spinlock implementations with sched_yield() demonstrating an interesting positive effect on the spinlocks as tested.
Actually no, I wouldn't make that claim; while a futex-based adaptive mutex is a very good default, spinlocks can still be appropriate for some applications.
What I'm saying is that if your use case is such that you expect enough contention to consider using TATAS (which is actually a pessimization in the uncontended case) and to look into optimizing sched_yield, probably a spinlock is not appropriate in the first place.
Edit: hence a spinlock shouldn't bother with yield and should just do a tight xchg spin (I haven't measured it in a while, but I've heard rumors that pause can severely harm acquire latency on very recent CPUs, as it will quickly put them into a deeper power-saving mode than in the past).
My experience is that prior to Windows 8.1 my computer was completely unusable while compiling C++ on all cores, so bad that I got a tool to limit affinity to all but one core.
I've "always" been able to do the same on a Linux system, even filling up HT threads without noticing anything. I've always been impressed by the Linux scheduler both for IO and CPU.
I’m not convinced the measurements are really meaningful in this article. The author seems to be running a bunch of threads, measuring the time from when one thread starts unlocking the lock to when another thread gets it and calling this “idle”. I can’t find any mention of how many threads there are, how many CPUs, or what else the system is doing.
So this is measuring a mess of a few things:
For mutexes (assuming all the implementations actually make syscalls for unlocking), this has something to do with how long it takes for the scheduler to activate the waiter and run it, plus how long the scheduler waits because the system is doing something else or it wants to try to avoid bouncing a task between CPUs.
For anything that does not involve the kernel in unlocking, this is really just a measurement of the pauses in execution of long-running threads that continuously hog the CPU. If the thread is also calling sched_yield(), the results are erratic, since sched_yield() more or less means “I don’t want to run now, but I’m not actually waiting for anything visible to the scheduler and, um, run me again soon maybe?” This is almost always the wrong solution. Other than that, this test has essentially nothing to do with which spinlock is in use, unless the spinlock itself wastes time failing to figure out that it’s ready.
From old memory, the Windows scheduler is (was?) mostly based on per-thread priorities, highest priority wins, where the priorities are occasionally automatically adjusted. So, if you set a high-ish priority, you get to hog the CPU. In contrast, Linux tries quite hard not to let normal (non-RT) threads hog the CPU, so a thread that effectively runs continuously measuring pauses will detect significant pauses in which Linux decides that the rest of the system deserves to run too.
I agree that the post needs to clarify the number of threads, what thread priorities were assigned, and what measures were taken to associate the threads with logical CPUs (affinities).
I think your characterization of sched_yield() is a bit off. The man page is clear:
"sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run."
and
"If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield()."
In particular, it does not say "I’m not actually waiting for anything visible to the scheduler and, um, run me again soon maybe?".
So assuming that the threads are running at some relatively high priority, I'm struggling to see how delays on the order of 60ms could arise.
EDIT: reading further:
"sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken."
Indeed. Linux’s SCHED_OTHER doesn’t have static priorities.
There was a half-baked design for locking that kind of worked 20 years ago: try to acquire the spinlock and, if it’s contended, yield. On a single CPU, that probably means that the lock holder gets to run, and if they drop the lock quickly, then the yielding thread will (on an otherwise mostly idle system) get to run soon.
This was a dubious design 20 years ago, although, admittedly, OS locking primitives were worse back then. But now, with multiple CPUs on a system, the whole scheme falls apart — the lock holder might be on a different CPU and yield() has no effect on how quickly the lock is released.
So, in most cases, yield() really does mean “I don’t know what I’m doing, but I don’t want to run right now.” Anything the kernel does in response is, at best, a wild guess.
With the current Linux scheduler, yield() means "I don't need to run and won't need to in the future either." and Linux places it at the very END of ALL SCHED_OTHER processes. Even if they are at Nice 20.
So it is no surprise that latencies are high, because a yielded thread has to wait for absolutely every other thread in the system before it gets a chance to run.
I think the answer, as always, is to measure on your own workloads. I’ve seen real-world production code get significant speed ups by switching to spinlocks from mutexes.
Very interesting post though. I’ve never considered the idle latency before.
Windows has process and thread priorities, including so-called "real-time" ones... but there seem to be no mentions of them in the article, as opposed to Linux. Why?
My take on this article is that it isn't worth implementing your own spinlock. The OS can take advantage of its implementation details while you can't portably do the same, and the OS implementation changes over time and varies between systems.
Even if you use _mm_pause() to let the CPU know it's in a spinlock, in the eyes of the OS it still looks like a busy thread; it can't be distinguished from a non-spinlock computation-heavy thread.
yep, I was assuming they at least didn't have any frequency scaling or powersave governor junk going on but you're right - it wasn't even mentioned so...
You're getting some incomplete answers to your question. Both a mutex and a spinlock are used to guarantee that a set of instructions is only executed by a single thread of computation at once (with some variation; some mutex types permit multiple readers, for example).
The difference is in the implementation of the mutual exclusion. A mutex generally involves a system call (an assembly routine that changes the process privilege level and cedes control to the host operating system), though implementations often spin briefly before making the call. A spinlock literally "spins" in a loop until an atomic flag or semaphore indicates that it can enter the protected region; when it can, it atomically sets the flag so that it has exclusive access. Because a spinlock does not interact directly with the OS scheduler, it needs a pause instruction or a yield syscall at some point in its loop (or every few iterations). Failure to do so introduces pathological scheduling problems, as outlined in the article.
Spinlocks are sometimes used because the odds of a collision are believed to be very low and the cost of a syscall is perceived by the developer as being high. However, it's something that should only be used if you know what you're doing, and its behavior is subject not only to the OS, but also to the hardware you run on.
These are words that don’t necessarily mean different things. In the broadest sense of mutex and the strictest sense of spinlock, every spinlock is a mutex.
Spinlock is a mutex algorithm based on spinning until the mutex is available.
Mutex is a MUTual EXclusion abstraction. A spinlock is one implementation. But spinlocks are problematic (for lots of reasons) so most of the time when someone says they’ve got a mutex it’s actually an adaptive lock:
- It has a fast path to acquire the lock.
- It may or may not be fair.
- It has a slow path that may enqueue the thread and deschedule it, but that slow path may have some adaptive spinning for a finite amount of time.
A spinlock is a kind of mutex ("mutual exclusion") wherein a condition is checked as fast as possible in a loop. This is useful when the expected wait times are short, as you don't have to pay the context-switch penalty. It is typically used when you have real parallelism with multiple CPUs/cores.
A mutex can also refer to a particular group of "mutual exclusion" methods wherein the current process/thread is suspended until the condition is true.
Additional note that mutual exclusion is a concept that pops up a lot when dealing with concurrent processing (think threads, multi-process type architectures).
Typically a mutex / spinlock is implemented to prevent race conditions.
One can see this at a higher level in Go, where goroutines reading/updating a map, when run with -race, show where you have to use a Mutex to lock that map for access, R or RW. Higher level, but pretty simple to see.
One can see this at an even higher level in Rust, where you don't even need a "race checker" to avoid data races in safe code - the compiler will force the use of patterns like Mutex<> or Rwlock<> (or others) for stuff that might be accessed concurrently in ways that would otherwise break mutual exclusion.
Not really, in Rust's case. It's the general borrow checker which handles this.
Since in Rust it is illegal for a value to be referenced for writing in one place and referenced for anything else in any other place, regardless of concurrency/parallelism, Rust code is data-race free by design.
I still think that GP's comment was a non-sequitur.
Rust has "interior" mutability, meaning that shared references are not true 'read only' references; they can be used for writing if the type supports such an operation. What makes Rust data-race free is the combination of 'interior' mutability and the Send and Sync traits, which control moving or referencing objects across threads.
Rust's compiler checks a mathematical proof that, if your types are valid (e.g. you didn't do anything bad during `unsafe`) you do not have any data races. It is a guarantee about your program per se.
Go's race detection says (with a high probability) you do or do not have any data races in the paths actually taken during some execution of your program. It is a (probabilistic) guarantee about how you ran your program.
It's not a mathematical proof in the literal sense (yet), because there are known soundness bugs in Rust that are yet to be addressed; moreover the interaction of borrow checking (including interior mutability) with the traits system is poorly understood and has not been proven to be correct even in a simplified model. It should yield a meaningful level of checking, and if nothing else, the ergonomics is way better than just running a static analysis on C/C++ code (setting aside the question of whether current static analysis tools can even check for possible data races); but you shouldn't rely on it to that extent.
I don't program Rust regularly or track its status in that much detail, so thanks for the correction.
Do you have some links about the known soundness issues? I was under the impression that although it hadn't yet been proven it was still a medium-term goal, and any soundness bugs were considered fairly major.
(And yeah, "mathematical proof" is strong regardless since e.g. LLVM is also in there somewhere. But it captures the idea more closely than any comparison of "compile vs. runtime" or "static vs. dynamic" does.)
https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Ao... is the list of soundness bugs. They are considered major, but they’re more compiler bugs than something broken in the language itself. They are taken seriously, but bugs will always happen. Some need upstream fixes, too, so that can take a while.
Mutex and Spinlocks are abstractions / programming language constructs used to solve the following class of problems:
Problem
1. We have a process with multiple threads of execution (e.g. A and B).
2. A and B will at some point act on the same data or object state.
3. Both A and B can't operate on the same data at the same time.
Mutexes function more or less as locks. They lock up your data to prevent unwanted race conditions. In multithreaded environments, a thread of execution has to acquire a mutex prior to accessing the data. Once it's done operating on the data, it releases the mutex so that another thread of execution can acquire it.
Mutexes are again, simply locks. But not all locks are the same. Locks come in different shapes and sizes and work in different ways. Same with mutexes, they can have different semantics.
Spinlocks are a kind of mutex that cause threads trying to acquire them to spin or loop around while waiting for the resource to be available.
Thanks. It would be great to eliminate "can anyone", "can someone", and ELI5 from HN in 2020. We're literally conversing through the world's largest connected and searchable knowledge base (the internet), and don't need to encourage lazyweb.
I actually come here for explanations like these from experts and HN lurkers instead of reading the pop-internet article that comes up first on google.
I know what both a mutex and a spin lock are because I’m a developer myself but very much enjoy having my mind blown by someone on here who may have actually written the scheduler for Windows or Linux or be an expert of some sort.
Real experts aren't waiting on the internet to be the first response to lazyweb: they publish blog posts, papers, books, and tomes of knowledge, which are peer-reviewed and revised. Experts who have something valuable to add will eventually speak up, without being asked. I guarantee that such low-quality questions primarily get low-quality responses. It's the Tragedy of the Commons of discussion sites.
I recently had a beginner question about actors and persistence addressed by Carl Hewitt here. I think you underestimate the kindness of extremely knowledgeable people (and HN).
Since it's covered quite simply and very visibly in the article: yea, this is one ELI5 I'd prefer to not exist. At least skimming TFA is a reasonable pre-requisite IMO.
When the article assumes reader knowledge (especially very specific stuff), though... there are quite a few very good responses here, and it'd be a shame to reduce them. There are some extremely knowledgeable people lurking around here, and they're sometimes doing so for this kind of reason - to spread the knowledge. Plus context-specific Q&A can be far more effective as a learning tool than random googling / other articles / etc.
A typical mutex has essentially two very different costs depending on whether the lock is already taken by another thread:
- Uncontended case + mutex: no syscall, fast path is basically the cost of the memory barrier.
- Uncontended case + spinlock: no syscall, fast path is basically the cost of the memory barrier. Essentially the same cost as a mutex, sometimes a tiny bit cheaper, but _if your lock isn't contended in the first place the difference will be very hard to measure, since it isn't a bottleneck_.
- Contended case + mutex: the thread waits in the mutex's wait list without consuming CPU. Hyper-threading may also immediately hand the core to another thread, which may be lucky enough to make progress. The OS mutex may do a bit of capped spinning before waiting in the scheduler.
- Contended case + spinlock: anything goes. CPU gets consumed just for spinning. The wrong thread might get priority. If you are lucky you put a PAUSE instruction in the loop (HLT is privileged and unavailable to user code) to mitigate the complete disaster that spinlocks are; perhaps it even does something. Your consolation: no syscall or thread pausing... because they are spinning.
tl;dr: spinlocks are worse in the contended case and essentially the same performance in the uncontended case compared to your typical OS mutex. SPINLOCKS WON'T YIELD THE EXPECTED BENEFITS; USE REGULAR OS MUTEXES.
A problem with pthread mutex and std::mutex is that they are often very large, which is problematic for very fine-grained locking; a spinlock needs only one bit and can be embedded in other objects (say, a pointer) as long as you can spare that bit.
Of course you can roll your own proper mutex if you want to get your hands dirty with futexes. IIRC you only need two bits (one waiter bit and one locked bit).
You can do it with a single bit: read the whole unit (e.g. 64 bits), flip the bit you need, and then do a compare-exchange which tries to replace the whole retrieved value with the modified one.
That obviously requires you to read the current state upfront, whereas a plain compare-exchange from 0 to 1 on 32/64 bits avoids that.
Interesting!
Now I'm interested whether there has been a use case for the single-bit spinlock in practice. It seems appropriate if writes aren't too numerous (leaving speed concerns aside).
Or a case where std::mutex was too large?
I've always wondered how the wait list was implemented.
Think about a concurrent hashtable using chaining for collision resolution; distinct bucket chains can be updated concurrently, so you need a per bucket lock but you do not want a full std::mutex as it would increase the size of the bucket table by a large integer multiple. Instead, borrowing one bit from the chain pointer (you can easily arrange the least significant bit of the pointer to be always zero) is pretty much free.
Note: I wrote the above off the top of my head; I haven't actually had the need to implement a concurrent hashtable yet.
I believe the supposed beneficial case of blocking vs spinning can't be captured well by those categories, as it's "contended, but not for long" and "contended, but for longer".
That's the reason for the "may have a bit of capped spinning before waiting in the scheduler".
Honestly I'm not so sure about the case for spinning; in the x264 source you can see CRITICAL_SECTION is tuned to have a spincount of 0. This is a very optimized, multi-threaded app. They didn't want any spinning.
In my own attempts I've been unable to measure the impact of the spinlock count, and took spincount = 40 like in https://trac.webkit.org/browser/webkit/trunk/Source/WTF/wtf/... ...who also took measurement from elsewhere. The default is 4000 on Windows!
I don't argue spinning isn't valuable; what I'm saying is that it should never be the bottleneck (if it is, your task size really is too small).
After all, if you spend all your time in synchronization, that means you probably need to have bigger tasks, not more clever synchronization primitives.
Besides OSes, where they are often needed, spinlocks can have their place in a few very specific applications (most of the time isolated on dedicated computers), but I agree that too many people are trying to play with them now without enough knowledge.
I'm curious what is going on with the "Idle Time" measurements. Does anyone have any theories?
My guess is that if the thread gets descheduled inside the lock() spin loop (either by an explicit call to yield() or otherwise) then the "Four longest idle times" reflect the fact that all of the worker threads have remained descheduled for an extended period. But assuming that the threads have elevated priority (why would you use spinlocks on non-high-priority threads?) this seems implausible on an 8 core machine.
I would be curious to know what the results are if isolcpus was used and the workers were explicitly assigned to those CPUs.
The whole "idle time" vs. "longest wait" distinction is suspect and it's surprising they don't go to more effort to justify it.
What "longest wait" measures is (b_i - a_i) in:
Time a_0
lock
Time b_0
unlock
Time a_1
lock
Time b_1
unlock
...
What "idle time" measures is (b_i - a_i) in:
lock
Time a_0
unlock
lock
Time b_0
Time a_1
unlock
lock
Time b_1
...
At least for the spinlock variants, "unlock" should be an extremely cheap operation (just a store for most variants), which means that those two sequences ought to be essentially identical in terms of timing: it ought to be possible to move the "Time a_i"s across the "unlock" operations with a timing impact less than a microsecond, i.e. the two sequences ought to have essentially the same timings.
The fact that the timings are supposedly different on a millisecond scale is therefore quite suspicious.
Perhaps the chosen clock ends up calling a syscall, which can end up in the scheduler on syscall return occasionally and throw off the measurements that way? Or just the fact that there's one extra syscall in the critical section sometimes causes other threads to make different yielding decisions which throws off the scheduling?
More generally, it would have been great to see a deeper investigation into scheduling behavior, which is possible on Linux using the tracing infrastructure and visualizing with different tools like Hotspot.
Getting a worst-case delay of ~1.5ms with a naive ticket_spinlock that can yield is really not surprising, to be honest. All it takes is for patterns like:
1. 3 threads pull their tickets in order.
2. The middle thread yields and something else is chosen to run by the scheduler, or perhaps the scheduler decides not to run anything for a short while on the off-chance that something else becomes runnable, which is a fair choice because after all, you yielded.
3. The last thread is then blocked by the middle thread.
It just boils down to the fact that "yield" is a bad primitive. You should fall back to futex-based synchronization instead so that the kernel has a better idea of what you're trying to do. The article comes to the same conclusion anyway, in the form of "just use std::mutex".
It's not entirely clear from the article, but my understanding was that "longest wait" maintains per-thread timestamps, i.e. the time to lock() within a single thread, whereas "idle time" uses a single global atomic timestamp that measures the time from when the last thread unlocks until the next thread locks -- even if the unlock() and subsequent lock() are on different threads.
Even with isolcpus, there are some kernel threads that will preempt user threads (kworker/rcuos/softirq) unless you set your priority higher than those with sched_fifo/rr
Why isn't the conclusion to just use the mutex at least on Linux? It seems to have both consistently low latency and great throughput. What else can you get from the spinlock?
> I hope to have convinced you to avoid spinlocks, even for very short sections. Spinlocks are only a benefit if you really really care about the fast path and are certain that the lock will essentially always be free to take. On my computer the fast path on a spinlock is roughly 7ns and on a mutex it’s roughly 14ns, so both of those numbers are very fast. But when spinlocks go bad, the scheduler can be really confused, at least on Linux.
I did read it till the end but missed that bit about the fast path somehow. I struggle to see a case where you care enough about those 7ns; the Windows scheduler was a different story, though.
> What these benchmarks have in common is that they measure code in which there is nothing else to do except fight over the lock. In that environment the only thing that makes sense is to use a spinlock. Because if you go to sleep, all that will happen is that some other thread will wake up that will fight over the same lock.
That said, his conclusion is in fact just as you stated, if you read in between the lines. (This paper is really about the scheduler, so it's not exactly that conclusion.)
IMO mutexes, semaphores, and spinlocks should be used only in special cases (e.g. games or other high-perf logic where you need to maximize performance by a fixed percentage), because they are anti-patterns for scalability. They only help to scale up to a point, after which the threads either spend all their time waiting for each other to release the locks or can never acquire the locks because demand for that shared memory is too high. There are exceptions, for example if you have unlimited reader threads/processes and a fixed number of writer threads/processes, but even in those cases I feel that shared memory locks add invisible bottlenecks to scalability as the code evolves over time.
For most of my use cases, I prefer to keep processes/threads parallel and communicate via IPC channels/sockets. This makes it easier to write code that can scale without limit.
For most use cases, I don't see much point in writing scalable code that you know only scales up to a certain point and then needs to be rewritten.
Avoid using shared memory if you have an embarrassingly parallel problem.
Approaches using IPC/sockets are significantly more complicated than shared-memory parallelism in most programming languages. Sockets at least are also an absurdly wasteful communication mechanism in low-concurrency situations, compared to direct shared memory access.
If you have a simple REST web server, single-thread is simply not a realistic option. The next simplest thing is to do shared memory parallelism, which works perfectly for a few dozens to low hundreds of threads. That already covers a lot of scale for many realistic problems. Moving from that to a distributed system design is a significant jump in implementation complexity, and will likely increase per-request latency with most straight-forward implementations.
>> which works perfectly for a few dozens to low hundreds of threads
It's a lot more complicated that than. It depends heavily on how many threads are writing to the shared memory and how often. If the workload is write-heavy from many threads then then it can really add up. Also, if there are multiple shared memory locations which all use mutexes, it adds up as well. Shared memory offers diminishing returns as the logic becomes more complex.
Often, it's better to simply design the system in such a way that it removes or at least minimizes interactions and data sharing between different threads or processes. With such designs, often the overheads of IPC communication become very low.
Some engines like Node.js are focused around IPC as the main mechanism for sharing data between different processes. It produces relatively elegant code which is easy to follow and this design makes it relatively easy to identify the bottlenecks.
I agree that the approach I was suggesting is probably not ideal for low-concurrency situations (like games as I mentioned in my last comment) but I still think that shared memory access mechanisms are not elegant. That said, I find that programming solutions which try to squeeze every last drop of performance out of the hardware are rarely elegant architecturally.
In my experience it is very easy to get into situations where using only mutexes is impossible. As soon as you need a lock per object and access two objects, you have to make sure to only acquire one lock at a time. The end result is that you are effectively using the stack as a message queue that can only hold one element.
The solution with multiple locks is to always take them in the same defined order on any code path - that guarantees that you can't get into deadlock situations.
There’s some potentially misleading information here. Background: I’ve spent the last 20+ years writing low-latency realtime audio applications, technically cross-platform but focused on Linux.
If you care about low latency on any general purpose OS, you need to use a realtime scheduling policy. The default scheduling on these OS’s is intended to maximise some combination of bandwidth and/or fairness. Low latency requires ditching both of those in favor of limiting the maximum scheduling delay of a thread that is otherwise ready to run.
Measuring how long synchronization primitives take without SCHED_FIFO is illustrative, but only of why, if you care about scheduling latency, you need SCHED_FIFO. There are several alternative schedulers for Linux – none of them remove the need for SCHED_FIFO if latency is important.
It is absolutely not the case that using SCHED_FIFO automatically starves non-SCHED_FIFO threads. Scheduling policy is set per-thread, and SCHED_FIFO will only cause issues if the threads that use it really do “burn the CPU” (e.g. by using spinlocks). If you combine SCHED_FIFO with spinlocks you need to be absolutely certain that the locks have low contention and/or are held for extremely short periods (preferably just a few instructions). If you use mutexes (which ultimately devolve to futexes at the kernel level), the kernel will take care of you a little better, unless your SCHED_FIFO thread doesn’t block – if it doesn’t do that, that’s entirely on you. Blocking means making some sort of system call that will cause the scheduler to put the thread to sleep – could be a wait on a futex, waiting for data, or an explicit sleep.
In particular, this: “This was known for a while simply because audio can stutter on Linux when all cores are busy (which doesn’t happen on Windows)” is NOT true. Linux actually has better audio performance than Windows or macOS, but only if the app developer understands a few basic principles. One of them is using SCHED_FIFO appropriately.
Pro-audio/music creation scheduling is MUCH more demanding than video, and a bit more demanding than games. Linux handles this stuff just fine – it just gives you enough rope to shoot yourself in the foot if you don’t fully understand what you’re doing.