Real-time audio programming 101: time waits for nothing (2011) (rossbencina.com)
236 points by ssfrr 8 months ago | 121 comments



(2011) But a great summary and mostly evergreen

One practical reality it doesn't share is that your audio processing (or generation) code is often going to be running in a bus shared by a ton of other modules and so you don't have the luxury of using "5.6ms" as your deadline for a 5.6ms buffer. Your responsibility, often, is to just get as performant as reasonably possible so that everything on the bus can be processed in those 5.6ms. The pressure is usually much higher than the buffer length suggests.


When I was working on desktop audio software, our rule of thumb was to keep CPU occupancy below 30%. More than that, and you'd be sure to get hitching in your audio stream. (This factors in thread contention, other background tasks stealing CPU cycles, file system pauses...)

A much different experience from embedded programming, where 99% occupancy is no problem at all.


In the context of the article, I assume that the driver has arranged sufficient buffering so that the jitter in scheduling across a bus (PCI, USB) is masked with respect to the client code. But you are correct that communications overhead can cut into your compute time if it is not addressed. Some audio APIs (e.g. CoreAudio) allow for configuring the buffering margins, so you can trade off buffer latency against available audio compute %. There is a whole world of debate surrounding how to best schedule audio compute (e.g. interrupt driven vs. delay-locked high precision timers).

Assuming the context is a desktop OS (which is the context of TFA), I think that the main source of non-determinism is scheduling jitter (the time between the ideal start of your computation, and the time when the OS gives you the CPU to start the computation). Of course if you can't arrange exclusive or max-priority access to a CPU core you're also going to be competing with other processes. Then there is non-deterministic execution time on most modern CPUs due to cache timing effects, superscalar out of order instruction scheduling, inter-core synchronisation, and so on. So yeah, you're going to need some margin unless you're on dedicated hardware with deterministic compute (e.g. a DSP chip).


No, I'm just talking about the common case where you have some other stuff going on before or after your own audio processing code: a software instrument your framework provides, some AudioUnits or gstreamer nodes adding other effects, the whole device chain in the DAW that's hosting you, etc. All of those things need to get done within your window so you can't use the whole thing for yourself.

Most people learning audio programming aren't making a standalone audio app where they do all the processing, or at least not an interesting one. They're usually either making something like a plugin that ends up in somebody else's bus/graph, or something like a game or application that creates a bus/graph and shoves a bunch of different stuff into it.


Definitely good to keep in mind. The thing that I think is really interesting about audio programming is that you need to be deterministically fast. If your DSP callback executes in 1ms 99.99% of the time but sometimes takes 10ms, you’re hosed.

I would love to see a modern take on the real-world risk of various operations that are technically nondeterministic. I wouldn’t be surprised if there are cases where the risk of >1ms latency is like 1e-30, and dogmatically following this advice might be overkill.


> dogmatically following this advice might be overkill

It depends on your appetite for risk and the cost of failure.

A big part of the problem is that general purpose computing systems (operating systems and hardware) are not engineered as real-time systems and there are rarely vendor guarantees with respect to real-time behavior. Under such circumstances, my position is that you need to code defensively. For example, if your operating system memory allocator does not guarantee a worst-case bound on execution time, do not use it in a real-time context.
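
To make that concrete, a minimal C++ sketch of the usual defensive pattern (the names here are illustrative, not from any particular framework): do every allocation up front on a non-real-time thread, and let the callback touch only fixed-size memory.

    #include <cstddef>
    #include <vector>

    struct VoiceState { float phase = 0.0f; float freq = 440.0f; };

    struct Engine {
        std::vector<float>      scratch;  // sized once, before the stream starts
        std::vector<VoiceState> voices;   // fixed maximum polyphony

        // Non-real-time setup: the only allocations happen here.
        void prepare(std::size_t maxBlockSize, std::size_t maxVoices) {
            scratch.resize(maxBlockSize);
            voices.resize(maxVoices);
        }

        // Real-time callback: no new/delete/malloc, no resize, no locks.
        void process(float* out, std::size_t frames) {
            for (std::size_t i = 0; i < frames; ++i) out[i] = 0.0f;
            // ...synthesis that reads and writes only `voices` and `scratch`...
        }
    };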


I don't mean to devalue the advice here. I think it's spot on, and I unreservedly recommend this article to folks who want to learn about writing reliable audio software.

I think in essence I'm repeating the comments of Justin from Cockos, which you summarize [1]:

> It is basically saying that you can reduce the risk of priority inversion to the point where the probability is too low to worry about.

In that comment you also say:

> 100% certainty can’t be guaranteed without a hard real-time OS. However 5ms is now considered a relatively high latency setting in pro/prosumer audio circles

Which I interpret as acknowledging that we're already forced into the regime of establishing an acceptable level of risk.

My point is that I would love to see more data on the actual latency distributions we can expect, so that we can make more informed risk assessments. For example, I know that not all `std::atomic` operations are lock-free, but when the critical section is so small, is it really a problem in practice? I want histograms!

[1]: http://www.rossbencina.com/code/real-time-audio-programming-...
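
For what it's worth, the lock-free question is at least cheap to answer per platform, and a crude histogram is easy to gather outside the audio thread. A toy sketch (it times a trivial atomic store, so the numbers mostly reflect clock overhead; treat it as illustrative only):

    #include <array>
    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        // Is the atomic you rely on actually lock-free on this target?
        std::printf("atomic<float>  lock-free: %d\n", std::atomic<float>{}.is_lock_free());
        std::printf("atomic<double> lock-free: %d\n", std::atomic<double>{}.is_lock_free());

        // Crude histogram of a tiny critical section, bucketed in microseconds,
        // to look at the tail rather than the mean.
        std::array<long, 64> hist{};
        std::atomic<float> shared{0.0f};
        for (int i = 0; i < 1000000; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            shared.store(static_cast<float>(i), std::memory_order_release);
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            ++hist[us < 63 ? us : 63];
        }
        for (int b = 0; b < 64; ++b)
            if (hist[b]) std::printf("%2d us: %ld\n", b, hist[b]);
    }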


> If your DSP callback executes in 1ms 99.99% of the time but sometimes takes 10ms, you’re hosed.

I tend to agree, but...

From my recollection of using Zoom: it has this bizarre but workable recovery method for network interruptions. Either the server or the client keeps some amount of the last input audio in a buffer. Then if the server detects connection problems at time 't', it grabs the buffer from t - 1 seconds all the way until the server detects better connectivity. Then it starts a catch-up race, playing back that stretch of the buffer to all clients at something like 1.5x speed. From what I remember, this algo typically wins the race and saves the client from having to repeat themselves.

That's not happening inside a DSP routine. But my point is that some clever engineer(s) at Zoom realized that missing deadlines in audio delivery does not necessarily mean "hosed." I'm also going to rankly speculate that every other video conferencing tool hard-coupled missing deadlines with "hosed," and that's why Zoom is the only one where I've ever experienced the benefit of that feature.


The context for this article is writing pro audio software, where that kind of distortion would generally be as bad as a dropout, if not worse.


Yeah, 5ms is the threshold for noticeability as far as latency in pro audio. It's like frame rate for pro gamers. The problem is that your target user is highly specialized, outside the norms by a large margin. What makes audio even more difficult is that sub-ms issues can cause phase and frequency distortion that can become even more noticeable than latency alone.


1. you do not need to be a highly specialized target user to detect latency between pressing a key on a MIDI keyboard and the corresponding sound being produced.

2. 3ms is typical in-air latency between a typical DAW user and their near-field monitors, so claims about sensitivity to times much lower than 5msec should be taken with some skepticism

3. In live contexts, many drum + bass pairings have more than 10ms of air latency between them, so ditto #2

4. On the other hand, no good reason to add to latency

5. For performance purposes, jitter is much worse than latency. Pipe organ players rapidly learn to deal with even whole seconds of latency, but almost nobody can deal with jitter (essentially, variable, unpredictable latency)

6. There are no sub-ms issues that will cause phase and frequency distortion. Those come from DSP errors, not handling of latency, which is just about always a constant, fixed feature of the data signal path. You may be thinking of stuff like comb filtering, but this is not related to the latency in the signal path in a correct setup.


The "MIDI timing" problem was often a combination of MIDI traffic limitations with limited CPU in the receiver.

What started off as a four note chord would be smeared out a little by MIDI, especially in the early days until everyone worked out that putting MIDI for an entire studio down a single cable was a bad idea.

Then you'd get some more smearing in the target synth CPU as the incoming notes were parsed. Then perhaps some more delay for each note, because it took a while to send trigger and pitch messages to the hardware. Even more if there were software envelopes involved and they had to be initialised.

This is still a problem with VSTs, on a smaller scale. There's some finite amount of processing that has to be done before sound starts being generated. Usually it's not very much, but there's always the possibility that two notes that should start in the same 5ms buffer slot will be spread across two of them because one note is just a little too late.

This isn't as objectionable as glitching, but it can still affect the timing feel, and - depending on the patch design - cause phasing effects between the notes.
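
For a sense of scale on the single-cable point, here's a quick back-of-the-envelope for classic DIN MIDI, assuming the standard 31250 baud and 10 bits per byte on the wire:

    #include <cstdio>

    int main() {
        const double baud        = 31250.0;
        const double usPerByte   = 10.0 / baud * 1e6;   // 1 start + 8 data + 1 stop bits
        const double usPerNoteOn = 3.0 * usPerByte;     // status + note + velocity
        std::printf("one byte:    %6.0f us\n", usPerByte);      // ~320 us
        std::printf("one Note On: %6.0f us\n", usPerNoteOn);    // ~960 us
        std::printf("a 4-note chord is smeared over ~%.1f ms\n",
                    3.0 * usPerNoteOn / 1000.0);                // last onset ~2.9 ms late
    }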


1. MIDI traffic limitations are rarely the issue. The chord smearing that some people claim to be able to hear is not because of traffic but because the protocol is a serialized stream of individual note on/note off messages, and thus by definition there is no possible way for every message to arrive at the same time. However, the actual delays between a set of note on messages caused by the protocol is small enough that it is in the same range as human performance on both keyboards and string instruments. Note that MIDI has no collision detection or ACK-style replies, and you do not use "a single cable" for MIDI unless you have only 1 sender and 1 receiver. If it is a DAW sending "a lot" of MIDI to some external MIDI hardware, the only issues arise if the total amount of data to be sent exceeds the serial capacity of the hardware layer. This is not impossible to make happen, but even so-called black MIDI faces a challenge when doing this, even with classic (DIN) serial MIDI.

2. "parsing incoming notes" does not cause more smearing. Block-sized processing of audio causes a delay which is the "performance latency" that people complain about. It does not change the ordering or interval between note onsets.

3. the "finite amount of processing that has to be done before sound starts being generated" is irrelevant in a block processing architecture (which is used these days by all DAWs and all plugin APIs). As long as the plugin gets its work done within the time represented by the block,there is no additional latency caused by the plugin. If it doesn't, then there's a click anyway.

4. "there's always the possibility that two notes that should start in the same 5ms buffer slot will be spread across two of them". No, there isn't, If that happens, that's a coding error in either the plugin host or the plugin or both. But also, time is continuous. If the notes are supposed to be 3msec apart, it doesn't matter if they are 3msec apart within the same buffer/process cycle, or in two consecutive ones.


I don't know. When it comes to real-time audio... imagine a huge festival with a giant wall of speakers blasting at the audience. If the audio playback glitches and you get something like a 22kHz buzz (alternating two samples), that is a lot of fried ears.


This scenario is the stuff of nightmares for me!

When you have 100k people paying anywhere from $500 to the sky’s the limit, failure is not an option. Increasingly, audio engineers and subsequently performers are at the mercy of the latest jr developers who don’t have to live with the failures of their short-sightedness. Grimes’ Coachella set is a case in point, wholly due to Pioneer ignoring their users for over a decade. Sometimes we don’t have 3 days to copy files to a USB drive, but I digress.


Apparently, failure was an option. Just not a very popular one.


Grimes' failure was still pleasant when you compare it to the mayhem you get when the DSP inside the amplifier system glitches for a few samples.

What do you think happens when a dense crowd of 500+ people suddenly starts to have excruciating ear pain?


Is this something that happens often or are you simply speculating?


My understanding is that in practice, for very large shows, electronic musicians have fully redundant computer setups running in parallel and some hardware that will switch over instantly if one fails.

For example, here is one rig:

https://www.reddit.com/r/ableton/comments/7y2u3o/ableton_mai...

It uses a Radial SW8 to automatically switch between the redundant machines if one flakes out:

https://www.radialeng.com/product/sw8


If failure is not an option you bring 2 computers to every gig, burn CDs, and bring your vinyls.

Grimes is a "dj" that does not understand the software. Fixin that problem is one fucking click on the interface.


Heh, I thought it was odd you referenced a ten year old show, but I guess she made a similar mistake twice. Her 2014 Coachella set was a total mess.


But you'll never be 100% sure. Most musicians aren't willing to pay for NASA-level QA and custom hardware running an RTOS, and even that doesn't guarantee perfect software.

We're always dealing with risk and trade-offs. Maybe you avoid a locking `atomic` synchronization point by implementing a more complicated lock-free ringbuffer, but in the process you introduce some other bug that has you dumping uninitialized memory into the DAC.
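
For concreteness, this is roughly the kind of structure being weighed up: a minimal single-producer/single-consumer ring buffer sketch (not production code; the index arithmetic and memory orderings are exactly the parts that are easy to get subtly wrong):

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    template <typename T, std::size_t N>
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
    public:
        bool push(const T& v) {  // producer thread only
            const auto w = write_.load(std::memory_order_relaxed);
            const auto r = read_.load(std::memory_order_acquire);
            if (((w + 1) & (N - 1)) == r) return false;  // full (one slot kept empty)
            buf_[w] = v;
            write_.store((w + 1) & (N - 1), std::memory_order_release);
            return true;
        }
        std::optional<T> pop() {  // consumer thread only
            const auto r = read_.load(std::memory_order_relaxed);
            const auto w = write_.load(std::memory_order_acquire);
            if (r == w) return std::nullopt;  // empty
            T v = buf_[r];
            read_.store((r + 1) & (N - 1), std::memory_order_release);
            return v;
        }
    private:
        std::array<T, N> buf_{};
        std::atomic<std::size_t> write_{0};
        std::atomic<std::size_t> read_{0};
    };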

I think the advice in TFA is totally reasonable and worth following. I'm just saying that there may be cases where it's OK to violate some of these rules. I'd love to see more data to help inform those decisions.

This isn't even in opposition to the article, which says explicitly:

>Some low-level audio libraries such as JACK or CoreAudio use these techniques internally, but you need to be sure you know what you’re doing, that you understand your thread priorities and the exact scheduler behavior on each target operating system (and OS kernel version). Don’t extrapolate or make assumptions


Off topic, but does TFA still mean "The Fucking Article"? In my personal understanding it came from people telling others to "read TFA". But seeing the term used ubiquitously to refer to "the article" while keeping the profanity just seems kinda strange to me. We could say something like "TA" and omit the "fucking", but maybe it actually means something completely different and my personal lore has just detached from the zeitgeist.


I think it does mean "the fucking article", but I also think a lot of people use it as "the featured article". I agree with you though, it's a bit confusing sometimes as the less nice usage is still also common hahaha.


The real fun is optimising maths. Remove all divisions. Create LUTs, approximations, CPU-specific tricks. Despite the fact that CPUs are orders of magnitude faster now, they are still slow for real-time processing.


Real time does not mean fast, it means deterministic

Thus such micro-optimizations are seldom used. Quite the opposite: you try to avoid jitter, which can be caused by things like cache effects.


While real-time does not mean fast, micro optimisations are frequently used. No one likes slow DSP audio software.


> No one likes slow DSP audio software.

And then there's Diva at its highest output quality setting... :)


Yes, I did think twice about posting that precisely because of Diva.


Jitter does not matter if deadlines are met. It only matters if it can cause deadlines to be missed (sometimes).


If you have a buffer that's being clocked out and your goal is to keep data flowing, the jitter is going to influence how small your buffer can be. Let's say you're producing 56 kHz audio: the best you can do is produce a [sample] exactly at that frequency. If you have 1ms jitter, now you need a 1ms buffer, so you have delay. If jitter is small enough, like 0.1ns jitter in some SIMD calculation, then for all intents and purposes it doesn't matter for an audio application...
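
Roughly, the arithmetic looks like this (illustrative numbers, assuming a 48 kHz stream):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double sampleRate = 48000.0;
        const double jitters[] = {1.0, 0.1, 0.0001};       // 1 ms, 100 us, 100 ns
        for (double jitterMs : jitters) {
            double samples = std::ceil(jitterMs * 1e-3 * sampleRate);
            std::printf("%8.4f ms jitter -> at least %3.0f extra samples (%.3f ms) of buffering\n",
                        jitterMs, samples, samples / sampleRate * 1000.0);
        }
    }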


You've just restated my point. If the deadlines are met, jitter doesn't matter. Ergo, you can't meet deadlines if your jitter is too large. Otherwise, it doesn't matter.


Wouldn't the deadline be now+zero for real time audio applications? If I'm building a guitar pedal (random example) ideally I want no delay from the input to the output. Any digital delay makes things strictly worse and so any jitter matters. That said, the difference between zero and very close to zero does become a moot point given small enough values for any practical purpose.


There are some digital audio systems that do sample-by-sample processing. Old school Digidesign, for example.

But very little digital audio gear works that way these days. The buffer sizes may be small (e.g. 8 or 16 samples), but most hardware uses block-structured (buffer-by-buffer) processing.

So there's always a delay, even if 1 sample.


Basically "It doesn't matter when it doesn't matter".


> Create LUTs

This has been slower for most things than raw computation for well over a decade (probably more like two).


If there are complex equations involved, it absolutely is faster. You can also create intermediate LUTs, so the tables are small and fit in cache and then do interpolation on the fly.
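
A minimal sketch of that pattern, with tanh standing in for some genuinely expensive function (table built once off the audio thread, linear interpolation per sample):

    #include <array>
    #include <cmath>

    constexpr int   kSize = 1024;
    constexpr float kMin = -4.0f, kMax = 4.0f;

    struct Lut {
        std::array<float, kSize + 1> table{};
        Lut() {
            for (int i = 0; i <= kSize; ++i)
                table[i] = std::tanh(kMin + (kMax - kMin) * i / kSize);  // precompute once
        }
        float operator()(float x) const {          // cheap per-sample lookup + lerp
            float pos = (x - kMin) / (kMax - kMin) * kSize;
            if (pos < 0.0f) pos = 0.0f;
            if (pos > kSize - 1) pos = static_cast<float>(kSize - 1);
            int   i = static_cast<int>(pos);
            float frac = pos - static_cast<float>(i);
            return table[i] + frac * (table[i + 1] - table[i]);
        }
    };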


Not at all. When you work with DSP, even nowadays, using LUTs is very common and usually faster.

You are not storing a sine table, but the results of very complex differential equations.


Yeah, isn’t hitting memory (especially if it can’t fit in L1/L2 cache) one of the biggest sources of latency? Especially since on modern CPUs it is almost impossible to max out the arithmetic units outside of microbenchmarks?


You don't really do these any more on a modern CPU. This is stuff I used to do 30 years ago and you might still do if you're on a microcontroller or some other tiny system. The CPUs aren't slow. The main problem is that if the OS doesn't schedule your process, it doesn't matter how fast the CPU is.


This is great fun! But it's much more prevalent in embedded DSP than desktop.


> deterministically fast

Indeed, like all real-time systems you need to think in terms of worst-case time complexity, not amortized complexity.


Consider the use of Ethernet in real-time systems. Packet loss, collision rate, and jitter are "good enough", so it became an acceptable replacement for e.g. ATM.


Yes. Most modern Ethernet isn't running on shared media (i.e. there are no collisions), and for the most part there's no packet loss as long as there's no congestion. For networks and for the CPU, when you're fast enough the jitter matters less: if the CPU or the network "takes a break" (from the application perspective), it tends to be a very short break on really fast networks or CPUs. E.g. if a packet gets in front of you on 10Mbps Ethernet, that's a big deal for an audio application, but a packet ahead of you on 10Gbps Ethernet isn't much of a delay for audio. 1ms vs. 1us sort of thing.

[fixed typo]
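
The arithmetic behind that "1ms vs. 1us" intuition is just serialization time for one full-size frame:

    #include <cstdio>

    int main() {
        const double frameBits = 1500.0 * 8.0;  // one ~MTU-sized frame, ignoring framing overhead
        const double speedsMbps[] = {10.0, 100.0, 1000.0, 10000.0};
        for (double mbps : speedsMbps) {
            double us = frameBits / (mbps * 1e6) * 1e6;
            std::printf("%7.0f Mb/s: %8.1f us per 1500-byte frame\n", mbps, us);
        }
    }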


Or you use AVB/TSN which gives you stronger guarantees, but requires cooperation of all bridges (switches).


Notice that games are often able to render each frame in 5ms, which in practice means running multiple short programs for each pixel you see on a 4K screen. So modern computers are able to do a huge amount of computation in 5ms (on the order of 10^10 flops, 10^8 bytes). If puny kilobytes of audio data cannot be processed in 5ms, it means things are terribly wrong.


Games do a very impressive amount of work for graphics but there's a huge difference: a dropped/late graphics frame every now and then is not a big deal.

An audio glitch is very annoying by comparison, especially if the application is a live musical instrument or something like that. Even the choppy rocket motor sounds of Kerbal Space Program (caused by garbage collector pauses) are infuriating.

It's kind of the difference between soft and hard real-time systems. Most audio applications don't strictly qualify as hard real time (where missing a deadline is as bad as a total failure), but failing a deadline is much worse than in graphics.


Frame pacing issues due to dropped frames are absolutely a huge deal in games.


Yes, but audio glitches are much worse for the experience.


The GPU is able to parallelize the drawing of each pixel.


Missing a draw and thus displaying the previous screen for a frame is not noticeable in general.


Depends on where you run your audio. I've worked on embedded devices where we could just use isolcpu and IRQ controls to make sure specific cores were only ever used for the audio thread of the app.


What do you mean by "bus" and "module" in this context?


"Bus" (as I was using it) is the path from some audio source to some audio destination and a "module" (as used) would be something that takes a buffer of samples on that bus and does something with it.

You might sometimes build an app where (through your operating system) you connect directly with an input device and/or output device and then do all the audio processing yourself. In this case, you'd more or less control the whole bus and all the code processing samples on it and have a fairly true sense of your deadline. (The OS and drivers would still be introducing some overhead for mixing or resampling, etc, but that's generally of small concern and hard to avoid)

Often, though, you're either going to be building a bus and applying your own effects and some others (from your OS, from team members, from third party plugins/libraries, etc) or you're going to be writing some kind of effect/generator that gets inserted into somebody else's bus in something like a DAW or game. In all these cases, you need to assume that all processing code that isn't yours needs all the time that you can leave for it and just make your own code as efficient as is reasonable.


Thanks for clarifying. The terms are highly ambiguous (see the sibling answer https://news.ycombinator.com/item?id=40930298), which is why I asked. Personally, I would rather use the terms "audio pipeline" or "audio graph" instead of the generic "bus".

> In all these cases, you need to assume that all processing code that isn't yours needs all the time that you can leave for it and just make your own code as efficient as is reasonable.

Yes. For audio programmers that is obvious, in particular when it comes to plugins, but for novices it might be worth pointing out!


> You might sometimes build an app where (through your operating system) you connect directly with an input device and/or output device and then do all the audio processing yourself.

In case it is not clear, that is the primary case that is addressed by the linked blog post (source: I wrote the blog post).


And likewise: in case it wasn't clear, it's a great article! I wasn't meaning to criticize it, just add a little further perspective for the common scenario that many first-time audio programming folks encounter.


A module is a piece of software or hardware which is independent in some way.

A bus is a shared medium of communication[1]. Often, busses are time-division multiplexed[2], so if you want to use the bus, but another module is already using it, you need to wait.

For example, if your audio buffers are ultimately submitted to a sound card over a PCI bus, the submission may need to wait for any ongoing transactions on the PCI bus, such as messages to a graphics card.

[1]: https://en.wikipedia.org/wiki/Bus_(computing)

[2]: https://en.wikipedia.org/wiki/Time-division_multiplexing


That is one possible interpretation, but not what they meant. That's why I asked because I wasn't sure :)


Great resource! For those interested in learning the fundamentals of audio programming, I highly recommend starting with Rust.

the cpal library in Rust is excellent for developing cross-platform desktop applications. I'm currently maintaining this library:

https://github.com/chaosprint/asak

It's a cross-platform audio recording/playback CLI tool with TUI. The source code is very simple to read. PRs are welcomed and I really hope Linux users can help to test and review new PRs :)

When developing Glicol(https://glicol.org), I documented my experience of "fighting" with real-time audio in the browser in this paper:

https://webaudioconf.com/_data/papers/pdf/2021/2021_8.pdf

Throughout the process, Paul Adenot's work was immensely helpful. I highly recommend his blog:

https://blog.paul.cx/post/profiling-firefox-real-time-media-...

I am currently writing a wasm audio module system, and hope to publish it here soon.


I have mixed feelings on cpal: on the one hand, it's been really wonderful to have a library that just works on different platforms. On the other hand, it's an absolute pain in the butt for doing anything simple. I really wish it would have a simple interface for when I'm only worried about floating point data (I ended up creating my own library to wrap cpal's idiosyncrasies for my mixed midi/audio node program: https://github.com/smj-edison/clocked).


Is it still the case that cpal doesn't support "synchronous" duplex audio where the program inputs audio from a source and outputs it to a sink (either with feedback or outputting unrelated audio), with an integer number of periods (as little as 2) of software-level latency if you copy source buffers to the sink? Last time I used it, each stream is opened in input or output mode and opening both does not run with any guaranteed timing relation.


Is there a good "toolbox" style cli for audio? Like pitch shifting and time stretching etc


sox


As a web developer, learning music and audio programming makes my mind melt. We often say "real time" when we mean "fast." But in audio real time means "really fast, all the time" and somewhat deterministically.

If your tempo drifts, then you're not going to hear the rhythm correctly. If you have a bit of latency on your instrument, it's like turning on a delay pedal where the only signal coming through is the delay.

One might assume that if you just follow audio programming guides then you can do all this, but you still need to have your system set up to handle real-time audio, in addition to your program.

It's all noticeable.


> We often say "real time" when we mean "fast." But in audio real time means "really fast, all the time" and somewhat deterministically.

As a former developer of real time software, the usage of "real time" to mean "fast" makes me cringe a bit whenever I read it. If there's a TCP/IP stack in the middle of something, it's probably not "real time."

"real time" means there's a deadline. Soft real time means missing the deadline is a problem, possibly a bug, and quite bad. Hard real time means the "dead" part of "deadline" could be literal, either in terms of your program (a missed deadline is an irrecoverable error) or the humans that need the program to make the deadline are no longer alive.


And to demonstrate that hard realtime is not about speed, there is a whole hard-real time JVM implementation with GC and everything used in military contexts.

Modern computers are ridiculously fast; relatively speaking you don’t need many resources to calculate a missile trajectory. So by “simply” being 100% sure of doing some calculations at a fixed rate, even with a GC cycle that has a deterministic upper bound (e.g. it will go through the whole, non-resizable heap, but it will surely always take at most n seconds), you can pass the requirements. Though a desktop computer pretty much already forfeits the hard part of hard realtime, due to all the stuff that makes it fast: memory caching, CPU pipelining, branch prediction, normal OS scheduling, etc.


What is it like working with hard real time? As in, is there some tooling to determine how long a given function will take to run to meet a deadline? It starts to sound like an NP problem, but we have other kinds of correctness and proof testing.

I suppose it's hard to make guarantees with different environments and hardware, but I realized when we (non-realtime people) ship software we don't really have guarantees for when our functions run.


It's just a very different programming model. You don't write potentially infinite loops, you allocate memory ahead of time, anything that might be potentially unbounded is handled asynchronously, etc. Things like I/O run on precise timers and so on.

Keep in mind you don't use the same operating systems (or often even the same hardware) in hard real time applications. You'll use a real time operating system (FreeRTOS, VxWorks, etc) with a very different task scheduler than you're probably used to with processes or threads in Unix-like platforms. That said, while multitasking exists for RTOSes in practice you're not going to be running nearly as many tasks on a device as say a web server.

You can get worst case performance of a section of code by guaranteeing that it has a fixed maximum number of operations (no infinite loops, basically). Of course the halting problem applies, but you're never concerned with solving it in the general case, just for critical sections. It gets tricky with OOO architectures but you can usually figure out a worst case performance.


If you think that's cool then you may also like "hard" real-time programs, e.g. ABS brakes: https://en.wikipedia.org/wiki/Real-time_computing#Criteria_f...


You can actually do music/audio programming in a browser. It has some rough edges but it works.


Absolutely you can. With WebAsm SIMD you have near-native DSP performance. Downsides from my experience [1]:

- You are at the mercy of the browser. If browser engineers mess up the audio thread or garbage collection, even the most resilient web audio app breaks. It happens.

- Security mitigations prevent or restrict use of some useful APIs. For example, SharedArrayBuffer and high resolution clocks.

[1] https://bungee.parabolaresearch.com/bungee-web-demo


> Applications where low latency is especially important are (1) interactive audio systems (such as musical instruments or DJ tools) where the UI needs to be responsive to the performer, and (2) real-time audio effects, where the system needs to process analog input (say from a guitar) and output the processed signal without noticeable delay.

It's worth noting that these are practically the only case where extreme real-time audio programming measures are necessary.

If you're making, for example, a video game the requirements aren't actually that steep. You can trivially trade latency for consistency. You don't need to do all your audio processing inside a 5ms window. You need to provide an audio buffer every 5 milliseconds. You can easily queue up N buffers to smooth out any variance.

Highly optimized competitive video games average like ~100ms of audio latency [1]. Some slightly better. Some in the 150ms and even 200ms range. Input latency is hyper optimized, but people rarely pay attention to audio latency. My testing indicates that ~50ms is sufficient.

Audio programming is fun. But you can inject latency to smooth out jitter in almost all use cases that don't involve a live musical instrument.

[1] https://www.youtube.com/watch?v=JTuZvRF-OgE&t=490s
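
The trade-off is easy to put numbers on. A sketch, assuming 48 kHz audio and 256-frame device buffers (each extra queued buffer buys one buffer-length of slack against jitter at the cost of one buffer-length of latency):

    #include <cstdio>

    int main() {
        const double sampleRate = 48000.0;
        const int    frames     = 256;                   // ~5.3 ms per buffer
        for (int depth = 1; depth <= 8; ++depth) {
            double bufMs = frames / sampleRate * 1000.0;
            std::printf("queue depth %d: %5.1f ms added latency, tolerates ~%5.1f ms of jitter\n",
                        depth, depth * bufMs, (depth - 1) * bufMs);
        }
    }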


Shooting a gun (or whatever) in a game and "waiting" for the sound is extremely isomorphic to pressing a key on a MIDI keyboard and "waiting" for the sound.

Yes, background sound in games can be handled with very large buffers, but most players expect music-performance-like latency for action-driven sound.


Games don't have particularly large buffers. There's just a very long pipeline with lots of buffering. It's honestly pretty bad. But almost no one measures and it doesn't actually matter to players.

Musicians have keenly trained ears. I would imagine they're much more sensitive to audio latency than even a pro gamer, never mind the average Joe off the street.

Where latency really matters is when you have a musical instrument that plays a sound and it's connected to a monitor. If those sounds are separated by more than 8ms or so the difference will be super noticeable to anyone, including Joe off the street.

I'd be interested for someone to run a user study on MIDI keyboard latency. I'd bet $3.50 that anything under 40 milliseconds would be sufficient. Maybe 30 milliseconds. I'd be utterly shocked if it needed to be 8 milliseconds. And I'd be extremely shocked if every popular MIDI keyboard on the market actually hit that level of latency.


MIDI -> audio latency is mostly a function of the MIDI receiver, not the MIDI sender. So the MIDI keyboard is not where the latency comes from (typically).


One of my pet peeves is that graphics programming does not generally enjoy the level of rigor that audio does, despite arguably also being real-time. This is doubly true for UI programming; I'm not aware of any UI toolkits designed with real-time in mind.

I would love to see a UI system that has predictable low-latency real-time perf, so you could confidently achieve something like single frame latency on 144Hz display.


On the other hand, the throughput requirements of image processing are often far more stringent than in audio. A full-resolution stereo audio stream is only about 200 kB/s = 12 MB/min after all, whereas a video system might have to chew through many gigabytes in the same time.


Well, if you allow a bit of a sloppy terminology, graphics are “softer” real-time, than audio - a frame drop is less noticeable than audio glitches.


That's where realtime and fast differ: you can consider that there's a "hard"-er real-time at 100ms, for example, where you can drop some frames but you don't want to block the UI. Be slow, but don't block the UI.


When I last made a similar point someone pointed out to me that this has gained a higher priority recently thanks to VR. Janky framerate is going to be much more noticeable with a headset than with a traditional monitor, and could even result in lost lunches!


The tiniest audio glitch is instantly noticed and extremely annoying. So extreme efforts are put into preventing them. If your software has audio glitches people will really stop using it.

A graphics micro-stutter not so much.

> I'm not aware of any UI toolkits designed with real-time in mind.

What would be the point? The human eye can only notice so much FPS (gamers might disagree with their 244 FPS displays).


Not just annoying, but potentially dangerous. An audio glitch could conceivably deafen 2 thousand people at a show if things go wrong.


All games seem to manage this somehow, usually by using a fixed layout for the UI, because re-layout is a nasty, non-parallelizable problem that UI designers frequently inflict on themselves.


A timeless classic! This is the first thing I always recommend to anyone interested in real-time audio programming.


Timur Doumler's videos on the topic are also pretty good and bring some new methodologies to the table:

https://youtu.be/zrWYJ6FdOFQ

https://youtu.be/vn7563IAQ_E

https://youtu.be/7fKxIZOyBCE


Good set of videos. Here is the article version of the first: https://timur.audio/using-locks-in-real-time-audio-processin...

The insight is that with two threads contending on one lock, there are efficient ways to build the lock that minimize CPU use on the non-realtime thread.
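
A hedged sketch in that spirit (not the exact implementation from the talks): a spinlock whose waiting side escalates from a busy spin, to a spin with a CPU pause hint, to yielding, so a tiny critical section held by the other thread costs the waiter very little. The real-time side would typically prefer try_lock and degrade gracefully rather than block.

    #include <atomic>
    #include <thread>

    struct SpinMutex {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;

        bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }

        void lock() {
            for (int i = 0; i < 100; ++i)          // 1) optimistic busy spin
                if (try_lock()) return;
            for (int i = 0; i < 1000; ++i) {       // 2) spin with a pause hint
                if (try_lock()) return;
                // _mm_pause() on x86 / __yield() on ARM would go here
            }
            while (!try_lock())                    // 3) give up the timeslice
                std::this_thread::yield();
        }

        void unlock() { flag.clear(std::memory_order_release); }
    };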


Off topic. Anybody else like Thursday Next? Had to think of "Time waits for no man!"


Slightly tangential, but does anyone know any good (Windows-based) DSP software? Equalizer APO is decent in theory, but beyond being clunky to use it unfortunately doesn't even seem to work 90% of the time.


Do you mean software that can modify audio streams via the Windows audio system (as opposed to an ASIO driver used by DAWs)?

DDMF's VirtualAudioStream does that. It allows you to create virtual audio devices with chains of arbitrary VST plugins. As for the VST plugins, there are thousands of free and paid plugins for everything. I'm using VirtualAudioStream to put a Waves noise-cancelling plugin and a good compressor between my mic and Zoom. It increases latency, of course.


> Do you mean software that can modify audio streams via the Windows audio system

I think so. TBH I'm quite new to the world of DSPs so I don't know the right terminology. The purpose of the DSP (which I should've mentioned in my original post now that I think of it) is to tweak the speakers on my laptop - there are for example ways to "fake" bass (through missing harmonics), or have dynamically changing bass. I'll have a look at VirtualAudioStream, thanks for the recommendation.



Graphical: Pure Data, Max/MSP

Text based: SuperCollider, Csound, Chuck



I gave up on software and bought a Steinberg UR44C. Almost zero latency for EQ and compression, so you can monitor yourself in real time.


AudioMulch?


I wish you were still actively developing it, I haven't been able to run it on Mac for a while now. More than any other recording software, AudioMulch gave me a taste for digital audio and specifically realtime resampling. Since using AudioMulch, any DAW that doesn't let me record arbitrary combos of audio streams in realtime feels hopelessly limited.


I'm still a fan ~25 years later


I think CamillaDSP works for Windows.


SuperCollider


www.airwindows.com


If you’re worried about glitches during live performances there’s a foolproof solution: play real instruments. ;)


That's great if you don't need any amplification at all, or perhaps nothing more than a typical guitar/bass amplifier. Big venues need a lot more hardware than this.


“Big venues”. lol. I’m sure when Metallica played in Russia in 1991 for the 4th largest concert in history, it didn’t really happen because, according to you, they needed digital (glitchy) hardware.

What are you even talking about, man.


They don't have the old stuff any more. It's like trying to land men on the Moon now: the Apollo hardware is all gone, except the stuff in a museum that doesn't work any more.


That’s a good way to put it. But do mixing boards today really do ADC then DAC?!? Seems utterly pointless.


Why? (Disclaimer: I'm not a professional in that industry, so I don't really know what I'm talking about.) Just as one example, the microphones and guitars these days are all connected to the mixing board through wireless radios, which we can assume are using digital protocols (just like all other wireless comms these days). Long gone are the days of ridiculously long cords that musicians trip over on stage. So at least that portion of the audio path has to be digital. And from what little I've been able to see of a modern mixing board from the audience, they look quite modern, with computerized displays, so of course they're going to be working with the audio digitally.


Wireless transmitters don’t immediately imply digital. But I see your point. I haven’t done sound reinforcement since the 90s. The body mics we used weren’t digital. Shit, I’m old. I’ll just stop talking, because my initial post was an attempt to channel Randy Marsh showing the gang how to play a real guitar instead of Guitar Hero, but then I talked out of my league. lol.


:-)

I think things have just changed technologically a lot. Back in those days, the entire audience didn't all have wireless radios in their pockets, so surely the issue of EM interference is much, much worse now. Also, radio systems these days are almost all digital, probably because of interference partly, but partly also because that's just where all the tech is: we know how to make good digital radios now, they're readily available and cheap, so if you're designing a new product that needs a radio, why do it any other way? It's like programming languages: if you need to write a new program for work, you'll probably write it in something currently in-vogue that lots of other engineers are familiar with, like Python or C++ or whatever, and not something old like Lisp.


A digital synth is a real instrument and electronic music is real music.


... and if the FOH engineer is using a poorly engineered digital console, or even a DAW, for mixing ... ?


or embrace glitch in your music.


I thought glitch music was a good thing.


This seems super outdated. Isn't CoreAudio HW accelerated now?


It continues to be relevant. On modern computers, mobile especially, there are more and more things that can prevent audio from being truly real-time. Probably one of the worst offenders is power management; it is extremely likely that the power management governor has no awareness of upcoming audio deadlines.

Obviously the happy case is when all the audio processing is done in a DSP where scheduling is deterministic, but it's rare to be able to count on that. Part of the problem is that modern computers are so fast that people expect them to handle audio tasks without breathing hard. But that speed is usually measured as throughput rather than worst-case latency.

The advice I'd give to anybody building audio today is to relentlessly measure all potential sources of scheduling jitter end-to-end. Once you know that, it becomes clearer how to address it.


How do you measure the scheduling jitter of audio drivers and userspace? I hear real-time priority and RT kernels improve latency? (There was some controversy about Ubuntu Studio or something switching to RT kernels with more missed deadlines than RT-PREEMPT, but I don't know how to quantify this stuff.)


That's actually a very good question. To some extent, performance analysis tools may be able to give you high-accuracy timestamps of things involving context switches and other things that can cause scheduling jitter. If you can get access to things like the fill level of FIFO buffers, even better. You may also be able to do experiments like cutting buffer sizes down to the bone to see how low they can go without glitching.

Of course, it's not unusual that the many layers of abstraction in modern systems actively frustrate getting real performance data. But dealing with that is part of the requirements of doing real engineering.


For context switch timestamps would I be using something like perf on Linux?

Would buffer level checking entail, eg. checking snd_pcm_avail() when my audio code begins and ends (assuming I'm talking directly to hardware rather than PipeWire)? Dunno if PipeWire has a similar API considering it's a delay-based audio daemon and probably checking snd_pcm_avail when a client requests would slow it down.


Buffer level makes almost no sense.

All that matters is/are:

1. how soon after the device deadline (typically marked by an interrupt) until a kernel thread wakes up to deal with the device? (you have no control over this)

2. what is the additional delay until a user-space application thread wakes up to deal with the required data flow? (you have very little control over this, assuming you've done the obvious and used the correct thread scheduling class and priority)

3. does user space code read & write data before the next device deadline? (you have a lot of control over this)

As noted above, cyclictest is the canonical tool for testing the kernel side of this sort of thing.


cyclictest(8) is the canonical tool for starting out down this pathway. That measures basic kernel stuff relevant to this area of inquiry.

However, since actual context switch times depend a lot on working set size, ultimately you can't measure this stuff accurately unless you instrument the actual application code you are working with. A sample playback engine is going to have very different performance characteristics than an EQ plugin, even if there is theoretically more actual computation going on in the latter.
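
A minimal sketch of that kind of instrumentation (names are hypothetical): time-stamp each callback, bucket the deviation from the nominal period into a preallocated histogram, and only read it out from a normal-priority thread.

    #include <array>
    #include <chrono>
    #include <cstdint>

    struct CallbackTimer {
        std::array<std::uint32_t, 128> histUs{};   // bucket = microseconds of jitter
        std::chrono::steady_clock::time_point last{};
        std::int64_t expectedUs = 5333;            // e.g. 256 frames @ 48 kHz

        // Call at the top of the audio callback: no allocation, no I/O.
        void onCallback() {
            auto now = std::chrono::steady_clock::now();
            if (last.time_since_epoch().count() != 0) {
                auto us = std::chrono::duration_cast<std::chrono::microseconds>(now - last).count();
                auto jitter = us > expectedUs ? us - expectedUs : expectedUs - us;
                ++histUs[jitter < 127 ? jitter : 127];
            }
            last = now;
        }
        // A non-real-time thread can snapshot and print histUs periodically.
    };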


also interested in the answer to this



