Totally off topic, but I've started to notice that more and more algorithmic low-ish level content assumes Rust by default. My entire life I've been used to anything algorithmic being either written in sciency pseudocode (invariably in latex-generated PDFs), or plain old C(++), the vulgar latin of computer programming. I know C and I don't really know Rust but nevertheless I love that this is changing! I'd never advise a new programmer interested in low-level/fast stuff to begin with C now, so much good content is written around Rust. Like the old C examples, the post doesn't even say "this is Rust". You're expected to expect it to be Rust.
And, well the language choice doesn't really matter for the core subject here. I could follow this post fine even though I don't really know Rust (just like I assume a Rust programmer could follow a well written algorithms deep dive with examples in C with a bit of dedication). EDIT okok that was a lie, I could follow the first half of this post, ish, and then it got way too deep for me.
But anyway, standardize on Rust, I like it. I'd love if a bit more Zig was thrown into the mix here but either of them feel like a better default than C at this point to me. After decades of C being The Standard for this stuff, I love that this is finally changing.
Generally Zig is going to be more intuitive for C developers (people who prefer more low-level and procedural style coding), while Rust would be more intuitive/conducive to C++ developers (those who prefer more abstractions and language level features/guarantees).
Which is totally fine, there's a reason C++ never completely subsumed C; both are valid avenues.
Not exactly sure about it myself either. There are things I like about it but at the same time there others that sort of bother me (uncertainty about whether the language will stick around, ugly syntax in certain cases). Being in Linux is impressive for a language, but I suspect most will stick to C.
I think it's almost irrelevant how good Rust is as a language. Rust is winning in many contexts, because you can use it to replace C++, but the standard library and tools are substantially better.
The entire user experience is so much nicer that even if Rust was slightly worse I would prefer it. I don't miss header files, makefiles, not having a package manager, maintaining all the crazy code people wrote themselves as pet projects instead of using an existing project in an ecosystem. Thankfully its not worse, so I don't even have to weigh the odds.
Rust is great for pet projects. Otherwise, it is a big security risk for large commercial projects, since it lacks a useful standard library and any non trivial program needs to download a thousand crates to work.
Apart from `simd`, and if you allow us to embelish by using Tokio and I'll ignore the C++'s reliance on Boost, there's not really much I can see from C++'s standard library that's anyhow wildly different from Rust.
What in particular do you see lacking in Rust that you can do in C++?
C# and other high level languages like Java are much safer than any low level language including Rust, in those languages you don't even have an unsafe mode
Have you ever heard of Java native Interface or Java native abstractions? Thats unsafe AF.
Same goes for native implementations. Whenever something needs to be done fast, the jvm implements native code. Have you ever debugged that magic bs? Guess why people chose a high level language, because they don't want to bother with complicated stuff. Still those abstractions on these languages are not better, even worse considering their APIs, than rusts.
Working with thread locals or synchronized keywords, non atomic members and all the bad concurrency mechanisms inside Java is unsafe shy default.
Runtime exceptions is what I consider a bad practice and it's spread everywhere in Java... Also the language evolutions are not really that good. So bringing errors/exception to the compilation time is a safety net.
Despite the fact that languages with managed memory safety bring a lot of potential for abusively high memory consumption, random GC stop the world situations and worse like oom.
Whenever I had to analyze a memory issue with an application it was no fun at all, despite the challenge.
The jvm specifications are so flawed considering security... Creating classes at runtime from a stream using reflections, which is totally spec compliant gave us log4shell. Why even load serialized classes from remote?! Fail by design.
So safety is still context sensitive and depends on the requirements. If you write bad software you can ignore some of the mentioned aspects, though it still lacks of 'security'
C# absolutely have an unsafe mode, basically since version 1. You can use pointers, deal with stackallocated memory and all of that, in recent versions you can even natively allocate and deallocate memory using the `NativeMemory` type. Of course, it is harder to cause problems and the `unsafe` isolated areas provide a similar unsafety abstraction mechanism as Rust does, but it is still possible.
And in the edge, both C# and Java have FFI and can call directly into C/C++ (or basically any native language) code, so you can still open a big security hole in your application (and nuget makes it very convenient to even ship native binaries in your packages, as you can see with the many wrappers around things like libsodium, raylib, windows/linux API's and more).
In the end of the day, C# had the same ideas of isolating unsafety that Rust has, and provides many safe wrappers around good libs, like Skia (SkiaSharp, very famous and widely used) for example. So those ideas are kinda market proven as well.
.NET languages have always had access to unsafe feature subsets at multiple levels of abstraction with varying degrees of "unsafety". .NET has also been slowly leaning more into its systems-programming side. It is now at a strange crossroads where it is not like C or C++, or as a much better replacement, Rust but not like Java or Go either.
If you're embedding, say, a C library inside of Rust, you can do so via a so-called "build.rs" script [1]. These run at build time so you can do pretty much whatever you want. They're not necessarily pretty, but given that Cargo isn't going to support other languages natively, it's often a less-evil option.
Then there are build systems that natively support Rust in addition to other languages and know how to plug them together [2] [3] [4]. If you're already using one of these existing tools, then you don't need to roll your own, you can plug it into your existing project.
Of course if you really are using Makefiles in C you're already living in the wild west, so it's on you to figure out how to add Rust to your projects in that case (or switch to a better build system).
Well, I don't use makefiles to deploy software with Rust. I also have never used lex or yacc, but I bet there are similar tools in the ecosystem, or wrappers for those. That would obviate what I will offer below.
I rarely use multiple languages with Rust. A lot of interpreted languages have bindings through crates and can go in to a project through Cargo (python, etc). If it involves JS/TS on desktop, I'm usually using Tauri for that. Guess it depends on the system?
Hopefully that helps. You can also still use a Makefile if you want I just haven't dealt with one in a long time.
I think their point was that Rust has a batteries included package manager (cargo). Another advantage of which being that all rust projects are pretty standard in that particular aspect.
I think the main problem is that Rust doesn't allow for basic data structures like graphs or doubly-linked lists without extra effort, which is why I've generally always preferred pseudocode.
I often hear this and am confused; not only are things like 'object soup'[0] possible and straightforward (putting things in collections and referring to them by indices), I never concretely hear why a graph or doubly-linked list becomes uniquely difficult to implement in Rust (and would genuinely be curious to learn why you feel this way). If you needed such data structures anyway, they're either in the standard library or in the many libraries ('crates' in Rust-lingo) available on Rust's package registry[1]---using dependencies in Rust is very straightforward & easy.
- From the perspective of e.g. C++, I don't think it's reinventing the heap. We often prefer to use indexes into a std::vector or similar, even when putting everything in std::shared_ptr is an option, because the dense layout of the vector gives us the blazing fast cache performance we crave. It also avoids memory leaks from reference cycles.
- From the perspective of e.g. Python, yeah it's reinventing the heap. Unless you need to serialize the whole collection into JSON or something, it's much less convenient. But (almost) no one uses Rust instead of Python just for the convenience. (There are dozens of us! Dozens!)
One of the reasons to reach for a low level language is to have control over tradeoffs specific to your use case instead of using some pre-packaged option. Data structures are a very good example. Generic data structures like those in the standard library will likely suffice for many use cases, but a low level language must give the user the control to write their own. It’s just table stakes for this type of language.
Thankfully, rust does give you that level of control, you can write your own doubly-linked list implementation if you want, and you can pick your safety-overhead tradeoff.
Not sure about Rust std. But on cpp I heard a lot low-level companies avoid using its standard library since it is too bloated and optimized for too many cases.
Well, one can always emulate that in Rust with array indexes. And it will be closer how the actual CPU works where a pointer is just an offset into a memory chunk.
Perhaps this is the reason performance oriented articles prefer Rust. In allows to skip a lot of memory allocation while still catching memory-safety bugs, which is a hard problem with C++.
How could something to implement a pointer be more straightforward than a pointer? People seem to like Rust in excessive, sometimes botheringly excessive proportions.
I don't think the indexing strategy is more straightforward in general. It's a thing you have to learn, and the fact that you have to learn it is part of Rust's famously steep learning curve. That said, there are some common cases where you might wish you'd used indexes instead of pointers in other languages too:
- Working with resizeable collections like std::vector in any non-GC language, where resizing invalidates pointers.
- Serialization. If your objects contain indexes to each other, it's easy to turn them into e.g. JSON. But if they contain pointers to each other, it's tricky.
Not easily. But not all use cases require it. The regex crate makes heavy use of this pattern for finite machine states, for example. But this fits nicely with an arena allocation pattern, so everything can just be dropped at once.
I'm not sure whether this is exactly what you mean, but when you need to support deletions you can switch to a "generational" collection like Slab or SlotMap. Your indexes are still integers, but under the covers some of the bits get repurposed to disambiguate new elements that reuse old slots.
Those are fairly straightforward to implement if you are ok using a reference counted type (along with weak references), although that will hurt performance. That would be similar to using a shared_pointer in c++.
And you can implement code similar to what you would in c or c++ if you use raw pointers and unsafe code.
Where you run into difficulties is if you want something that is fast and safe. And then you need a more complicated design that probably involves using indices into a Vec instead of pointers.
Using a Vec actually has some benefits beyond the safety guarantees from rust. There are fewer allocations, and you probably are more likely to get nearby nodes on the same cache line.
It does neither enforce correctness nor safety. It enforces memory safety in programs that stick to the safe subset. But don't get me wrong, Rust certainly has many good ideas and the focus on memory safety is good. I still do not think it is a great language.
The problem with rust is that it’s has lots of features but isn’t c like. Python doesn’t have much syntax so it’s fine to be not c like but rust has 100% new syntax and a lot of syntax with respect to c. I can pick up c like languages easy but rust just makes me mad to look at.
Nah just the final bits, stuff like `Simd::<u32, 8>::splat(q)`, I'm not sure what splatting is or how Rust's Simd library works or, admittedly, how SIMD works in detail at all. So eg I'm not sure what that 8 is doing there in the type parameters, details like that. Maybe this isn't a Rust thing but a SIMD thing, btw, I don't know much Rust but I also never wrote any SIMD code ever. I don't know how the article could be clearer, IMO once you go into the deep nitty gritty optimization stuff you just gotta assume the reader knows the tools you're using. I'm OK with dropping out halfway cause the core idea is there even before you squeeze out all the perf.
`Simd::<u32, 8>` is describing a vector with 8 lanes, where the width of each lane is 32-bits. For targets that support it, that should get you a 256-bit register.
The `splat(q)` constructor is just saying, "populate each lane with `q`." That is, the value is "splatted" across every lane in the vector.
The main idea of SIMD is simple: you represent multiple points of data in the same register and use specialized CPU instructions to operate on those points simultaneously. The difficulty of SIMD is coming up with clever algorithms that use the available instructions in an efficient way.
Ahh right, I've been doing too much TypeScript lately. I thought a type parameter couldn't impact behavior but only typechecking but clearly in Rust (like in C++) it can. Thanks for the explanation!
Ah so splat is like a memset for the entire vector to the same value. OK makes sense, I bet that wasn't actually a Rust-ism at all, just basic SIMD lingo I didn't know. Thanks again!
wow yeah, i way prefer Rust's notation there! but i assume you could make a similar zero-cost abstraction with some mildly crazy C++ template programming.
yeah tbh i was a bit starstruck that burntsushi would reply to my comment explaining what i assume is utter SIMD basics. that must be how swifties feel when she (or, likelier, her PR drones) click "like" on their insta comments.
Because that’s the original problem statement for Algol! As in, we’ve been publishing all that pseudocode for quite a bit and it seems like conventions have emerged, let’s formalize them and program in that. ... People were markedly more naïve back then, but then that’s sometimes what it takes.
I don't know if it's good or bad, with pseudo code you put no constraints on how it should be implemented. It's known that some kind of algorithm are really hard to implement in Rust (every one use the link list data structure as an exemple). So having the article that use rust is good to see it can fit to rust constraints but at the same time does this constraint limit the algorithm itself ?
Quite honestly, doesn't seem like Rust is a win here over C++. In fact, C++ has fewer sigils that make it somewhat easier to read sort of striking the balance between being simple enough (C-like) and abstractions (templates). Readability wise, I would have preferred Julia as that would have been the cleanest explanation of the code through the different topics at sufficient level of detail. Alas, it seems to have stalled somewhat (relatively speaking). It also doesn't help that every Rust article has Rust advocates jumping on you with "WAHT ABOUT SAFETY" which is off-putting since not every program needs to deal with all the complexity of the Rust memory/lifetime model to enable safety (simple bounds checking etc. might be sufficient).
The article is about how to squeeze absolute maximum in performance from the hardware. So you need a language that allows a relatively low-level access to hardware. And big plus of Rust is that it has standard SIMD libraries while C++ will get them only in C++26
One dimension that is not explored is partitioning the queries in batches. The primary cost is doing lookups on the out-of-cache table, so if you have a sufficiently large amount of queries you can resolve a couple layers of the tree in one step while those layers are in cache, grouping them based on where they land deeper in the tree, and then resolving all those queries that touch the same deeper part of the tree in one batch as well.
In theory, with an infinitely large amount of input queries you will never have to do an out-of-cache lookup, it can be completely amortized away into linear scans.
That said, you now end up with a bunch of results that need to be put back into the correct output order, which likely makes it not worth it. But if the operation can be fused into a reduction (e.g. find the sum of the binary search results) or the output order does not matter in some other way then all of a sudden it might make sense.
I implemented this method in Dyalog 18.0 with BlockQuicksort-like partitioning, using vectorized comparison with bit-boolean output. It's faster than you'd expect, maybe twice as slow as regular binary search when searched values are in cache, and better once they fall out of L2. But Dyalog reverted to 17.1 for later versions so you won't see it in a new download. It'll probably make it into CBQN eventually, perhaps with radix partitioning. Note that both quicksort and radix partitioning can be done and undone in a cache-friendly way.
Unlike quicksort, there's no issue of pivot selection since you always choose the midpoint of the searched values. However, there's a complementary issue of memory if the partitions become unbalanced, because the larger partition can require saved memory of roughly the depth times the number of elements. With a bit per comparison it's bounded by the size of the input.
Nice overview of sorting methods, thanks for sharing!
I also looked a bit into radix and distribution sort at some point over the past year, but in the end high performance sorting is actually too big of a thing to just do quickly on the side, as your post well shows :")
In fact I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.
Will definitely refer back to it once I'm looking at sorting again in more detail at some point!
I got stuck on sorting too, was working on SingeliSort (https://github.com/mlochbaum/SingeliSort) for a while. The basic performance is there but I need to get serious about testing before using it. But the radix sort and counting sort should be very solid. The approach is about the same as the C code currently used in CBQN, linked below. The main complication is to reduce constant overhead for shorter arrays with a small count type and better prefix sums, interleaved SWAR for SingeliSort since it targets generic architecture and shared SIMD utilities in CBQN.
Email in my Github profile, feel free to contact me any time if you'd like to talk about algorithms!
There are theory papers on "buffer trees"---B-trees where each node is augmented with an O(B)-length array of pending updates and queries. I believe there were also some attempts at building SQL databases based on them. It sounds like you're reaching for the same trick.
that's a hybrid compacting data structure: compacting within the btree node, normal btree topology otherwise.
And it works much better than pure compacting (i.e. the leveldb lineage), because you avoid lock contention at the root on multithreaded update workloads, and the compaction/resort is much lower overhead when it fits in l2.
incidentally, there's a filesystem that uses this technique.
Interesting stuff, definitely the kind of real world optimization that happens when you’re able to look at actual access characteristics rather highly abstracted models.
At one level you could just be saving the results into a hashtable to bubble them out again, but at the extreme end the lookup is actually spread across multiple machines in a cluster, so it’s more like throwing the query to the proper one with the right chunk of the index, along with instructions about where the final RPC is supposed to go with the result.
There was some work a while back on streaming data past queries, but you need a fairly bounded data set for that to work. Having ten years of historical data in a data set would gum that up severely.
It enhances the throughput (on average everyone waits less) at the price of a higher max latency (some, who posted a request mobilizing a very-recently-out-of-cache-index, will wait way, way more...), isn't it? In the real world those worst cases quite often kill such optimization.
The (data size / cache size) ratio and queries local (in time) dispersion are key.
Ah yes, I did consider sorting the queries at some point but I guess I then forgot about it again. If the queries are random and much less than the input size, probably most queries will hit different cache lines in the final layer, and I suspect benefits will be limited at maybe 2x best case since the last layer is where the bottleneck is.
Also (radix) sorting is very memory bound usually, and we probably need to sort in at 16^2=256 buckets or so to get sufficient reusing of the higher layers, but I don't have the numbers of what a round of radix sort takes. (My guess is order 1ns per query? Maybe I'll find time to investigate and add it to the post.)
high-performance sorting algos do either merging or partitioning. I.e., you merge R input streams into one, or split one input stream into R (for quick, radix and sample sort).
1. For merge sort of N elements, you have to perform log(N)/log(R) passes
2. For sample sort - C*log(N)/log(R), where С depends on the distribution of your data, there are no strict guarantees
3. For radix sort of K-byte elements exactly K passes (indeed, 256 ways is optimal according to my tests), which is usually larger than the previous value
While Mergesort looks preferable since it uses the least and fixed amount of passes, the merging code itself is less efficient - there are not much independent CPU operations, making it slower than samplesort and especially radix sort for small inputs.
So, it seems that the best strategy combines two different levels - for larger blocks you absolutely need to minimize the amount of passes [over memory] and employ multithreading, while for smaller blocks we need to minimize the amount of CPU operations and increase ILP.
Today two well-recognized algos are IPS4O for larger blocks and Google's SIMD QuickSort for smaller ones.
Amazingly thorough ! I love how the author leaves no stone unturned. I had no idea you could do the kind of low level efficiency shaving in Rust.
I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
Thanks!
It's somewhat tiring to not have loose ends, but I agree it pays off :)
Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now past me, but the endless transmuting between portable-simd types, plain rust arrays, and intrinsic types is quite annoying.
Also rust is not made for convenient raw pointer arithmetic. There plain C would be much more to the point.
You kind of lost me towards the end, so I’m not sure whether you attempted it, but I was wondering whether it would be possible for the internal nodes to store only the upper 8/16 bits (that are nonzero for the current subtree). This implies that one 64 byte cache line stores 32 or 64 entries instead of 16 (or better: 31 or 62/63, since u may need some bookkeeping dat).
The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also only use 8/12/16 bits, if the the higher bits are implied by the parent. or instead of bit masks, use offsets (i.e. leaves store k bits offset from parent value).
That may screw around with the branch predictor, and may work very good with evenly distributed data vs uneven data.
Ohh yes some good ideas there! I've thought about pretty much exactly this at some point but then didn't end up including it. But I do think it's quite promising, and most of the bookkeeping can be made to be branchless.
On the other hand, most of the latency is in the last few layers, and probably there isn't as much to be saved there.
The biggest problem might be that the bottom layer will anyway have to store full-width numbers, since we must be sure to have the low-order bits somewhere in case they weren't covered yet in earlier layers. Or we could have variable width encoding per node maybe (instead of per layer) but that does sound a bit iffy on the branch predictor.
In the end I guess I kinda like the simplicity and hence reliability of the 'just do the full tree and the first layers are cheap anyway' approach. Probably another factor 2 speedup is possible on specific datasets, but anything beyond this may not be reliably good on worst case inputs (like is the issue for the prefix partitioning methods).
The lowest level could be encoded with whatever number of bits is required for the numeric distance between the leafs. With proper book keeping, the full 32 bit values don’t need to be stored. The value for each node should be something like (parent << shift + node), where node has ideally 8 Bits, or maybe 16 bits.
It kind of comes down to how to deal with bad distribution of values, like large differences between adjacent values. One could for example deal with this by deliberately leaving „gaps“ in the tree, like using a 59-ary node on places (make the last values in the array large, so they will never get visited). With dynamic programming, perhaps an optimal distribution of the leaf level could be had - although it requires more bookkeeping bits of one needs to have the index of the found value, not just whether the value is in the tree.
The question could also be whether it could be interesting to store delta valeues, so that 63 8-bit valeues gives a larger numeric range - this means computing some sort of cumulative sum on on the 63 values, not sure whether there is a simd instruction for that.
One more thought: if there is different ways different layers of the tree are stored, but they are always the same for each layer, the code be unrolled, with every layer having different code to deal with it.
Last thought: if the index is not important to know, just whether the value is in the set or not, one could use a bitmap as a data structure. It would require 256MB for the 32bit space, but large spans of zeros imply empty pages.
There are stones still unturned. For typical unicode table lookup's you'd need batches for all characters of a word. It would be much faster to lookup eg 16 chars at once, to optimize cache misses.
But when I tested this even Eytzinger was too slow.
indeed, highway is the popularity leader, it implements lots of SIMD operations and supports lots of hardware including SVE/RVV with runtime SIMD width, although IMHO it has a bit longer learning curve
Nifty thing about eytzinger trees is that there's a formula for converting from a node index to its position in an inorder traversal.
This is useful if your primary keystore is a normal sorted set of keys - easier to work with - you then don't need to store explicit pointers.
Also, he didn't mention that with eytzinger you can prefetch multiple loop iterations in advance - use 1-based indexing so that sibling nodes line up nicely on cachelines.
You're right. While I wasn't very explicit about this (because algorithmica and the linked paper spend plenty of words on it already), the code snippet for Eytzinger does indeed do both the prefetching and the formula.
In fact, I'm pretty sure a similar formula could be made to work for higher branching factors, although it would surely be slower. (Probably it depends on the number of times 17 divides the final index it so, which is not great, but with B=15 it would be the number of factors of 16 which is easy again.)
The more annoying issue with not storing everything in the final layer is that we have to keep a running-answer for each query as we go down the tree, which I suspect will add measurable runtime overhead. But maybe that's a tradeoff one is willing to make to avoid the 6.25% overhead.
this is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.
it's interesting that the wins for batching start diminishing at 8. I'm curious then how the subsequent optimizations fare with batch size 8 (rather than 128).
smaller batch sizes are nice since it determines how much request throughput we'd need to saturate this system. at batch size 8, we need 1s / ~30ns * 8 = 266M searches per second to fully utilize this algorithm.
the multithreading results are also interesting -- going from 1 to 6 threads only improves overhead by 4x. curious how this fares on a much higher core count machine.
Just fyi: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic ")
I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small.
That's good though, because that gives CPU time to work on the bigger problem to determine the right queries, and to deal with the outputs (as long as that is not too memory bound in itself, although it probably is).
The number of unique int32 values is not that great, and a full bitset would only be 512MB. Hard to bit a dense bitset in performance.
As a general purpose data structure, I would look at roaring bitmaps for this purpose, which has good trade-offs with great performance for most use-cases.
If only simple lookups are needed over a static set, then it's worth looking at minimal perfect hash functions (https://sux.di.unimi.it).
Hmm, bitmaps is an interesting idea!
If the data is dense enough, then yeah I guess a quick linear scan would work.
There's also the extreme of simply storing the answer for each possible query as a u32 and just index the array, but there the overhead is much larger.
Rank-select is also interesting, but I doubt it comes anywhere close in performance.
I happen to also be working on a minimal perfect hashing project that has way higher throughput than other methods (mostly because batching), see https://curiouscoding.nl/posts/ptrhash-paper/ and the first post linked from there.
Nice work! I love the care and discussion around low level optimizations, as such things are often ignored.
There are a lot of interesting variants of rank/select and other succinct data structures which has interesting tradeoffs. Maybe most useful when compression is critical. At least my own data structures can’t compete with the query performance you are showing, but they are great when the memory footprint matters.
I'm not up to that, but I have a working S-Tree builder and basic AVX2-using `lower-bound` for it (as described in the Algorithmica article linked to in the post) up and running in SBCL. Haven't played with any of the fancier optimizations yet, much less done any benchmarking. Should put it up in a gist or paste site to share...
This is really interesting and a thorough write up. Thanks to the author for sharing their work.
Whenever I read about super low-level optimisation, my immediate feeling is that of gratitude to the author for spending so much time shaving off nanoseconds which the entire SE community gets to enjoy.
I wonder how much time humanity has collectively saved simply as a result of how all these optimisations stack up.
Assuming this is going to save someone 10ns per query in the end and I spent 150h on coding and writing this up, we only(?) need around 10^14 queries to make it worth it!
But yes, as new technologies stack up, we achieve exponential growth in throughput in the end, which usually enables new science :)
Interesting. If you need to support occasional writes, you could have a large static search tree and a small writable tree that you check afterwards. The when you have enough changes, you could occasionally ship those changes into a new version of the static tree
Recently I tried to optimize a set implementation and found that interpolation search works quite well, being competitive with HashSet in C# (both single digit ns for my use case), with zero memory overhead. Unfortunately, it only works well with uniform data, so I guess I'll give static search trees a try. This explanation was clear and extremelly well detailed.
At some point I was playing around with interpolation search on the human genome dataset, and it was really really terrible.
It had some very bad worst case behaviour when the input data has big plateaus and you're searching for something around the edges of the plateau. This can be fixed by only using interpolation for the first few iterations and doing binary search for as long as needed afterwards, but that kills a lot of the predictability of the current approach. So while I really like it in theory, in practice I'm not quite convinced it can be made to work efficiently.
There are tricks: if you need to find uniformly distributed data it is unbeatable: guaranteed to work in 5 hops or less. After 5 hops, is better to use binary search.
If data is not uniformly distributed one trick is to sort by hash of the element (store the hash with the element or calculate on the fly).
But yeah, without those adjustments it has a worst case of O(n).
Given the fact that the keys are fixed-size integers with a not-too-terrible distribution, I would consider chasing the partitioning idea a bit harder, and possibly even removing all the fancy S-tree parts. Some ideas:
For 32 bits, using the first few or even a lot (16? 20?) bits to index an array and then chasing down the remaining bits somehow (adaptively depending on size of that partition?) seems like it could work.
In this direction, and with really impressive asymptotic performance, there’s the van Emde Boas tree and its variants.
It wasn't only the one graph - there's another graph that has a whole spectrum of blue colors - nearly impossible to tell which is which without going in with an eye drop tool.
I found this to really detract from an otherwise interesting post.
You're absolutely right. At some point I spent so long working on the plotting code that I really couldn't be bothered anymore to make it prettier. Hopefully I eventually find some motivation to fix this.
Nice post. Would the larger amount of code result in different performance in a scenario where other code is being run as well, or would the instruction cache be large enough to make this a non-issue?
Jup. I've added to the future work section now that the plan is to use this to speed up suffix array searching. Suffix arrays are pretty regularly used to build indices over say a human genome, that can then be queried.
And since DNA sequencers are still increasing there throughput and accuracy, it's always good to have faster algorithms as well.
I'm using a variant of static search trees in my search engine index though in this scenario the main bottleneck is block reads from disk, but the thinking is relatively similar to cache lines.
Use case is to query hundreds of gigabytes of data in tens of milliseconds.
Is binary search on su ch large integer data sets common? I guess maybe time series data. But I would think that this isn’t amenable to representing strings?
Definitely! If the integers are (truncated) hashes, ranges are completely meaningless.
But what would a range over strings even represent?
Perhaps if you e.g. use string indices into an array of sorted strings (by length, by prefix or whatever), you could use it for something. Depends on the application I guess :)
Range queries over strings are extremely common. For example, databases like cockroachdb and yuggabytedb use the table name as a prefix which then requires a range query to narrow down queries against a specific table.
Thanks! I did spend some time fiddling with CSS to make the yellow highlights so that the titles stand out a bit more, and to improve the code blocks, but otherwise it's a relatively common theme for Hugo. (I think it's linked in the footer.)
The final goal is to index DNA, say a human genome, or a bunch of them, and this is static data.
Then as new DNA comes in (eg is read by a DNA sequencer) we can efficiently query against the index built on the reference genome, and this reference is often fixed.
What a weird thing to complain about. The Rust code in TFA are purely arithmetic and imperative function calls. Any C developer that is good enough to care about the algorithm mentioned will have no issue reading such Rust code, heck, they could even throw the code to chatgpt to translate for them no problem.
Such a good and detailed article that packed with techniques that are applicable everywhere yet the complaints are: “but rust”.
Regarding Python, how could such optimizations be implemented in Python? Generating LLVM bytecodes directly aside.
Fluent in C and C++ among others, know nothing about Rust. It's mostly easy to follow, but I'm having trouble with notation like `&[u32]` and `&[u32; P]`. Arrays that hold unsigned 32 bit integers? But there's `Vec<u32>` which also seems like an array type? Is it like the difference between C++ primitive arrays and std::vector?
For `&[u32]` and `&[u32; P]`, those are both references to an array of unsigned 32 but integers, however the former is to an array of any size (and carries its size at runtime), whereas the latter is to an array with an explicitly known size at compile time. It's unusual in C, I think you can express it as `int[24] *` (or possibly some other permutation, I'm away from a compiler).
Vec<u32> is pretty much like std::vector compared to them, yeah.
I have not studied Rust and find its syntax alien. It looks like a cross between a functional language and C++. Trying to read it is like forcing myself to read Italian. I don’t know Italian, but have studied Spanish and Latin. I can sort of figure things out if I put in effort, but there is always uncertainty when I do. That is the problem with reading Rust.
This could be addressed if I just learned Rust (much like my difficulty with Italian could be addressed by studying Italian). However, I already know at least 10 programming languages. I am unwilling to learn the latest fashionable language unless it can promise me that the language will not change and there will never be another fashionable language for me to learn. Given that Rust rejected the standardization process that C and C++ embraced, and Zig is expected to replace Rust one day, I doubt Rust can make such assurances.
As for Python, it is often used for teaching CS and everything except the assembly code is doable in it. Using low level assembly from Python would require using the FFI. Also, Python has nothing to do with LLVM.
Suck to be you, I guess? If you don’t care enough then you don’t care. You wouldn’t bitch when there’s an article written in Italian with good knowledge inside: “but Italian”. You either trying to read using translation tool or ignore it entirely.
I didn’t say python has anything to do with LLVM, it’s just a technique, read about it, or not.
Zig replacing Rust is not my prediction. It is what I am hearing from the Zig enthusiast I know. I have also seen others express the sentiment that the next fashionable language after Rust will be Zig. Regardless of whether it is Zig, there will be another. Rust is just one part of a never ending cycle of people trying to reinvent programming.
That is my complaint against the next fashionable language. Once you have some level of proficiency in around 10 languages, adding more to the list is a chore. It lacks the enjoyment that came from the first few. At least, that is my experience. That is why I am resistant to learning new languages unless software I want to modify is already written in it. Then, I am forced to learn and time spent on it is time that actually needed to be spent, unlike my brief time spent teaching myself FORTRAN in college, which was largely a waste of time.
Yeah, don’t you think they might be perhaps just slightly biased? I have some news for you about your average technology enthusiast…
Anyone who claims Zig is going to replace Rust knows nothing about either Zig, Rust, or both. Zig is designed to solve none of the problems that Rust solves. Trillion-dollar corporations are rewriting things in Rust because it solves problems that really matter to them.
The US government strongly recommends writing all new mission-critical code in memory-safe languages. Guess which one of Rust and Zig is memory-safe?
I have nothing against Zig, it looks like a very nice better C, and the comptime stuff is both cool and useful. But its priorities are not Rust’s priorities.
> Rust is just one part of a never ending cycle of people trying to reinvent programming.
Which is good. Programming is an open field, ideas are constantly popping in and out and you sometimes need to reinvent things to prove that they are worth it. More languages is a good thing, the bizarre tribalism around them is dangerous, but it is not possible to prevent it, people gotta be people.
> Regarding Python, how could such optimizations be implemented in Python? Generating LLVM bytecodes directly aside.
You don't. You instead use your favorite Python extension system (regular Python modules, Cython, Numba, whatever) to implement this tree algorithm and expose it to Python already shaped into a container.
I don't think that Python would be the right language for such low-level performance maxxing endeavor. I would have picked C++ but t was eye opening for me to see how rust enabled such low level optimization, so I'm grateful for the choice.
C would be useful to the broadest audience. C++ programmers can read C while C programmers cannot always read C++, especially when the newer language constructs are used. I mentioned Python because of its popularity.
Interesting, the Algorithmica article the author cited is in C:
If you want to be pedantic, there is a constexpr in the snippets, wont be accepted by the C compiler. More broadly though it is a C++ tutorial. I insisted on it because the author of algorithmica actually explains he chose C++ and perhaps sometime later C/Rust or something might be better. Should respect his POV.
You have to be 1/3 down the page to see any of this in the code examples. By that point, you would already have seen a number of examples that are valid C code and it is reasonable to expect the rest to be. At least, that was my experience.
As far as I can tell, the community is largely divided into three major groups. Those that can read C, those that can read Python and those that can read both. Using either of them would have dodged criticism that your code examples are not accessible to much of the community.
That said, you are right that Python is not the best language to use when you need to use intrinsics, as you would be writing it in another language and using the Python FFI to access it.
I have a far easier time reading Rust than C. Probably because it’s been years since I’ve had to do C, and I find some of the syntax and patterns quite C-specific and confusing. Python also does not express some of the important implementation details well enough in its standard syntax IMO: there’s no obvious delineation between references and pass by value in Python, borrowing, etc.
So realistically, this is just the other half of the coin for all the articles where the code examples were written in C and everyone who didn’t really read C just had to put up with it.
Rust is now common enough in the industry that everyone should be expected to at least read it, just like every programmer should be able to fizzbuzz in Python and JavaScript just as a matter of general technical literacy. There's no advanced Rust here and no concepts specific to the language.
It's a good article.
If Heinlein were in our industry, he might write this:
A programmer should be able to write a shell, parse binary data, design a protocol (and reverse engineer one too), authenticate a password, not fumble multi-code-point graphemes, wrangle semaphores, plug memory leaks, compress data, migrate databases, automate builds, git rebase a tree, profile tail latency, write a simple design doc for a complex system, draw a triangle, multiply tensors, and do fizzbuzz in every popular language on GitHub. Specialization is for insects.
As the TIOBE index says “The ratings are based on the number of skilled engineers world-wide, courses and third party vendors”. Very many people do care about that. Is there any reason I should think your remark is from some legitimate issue rather than you not liking the index results?
I would argue most people working software jobs either haven't heard of TIOBE or don't care about it. I only occasionally hear about it from strange internet comments that are far too focused on language particulars.
That is surprising. I heard about the TIOBE index often about 15 years ago, but now that I think about it, I have not heard about it lately. I wonder if the rise of Python and ML has caused people to stop asking “what language should I learn”. That is the question the TIOBE index often was cited to answer.
Rust is where the action is though. It's not just about cumulative SLOC weight, but about mindshare among the "movers and shakers" of the industry. Think of it as a prestige language, an acrolect, you should at least vaguely know so that you can participate in the most challenging conversations in the field.
That is an interesting perspective. Here is my perspective. There are no popular projects that use Rust as a primary language. The current fashionable area of computers is machine learning. As far as high level languages go, you will see a mix of Python, C++ and perhaps a tiny bit of C there. Rust is nowhere to be seen.
Your description would be better applied to C, C++ and Python. That is what the majority of popular software uses and people having challenging conversations are often using one of those languages.
> There are no popular projects that use Rust as a primary language.
Servo is pretty notable and visible. ripgrep, influxdb, wasmer, deno. uv and ruff in the python ecosystem are written in rust.
AWS, Cloudflare, Discord (iirc), and Microsoft have all adopted Rust extensively.
As of late 2023 there are committed drivers in the Linux kernel written in rust. (Linux is pretty popular.) prior to those, only C and assembly were allowed.
> There are no popular projects that use Rust as a primary language
You may be surprised to learn, Python's `cryptography` project is written (EDIT: maybe not primarily, it's hard to compare given dependencies) in Rust, and saw around 8.1 million downloads from PyPI yesterday. That puts it close to the top 20 (the 20th project comes in at 8.5m).
If you know C/C++ you could have learned to read Rust in the time you spent commenting in this thread. The number of new syntaxes you need to learn is quite minimal.
The ACM for a time only accepted code examples in ALGOL because it was felt that there should be a standard language that everyone understood. ALGOL eventually gave way to C and others, but there is definite value in ensuring things are accessible to the largest possible audience.
You're free to use ALGOL in your posts if you want. It reads like pseudocode used in algorithmic books, and should be familiar to anyone who used the Pascal family.
ALGOL gave way to C, and C is now giving way to Rust, Zig and others. You're playing the role of the people who wanted to keep the example snippets in ALGOL.
usaco explicitly nudges you towards C++ on the first page itself. problems are graded on running time, even with moderately large datasets, Python simply doesn’t make the cut. So most high schoolers end up using C++ for usaco, and Java for AP Comp Science.
This is someone’s blog, not an ACM publication. Why hold it to the same standard? They can write in whatever they like. I have a hard time reading Haskell, but I don’t sit and kvetch about it anytime someone wants to use it in a blog post.
Erm, the concepts that you're referring to are independent of the language being used. The author is free to use whatever language they want. The GP (I think) was merely trying to make a point about making the article accessible to a wider audience that is capable of appreciating the concepts involved.
If you can't understand the code snippets in this post, the concepts will most likely be lost on you, no matter what other language would be used for the code.
Isn't that precisely the point that the GP was trying to make, about the accessibility of the article?
Edit: Eytzinger layout, Cache lines, SIMD, et cetera are independent of the particular language used in the article. They're just as valid in C, for example. I don't understand your point about them being tied to Rust.
I wonder what percentage of Rust users have significant experience with C. The difficulty curve of Rust was too high for me when I was coming from Python, but after doing HolyC and Zig, I find it fairly straightforward to conceptualize, though still difficult.
I've been working with real C today and have been having fun, but it's very different for me to step back into the C world again after working with more modern tools.
google has a mature C++ library for portable SIMD. The original article seems to be a translation of the excellent algorithmica site which had it in C++.
I've never programmed in rust and i can understand find.
At the end of the day rust is just another imperative programming language. You shouldn't need to know the language at all to understand the very simple examples written in rust.
The problem with pseudocode is that it's completely underspecified. And how would I ever write intrinsics in pseudocode? Much easier to do a proper language directly so people can actually Google things and read their official documentation.
My former roommate works in a similar domain; https://dl.acm.org/doi/pdf/10.1145/3448016.3452841 is an example of a paper implementing intrinsics in pseudocode. They "unroll" the intrinsics with a comment saying that this is implemented with an intrinsic. Of course though, your blog isn't a paper, don't know why people are getting up in arms about it.
Also the other comment saying that "psuedocode is not concerned with intrinsics" is false. You can get "great" theoretical speedups (the shaveoffs are tiny but hey, they're better) with "simple" intrinsic operations - that's my roommate's entire research lol. The external memory model, for example, formalizes caching and allows all these low level optimizations to flourish. I'm not sure how intrinsics tied into it, but he's published so I'm not gonna question it :)
---
Speaking of which, I noticed that you did competitive programming. How does CP compare to research? I loved data structure manipulation problems in CP when they were clever - often because they involved noticing that you can take a "common" model, but then optimize it significantly because you only needed to support a subset of the operations through a clever mathematical proof based on the structure of the problem - but as I got to the higher levels it felt more and more that a lot of them became really obscure "support 413 operations on a tree, and yes, you really need to support all 413 operations on a tree" and that's kind of my opinion of data structure research unfortunately as well :( I guess because solving broad general instances is more important. I'd love to hear your perspective though.
Yeah exactly, pseudocode just doesn't cut it if what matters is exactly the implementation itself.
Indeed I did a bunch of competitive programming! But actually there my favourite topics are combinatorics, graph theory, and number theory. I'd usually leave the datastructure (read segtree) problems to my teammates.
I super enjoyed that, and indeed was looking for a PhD where I could do similar things (because my time at Google was boring in comparison -- mostly just software engineering), on the intersection of new theory and practical fast code. I decided on bioinformatics, because this is exactly a field that has a lot of data, and the amount of data is growing fast, so that fast algorithms&code are needed, both in theory (big-O) and practice. Generally I've been super excited working on various problems in this domain, and I'd say it quite closely matches my compprog experience :)
That's unrealistic given that the author is not just doing computer science but also engineering using SIMD. The kind of available SIMD instructions actually affects code choice.
I strongly prefer actual implementations and adore the way this was written. Maybe its all of the "to get this to run in O(N) time we will leave the trick for the reader to discover" I've bumped into. Or the classic "use this esoteric pay-walled reference for instructions for this important subalgorithm" then when you get that reference you realize its doing the same sort of thing. Super cool for quizzing masters students, pretty ridiculous for people in industry or hobbyists who learn by playing with code. Unfortunately when that happens there is almost never an OSS implementation with the "trick" either.
I suspect like 9/10 of these type of articles are really just meant to be PR for rust. Like someone that wants to write Bart Simpsonensque "rust is great" over and over, but have to hide it into something more interesting.
Great stuff. We get told O notation is what matters for data structures / algorithms, but improvements low level with memory and storage with things like rust is much more where improves can be made. These types of tricks for anctual run times are so valuable, and interesting to follow.
It was great while computers were not really a thing yet, but these days it's often so meaningless. We see papers with 2x speedup with a lot of novel algorithmic stuff that sell better than 10x speedup just by exploiting CPUs to the fullest.
Even myself I kinda think theoretical contributions are cooler, and we really need to get rid of that (slightly exaggerating).
There are two ways to look at this Big O wise. One is that insertions and deletions would be asymptomatically faster since memmove() is a linear operation while bubble up/down are logarithmic operations. Look ups would not be any different asymptotically, but the constant factor might improve from being able to do prefetch. The other way is that the N is bounded, such that it is all O(1) and the difference is how big the constant factor is.
I imagine I could implement it and benchmark it. However, my intuition is that the end result have lookups be marginally faster to the point of splitting hairs while insertions and deletions would be slower. While memmove() is a technically a linear time operation, it is a sequential operation that has a very low constant factor. The bubble up and bubble down operations needed to do insertions and deletions in a Eytzinger ordered array are technically random access, which has a higher constant factor. At some point, the Eytzinger ordered array operations should win, but that point is likely well beyond the size of a b-tree node.
My reason for saying this is to say that Big O notation still matters, but understanding when the constant factor is significant is important.
Same with async and throwing threads at a problem. People love do those and think it's the right answer, but you can do a ton with smarter memory management and actually looking at what the code is doing lower level rather than abstractions.
He is improving the constant factor in big O notation. University algorithm classes tend to ignore cases where the constant factor difference is significant enough to favor a asymptomatically slower algorithm. Matrix multiplication is the quintessential example of this, since a good implementation of the O(n^3) algorithm will outperform asymptotically faster algorithms, such as the famous O(n^2 * log(n) * log(log(n))) one that uses the FFT. At least, it outperforms it on matrix multiplications people actually do in practice.
Rust is... okay. I still prefer C++'s template system to Rust generics, in part because Rust doesn't have specialization and won't for a while. Memory safety by default is a big enough win to make me overlook these nits. Really, though, most people most of the time should be writing managed code.
"Go ahead Junior, please implement this world class work and explain it to me as you go along. Oh, starting salary? No no, this is an unpaid 6 month internship."
Fun aside, asking someone to work through a problem with you is fine, but when it diverges so much from the actual work then it just becomes ridiculous.
And, well the language choice doesn't really matter for the core subject here. I could follow this post fine even though I don't really know Rust (just like I assume a Rust programmer could follow a well written algorithms deep dive with examples in C with a bit of dedication). EDIT okok that was a lie, I could follow the first half of this post, ish, and then it got way too deep for me.
But anyway, standardize on Rust, I like it. I'd love if a bit more Zig was thrown into the mix here but either of them feel like a better default than C at this point to me. After decades of C being The Standard for this stuff, I love that this is finally changing.