
The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.

And for 80% of the cases, by the time there is enough vectorizable data for a programmer to look into simd, a gpu can provide 1000%+ of perf AND a certain level of portability.

So right now simd is a niche tool for super low-level things: certain decompression algos, bits of math here and there, solvers, etc.

And it also takes a lot of space on your cpu die. Like, A LOT.



The vector width stuff is overblown. There are in practice 3 widths for all desktop architectures (128-bit, 256-bit, 512-bit), and the 95th percentile of all desktops falls squarely in the first two buckets; alternatively you're targeting a specific SKU which you can optimize for. You'll have to write uarch-specific code for absolute peak performance anyway. It's annoying, but hardly a deal breaker for this specific task.

The bigger problem is most modern ISAs just aren't very nice to program in. Modern takes like SVE/RVV/AVX-512 are all way better here and much easier to write and think about because their instructions are much more general and have fewer edge cases.

> And it also takes a lot of space on your cpu die. Like, A LOT.

No, it doesn't (and what would even be the "right" amount of die space?). But even if it did, that would not make it a waste of space. Spending some die area can produce huge performance gains for a subset of data-parallel tasks that can't be achieved in any other way. Software cryptography, for example, relies massively on these units, and the benefit is multiple orders of magnitude; even if that's 10% of die space for only 0.1% of software, that software has disproportionate impact. If you took away that functional unit, you might see a huge net decrease in efficiency overall, meaning you need more processors to handle the same workload as before.


I generally agree with your points. On niceness of programming, our Highway SIMD library provides portable intrinsics that cover most of the nice SVE/RVV/AVX-512. Those intrinsics are reasonably emulated on other ISAs. Is it fair to say problem solved?


I do not understand this take :) The vast majority of our code is vector length agnostic, as is required for SVE and RVV.

GPU perf/TCO is not always better than CPU, if you can even get enough GPUs. Transfer latency is also a concern, and an inducement to do everything on the GPU, which can be limiting.


> a gpu can provide 1000%+ of perf AND a certain level of portability.

Relying on the GPU only makes sense in a handful of contexts, e.g. “computing stuff fast is my core task” or “this will be deployed on beefy workstations”. SIMD addresses all the remaining cases and will give benefits to virtually everyone using your software; I'm sure one could implement a GPU-based JSON parsing library that would blow simdjson out of the water, but I'm not going to deploy GTX 1060s on my cloud machines to enjoy it.


Basically everything you said was wrong.

   The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.
Nope, you can use many existing libraries that present the same interface for all sizes, or write your own (which is what I did).

   And for 80% of the cases, by the time there is enough vectorizable data for a programmer to look into simd, a gpu can provide 1000%+ of perf AND a certain level of portability.
Transfer latency & bandwidth to the GPU are horrible, just utterly horrible. And the GPU-to-CPU perf difference is more like 5x, and in games etc. the GPU is already nearly maxed out.

   And it also takes a lot of space on your cpu die. Like, A LOT.
Relative to the performance it can offer, it is a very small area. The gains in the article are small compared to what I see; probably the author is new to SIMD.


Compilers are becoming reasonably good at autovectorization if you’re a bit careful how you write your code. I wouldn’t say that simd is niche. You often don’t get the really great improvements you can achieve by being clever manually, but the improvements are still very measurable in my experience.


IMHO, the overhead of perpetually babysitting compiler diagnostics or performance metrics to ensure your latest update didn't confound the auto-vectorizer is never a net positive over just using something like xsimd, Google Highway, etc.


clang may be getting better, but gcc isn't.

Being 'careful how you write your code' is putting it mildly. IME you have to hold the compiler's hand at every step. Using 'bool' instead of 'int' can make gcc give up on autovectorisation, for example.

You need to use intrinsics, or a wrapper lib around them, if you want to be sure that SIMD is being used.


The compiler can’t fix your data layout


It depends on how high-level the abstraction you're using is. Things like Halide do pretty significant reorganization of your code and data to achieve parallel speedup. And of course if you are using Tensor libraries you're doing exactly that. It all depends on the level of abstraction.


I assume if we're talking about SIMD we're not talking about Python, etc.


You can just use templates or macros to make it length-agnostic.


> The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.

It does not: https://github.com/bepu/bepuphysics2/blob/master/BepuUtiliti... (in this case, even if you use width-specific Vector128/256/512, the compiler will unroll operations on them into the respective 256x2/128x4/128x2 forms if the desired width is not supported. Unfortunately .NET does not deal as well with fallbacks for horizontal reductions/shuffles/etc. on such vectors, but it’s easily fixable with a few helpers.)

Naturally, there are other languages that offer a similar experience, like Zig or Swift.

Moreover, sometimes just targeting one specific width is already good enough (like 128b).

Also, you can’t (easily) go to the GPU for most of the tasks SIMD is used for on CPUs today. Good luck parsing HTTP headers with that.


It's unlikely everyone's cloud machines have GPUs.



