Transcoding Unicode with AVX-512: AMD Zen 4 vs. Intel Ice Lake (lemire.me)
110 points by ibobev on Jan 5, 2023 | 67 comments



> However, we have two popular Unicode formats: UTF-8 and UTF-16

The fact we are still using UTF-16 still irks me to this day. UTF-16 (which is actually two different encodings, not one, hence the need for a BOM) is basically a way to salvage all those platforms that hurried onto the UCS-2 (a.k.a. the original "UNICODE") bandwagon in the '90s, hoping that by just doing s/char/wchar_t/g all their internationalization problems would be solved.

It did not go well, to say the least. UTF-16 and UTF-32 are objectively worse than UTF-8: they are still multibyte encodings, you still have to do normalization, and so on, while also having to deal with "char16_t" and the like (I will not get into the whole "TCHAR" fiasco).

Spoiler alert: the world is still full of allegedly "UTF-16 compliant" platforms that are not, in fact, UTF-16 compliant; they just use 16-bit chars and hope for the best.

The whole idea "1 char = 1 character" is arguably a terrible idea in Unicode, though. You not only can have multibyte characters, but you can also have multirune characters, where multiple codepoints are normalized into a single displayed character (just think about ` + e = è). It's a mess and it's bound to be broken, and there's no real way to "fix that up" - ASCII's assumption of "1 value = 1 char" was the broken concept here, and it unfortunately flawed how every developer (me included) thinks about strings. Sigh.


>The fact we are still using UTF-16 still irks me to this day.

Windows, unfortunately.


UTF-32 is at least directly indexable, even though it's ludicrously space-inefficient.


Only in codepoints, but it still has the problem GP mentions of ` + e = è being two codepoints (so two elements in UTF-32) while being logically one character.

https://manishearth.github.io/blog/2017/01/14/stop-ascribing...


This. It's pointless to have char32_t if you still need to pull in several megabytes of ICU to normalize the string first in order to collapse characters spanning multiple codepoints. UTF-32 is arguably dangerous because of this; it's yet another attempt to replicate ASCII, but with Unicode. The only sane encoding out there is UTF-8, and that's it. If you always have to assume your string is not really splittable without a library, you won't do dangerous stuff such as assuming `wcslen(L"menù") == 4`.
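
To make that concrete, here's a minimal sketch (illustrative only; both literals render as "menù", and every codepoint involved fits in a single wchar_t whether wchar_t is 16 or 32 bits):

```
#include <cstdio>
#include <cwchar>

int main() {
    // The same visible string "menù" can be spelled two ways in Unicode:
    const wchar_t* precomposed = L"men\u00F9";   // 'ù' as one codepoint (U+00F9)
    const wchar_t* decomposed  = L"menu\u0300";  // 'u' + COMBINING GRAVE ACCENT (U+0300)

    // wcslen counts code units, not displayed characters: prints "4 vs 5".
    std::printf("%zu vs %zu\n", std::wcslen(precomposed), std::wcslen(decomposed));
    return 0;
}
```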


AVX-512 is wider, but also needs special instructions to leverage the hardware.

This is unlike the RISC-V V extension, where the same code will run on and utilize the hardware, regardless of vector unit width.


Vector length agnostic programming has its own share of problems. I'm not familiar with the RISC-V V extension, but I assume it's similar to ARM's SVE. There's a good critical look at SVE and VLA here: https://gist.github.com/zingaburga/805669eb891c820bd220418ee...


V extension and SVE2 are very different.

Here is a quite recent introduction to RISC-V Vector[0].

0. https://erikexplores.substack.com/p/grokking-risc-v-vector-p...


I'm curious why you say they are very different? From where I sit, RVV also supports mask-like predication, and adds two concepts: LMUL (in-HW unrolling of each instruction) plus the ability to limit operations to a given number of elements.

The former is nifty, though intended for single-issue machines, and the latter seems redundant because masks can also do that.


Most of what's interesting about avx512 is the new instructions; the wider vectors are just icing on the cake. You would need to rewrite your code regardless.


I wonder to what extent compilers even emit avx512 instructions, apart from the common ones (load, store, shuffle, arithmetic), if you don't want to manually optimize for sse / avx / avx2 / avx512.


How much code is compiled with `-march=native` or function multiversioning? I would guess the percentage is relatively small, at least when it comes to distributed binaries.

Compiler autovectorizers also aren't very good at producing fast AVX512 code, so most of the benefit would probably come from using optimized libraries like Intel's MKL or simdjson.


> How much code is compiled with `-march=native`

Any installation of Gentoo is, presumably. (Otherwise, what's the point of compiling it all yourself?)

More interestingly, possibly all OEM firmware-installed copies of ChromeOS are -march=native builds as well, given that ChromeOS is based off of a Gentoo upstream.


True. I have never gone down the Gentoo rabbit hole. Might be fun to try sometime, but I'd seriously doubt that the time spent compiling would be won back from better performance.

Clear Linux is probably a more practical alternative. I used it a couple years ago, and found that they had a lot of avx2 and avx512 versions of random libraries built, with the appropriate ones presumably being loaded based on the hardware.

Random glibc math function calls, for example, were much faster on Clear Linux than Arch or Fedora. But development of Clear seems to have stopped; libraries like llvm aren't being updated anymore, so the toolchains are outdated. I'd wanted to avoid the blood and sweat of managing my own toolchains, and ironically being on bleeding-edge distros (Arch, Fedora, etc.) was the way to keep that to a minimum. Next time I reinstall an OS, I'll look at Clear again. Or maybe Guix or Nix. Or maybe use spack for package management on top of some other distro.


When I ran Gentoo I just had the builds running in the background with a really high niceness.


Even if you use the GNU C vector extension to explicitly hand the compiler vectorizable C, it is not very good at generating good vector code:

https://github.com/openzfs/zfs/pull/14234#issuecomment-13345...

A bug report has been filed with GCC for one of the issues. LLVM is much better here, but not perfect, or at least that has been my experience when trying to have the compiler generate assembly for an explicitly vectorized fletcher4 implementation.
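
For reference, the extension in question looks like this (trivial add example, nothing to do with the fletcher4 code):

```
// GNU C vector extension: a 64-byte (16 x uint32) vector type.
typedef unsigned int v16u32 __attribute__((vector_size(64)));

// Element-wise add; which instructions you get (and how good they are)
// is entirely up to the compiler backend.
v16u32 add(v16u32 a, v16u32 b) {
    return a + b;
}
```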


Noob question: must one write avx512 assembly directly by hand, or is this something a C compiler would do for you?


Generally you are better off coding with "intrinsics", compiler extensions that represent the instructions more symbolically, if in fact the compiler offers what you need.

I am not sure the really interesting AVX-512 instructions have intrinsics yet. For those it's asm or nothing.


Potentially both. Most compilers have vectorization optimizations if you compile for an architecture that supports it.

However, a lot of software is compiled on one machine to be run on many possible architectures, so it targets a lowest-common-denominator arch like baseline x86-64. That will have some SIMD instructions, but (I don't think) AVX-512.

So if a developer wants to ensure those instructions are used where they're supported, they'll write two code paths: one path will explicitly call the avx512 instructions with compiler intrinsics, and the other path will just use plain code and let the compiler decide how to turn it into baseline x86-64 instructions.
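
A rough sketch of that pattern (the function and loop are made up for illustration; the runtime check via __builtin_cpu_supports is GCC/Clang-specific):

```
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// AVX-512 path: only selected at runtime when the CPU reports support.
__attribute__((target("avx512f")))
static void add_one_avx512(uint32_t* p, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512(p + i);
        _mm512_storeu_si512(p + i, _mm512_add_epi32(v, _mm512_set1_epi32(1)));
    }
    for (; i < n; ++i) p[i] += 1;               // scalar remainder
}

// Portable path: plain code the compiler may autovectorize for baseline x86-64.
static void add_one_portable(uint32_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i) p[i] += 1;
}

void add_one(uint32_t* p, size_t n) {
    if (__builtin_cpu_supports("avx512f"))
        add_one_avx512(p, n);
    else
        add_one_portable(p, n);
}
```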


thanks for that! so it sounds like, if i purchase a chip that supports avx512, and run an operating system and compiler that supports avx512, i can write "plain old c code" with a minimal amount of compiler arguments and compile that code on my machine (aka not just running someone else's binary). and then the full power of avx512 is right there waiting for me? :)


A compiler turning C(++) code into SIMD instructions is called "autovectorization". In my experience this works for simple loops such as dot products (even that requires special compiler flags to allow FMA and reordering), but unfortunately the wheels often fall off for more complex code. Also, I haven't seen the compiler generate the more exotic instructions.
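
For example, a plain reduction like this only vectorizes well with something like -O3 -march=... plus -ffast-math (or at least flags that permit FMA contraction and reassociation of the sum):

```
#include <cstddef>

// Scalar reduction: without permission to reassociate, the compiler must keep
// the strict left-to-right FP addition order, which blocks SIMD.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```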


You should use Intel intrinsics - generally, they are supported by all compilers.

E.g. https://www.intel.com/content/www/us/en/develop/documentatio...


if you are targeting more than one specific platform, do you like, include the immintrin.h header and use #ifdef to conditionally use avx512 if it's available on someone's platform?
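
e.g. something roughly like this (completely untested sketch)?

```
#include <cstddef>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

void negate(float* p, std::size_t n) {
#ifdef __AVX512F__
    // Only compiled when this TU is built with -mavx512f / a -march that implies it.
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(p + i, _mm512_sub_ps(_mm512_setzero_ps(), _mm512_loadu_ps(p + i)));
    for (; i < n; ++i) p[i] = -p[i];
#else
    for (std::size_t i = 0; i < n; ++i) p[i] = -p[i];
#endif
}
```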


It would be simpler to use the portable intrinsics from github.com/google/highway (disclosure: I am the main author). You include a header, and use the same functions on all platforms; the library provides wrapper functions which boil down to the platform's intrinsics.
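
Roughly like this (a static-dispatch sketch with the per-target attribute plumbing elided; the library also supports dynamic dispatch to the best available target via foreach_target.h):

```
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// The same source compiles to SSE4, AVX2, AVX-512, NEON, ... depending on target.
void AddOne(float* HWY_RESTRICT p, size_t n) {
    const hn::ScalableTag<float> d;          // "give me the full native vector"
    const size_t N = hn::Lanes(d);           // lane count is a target property
    size_t i = 0;
    for (; i + N <= n; i += N) {
        const auto v = hn::LoadU(d, p + i);
        hn::StoreU(hn::Add(v, hn::Set(d, 1.0f)), d, p + i);
    }
    for (; i < n; ++i) p[i] += 1.0f;         // scalar tail
}
```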


Well, there is a SIMD proposal for C++23 with a kind-of-reference implementation. But I don't know how well it works for AVX-512.


From what I have seen, this is unfortunately not very useful: it mainly only includes operations that the compiler is often able to autovectorize anyway (simple arithmetic). Support for anything more interesting such as swizzles seems nonexistent. Also, last I checked, this was only available on GCC 11+; has that changed?


I think the proposed Vc lib is tested under clang as well.


Here is my source: https://github.com/VcDevel/std-simd

Ah, but this repo mentions that the GCC 11 implementation apparently also works with clang: https://github.com/VcDevel/Vc. Thanks!


I wonder how much compilers could be improved with AI?

I'd imagine outputting optimized avx code from an existing C for() loop would be much easier than going from a "write me a python code that..." prompt.


Typically, if it's available, compilers will use the avx512 register file. This means you'll see things like xmm25 and ymm25 (128- and 256-bit registers), and those are avx512-only. However, compilers using 512-bit-wide instructions is kinda rare from what I've seen.


You can use `-mprefer-vector-width=512` to use 512 bit vectors, or if you want a particular function to use 512, you could try the min-vector-width attribute: https://clang.llvm.org/docs/AttributeReference.html#min-vect...

In my experience, clang unrolls too much, so you end up spending all your time in the non-vectorized remainder. Using smaller vectors cuts the size of the non-vectorized remainders in half, so smaller vectors often give better performance for that reason. (Unrolling less could have the same effect while decreasing code size, but alas)
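
e.g. a hypothetical sketch of the attribute route (clang syntax; whether it actually helps depends on the loop and target):

```
#include <cstddef>

// Hint to clang's x86 backend that this function wants 512-bit vectors,
// even if the global -mprefer-vector-width preference is lower.
__attribute__((min_vector_width(512), target("avx512f")))
void scale(float* p, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= s;
}
```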


so then, if i want my code to "explicitly" use avx512, i have to do something like this?

``` void myNotOptimizedThing(my_data* d){ _SPECIAL_CPU_MANUFACTURER_0X3D512(d); } ```

edit: and include some header from the manufacturer most likely?


without using intrinsics? `-O3 -march=skylake-avx512 -mprefer-vector-width=512`


Using the RISC-V V vector instructions means that the underlying hardware vector width can change and the code will automatically take advantage of the larger width.
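
For instance, with the RVV C intrinsics (a sketch assuming a toolchain that ships the ratified __riscv_-prefixed intrinsics), the same strip-mined loop runs unchanged on any VLEN:

```
#include <riscv_vector.h>
#include <cstddef>
#include <cstdint>

// Classic strip-mined loop: vsetvl returns however many elements this
// particular hardware can process per iteration, whatever its vector length is.
void add_one(uint32_t* p, size_t n) {
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m8(n - i);
        vuint32m8_t v = __riscv_vle32_v_u32m8(p + i, vl);
        v = __riscv_vadd_vx_u32m8(v, 1, vl);
        __riscv_vse32_v_u32m8(p + i, v, vl);
        i += vl;
    }
}
```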

That said, many of the avx512 instructions are simply extended-width AVX2 instructions. The interesting things about it are really the increased width and the additional registers. Not many of the new instructions that aren't bit-width-extended versions of the old ones are particularly interesting, since Intel had already implemented most of the interesting things for smaller vector widths.


I've only scratched the surface of the avx512 instructions, but they are much broader and more useful. Masked gather, scatter, double-precision exponent and mantissa extraction, and floating-point-to-integer conversions are all new and all proving useful to me.
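
For example (a sketch; getexp/getmant are AVX-512F instructions with no AVX2 equivalent):

```
#include <immintrin.h>

// Split 8 doubles into exponent and mantissa parts in two instructions,
// roughly a vectorized frexp (conventions differ slightly from frexp's).
__attribute__((target("avx512f")))
void split_exp_mant(const double* in, double* exp_out, double* mant_out) {
    __m512d v = _mm512_loadu_pd(in);
    _mm512_storeu_pd(exp_out,  _mm512_getexp_pd(v));   // floor(log2(|x|))
    _mm512_storeu_pd(mant_out,
        _mm512_getmant_pd(v, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src));
}
```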


Vectorizing text handling is likely to use generalized register permutes (and I spy some _mm512_shuffle_epi8 here), which are the bugaboo in length-agnostic SIMD. Fundamentally, the maximum index you can read from in a register depends on the register size.

So yeah, even in RISC-V V, vrgather has explicitly different per-element operation depending on VLMAX, which obviously depends on the HW's VLEN. So depending on the table size, you have to assume constraints on VLEN or execute different permute sequences.


Is there a RISC-V chip with SIMD available for purchase with roughly comparable price/performance to current Intel/AMD offerings?


If you specifically prefer SIMD over Vector, Andes has offerings with the draft P extension.

If you otherwise want vector (V extension), "right now" would limit you to pre-1.0 V extension implementations.

If you need to license hardware IP, there are several very high performance implementations as of RISC-V Summit[0]. Actual hardware will pop up throughout 2023.

0. https://www.youtube.com/@RISCVInternational/videos


512 bits is 64 bytes, a cache line on x86_64.


> These results suggest that AMD Zen 4 is matching Intel Ice Lake in AVX-512 performance. Given that the Zen 4 microarchitecture is the first AMD attempt at supporting AVX-512 commercially, it is a remarkable feat.

Um... Ice Lake shipped over three years ago. I mean, there's a real question as to whether or not "senselessly wide SIMD" is a good or bad feature in a datacenter part, and how or whether AMD should attempt to implement it and within which market sectors. And surely there's discussion to be had about the design tradeoffs to be made chasing after this nonsense.

But, no, performance parity with chips that are nearing end of life has to be viewed as table stakes here. It's certainly not "remarkable".


The Ice Lake chips being benchmarked against are server chips, while the 7950X Zen 4 chip used is a consumer chip. So while Ice Lake has been out for a while, it's also several times more expensive. It's also worth noting that it took Intel several generations of AVX512 to get it working well, so AMD doing it on the first try really is impressive (even if they did cheat by just having AVX512 be double-pumped AVX2).


Double-pumped is fine: what matters is the new semantics implemented that cannot be expressed efficiently in previous ISAs.


Whether it's comparing latest gen architectures against old architectures or comparing consumer CPUs against enterprise CPUs (or an unholy combination of both), it's all insincere hogwash.

Comparing apples to oranges is not how you determine how good a peach is.


Until the Intel Sapphire Rapids server CPUs are launched later this month, the Ice Lake/Tiger Lake/Rocket Lake microarchitecture is the best AVX-512 implementation available from Intel, 15 years after this ISA was publicly disclosed.

Those differ only in the clock frequency and in a few details that are irrelevant for this particular benchmark.

So the comparison normalized by clock frequency, as done here, is legit; no better comparison is possible for now.

Even if someone had a Sapphire Rapids sample, they would not be allowed to publish any benchmark yet.


Intel also decided to disable AVX-512 on their consumer CPUs going forward, presumably as long as their P+E core strategy remains in place.


I'm of the understanding AVX-512 is available on Alder Lake and up with an appropriate BIOS and the E cores (if applicable) disabled.

I never looked into it in detail, so I could be mistaken.


Only on older CPUs; they started fusing off AVX512 on newer silicon batches (even on 12th gen).


Is there a source for a decision having been made for _all_ their consumer CPUs, not just ADL?


Wonder if this piece by Linus about AVX512 is still relevant https://news.ycombinator.com/item?id=23809335


AVX-512 has one major problem and a bunch of minor ones. The major problem is that most computers still don't have it. Intel tried to segment their lineup and only put AVX-512 in their high-end server CPUs for the first two generations that had it, but as a result normal programmers didn't have access to it, compiler devs didn't have access to it, and users didn't have access to it. As a result, most compilers don't do a good job generating AVX-512 code, and most programmers think AVX-512 isn't useful.

AVX-512 is great. The new instructions are incredibly useful for a wide variety of applications, but the fragmentation and segmentation by Intel has made it a total mess.


Eh, Intel. Reminds me of how they arbitrarily fused off virtualization on like half of their desktop CPUs.

But hey this time it literally gave AMD time to catch up...


Hopefully it's available in AMD CPUs from now on. I already got it with 7950X and I am looking forward to trying it out!


It’s not, as long as you’re using a recent CPU architecture. I think the slowdown problem was mostly related to Intel’s first version of AVX512, and the underlying issue has since been addressed (at least to the point that it’s not nearly as much a problem as it used to be).

This is also why it’s impressive that this is only AMD’s first attempt: it appears to work really well, where it took Intel multiple attempts to get it working well.


Before we consider the technical merits, let us note that Linus admits to "irrational hatred" and "bias" on this topic. It's also not clear to me how much experience he has developing and testing AVX-512.

As others mentioned, throttling is basically nonexistent on Icelake (and AMD Genoa). It can hurt on Xeon Silver (so let's not use those?) and if you only sporadically use SIMD instructions (again, don't do that).

I claim that just about any reasonable code which sustains SIMD instructions over several milliseconds would still be a net win even with throttling.

Un-nuanced concerns about throttling are outdated and unhelpful. Perhaps I'll write up a paper on this.


The hatred is probably not that "irrational". We live in an era where specialized hardware for specialized problems is required, because new manufacturing processes may give us a (seemingly slowing) increase in transistor budget but not really better switching frequencies. We will have units for matrix multiplication, video codecs, AI cores or full-blown GPUs. All those units can only be fast as specialized hardware because of predictable memory access patterns and a memory/cache topology arranged accordingly, "solving" the problem of low switching frequency with high bandwidth. A general purpose CPU however should specialize on unpredictable memory access. This means AVX-512 is somewhat misplaced on a CPU and probably only exists because it served Intel to create nice numbers in irrelevant benchmarks.


The hatred is specifically directed at "FP" which I understand to be floating-point. Makes sense inside an OS kernel but a large majority of HPC would indeed consider this irrational.

I understand that dark silicon is helpful, but am not so sure that fixed-function HW is the way to go. Perhaps video _de_coding is the most convincing from your list; codec generations are 5+ years, so enough time to benefit from HW. Encoding, on the other hand, tends not to be impressive unless perhaps there is also a software component.

For the rest, programmability and deployability (can we rely on it?) is a major issue. Software has often been the limiting factor.

Another big concern is the 'hardware lottery'. The algorithms we develop and get are selected for, and tuned to, the current hardware. Perhaps this gets us 5x energy efficiency vs CPU/SIMD. But by painting ourselves ever further into the corner of dense linear algebra, which is definitely not the way that nature implements intelligence, we are missing out on far larger opportunities. For example: spiking nets or memristors have the potential to be 2 or 3 orders of magnitude better. Or actual sparsity, not the fixed-pattern thing (now that is a prime example of an irrelevant benchmark, because AFAIK algorithms haven't yet been able to use them well).

> A general purpose CPU however should specialize on unpredictable memory access.

Should it really? I think rather we should avoid such accesses whenever possible, because their energy cost now dwarfs that of computation.

> This means AVX-512 is somewhat misplaced on a CPU and probably only exists because it served Intel to create nice numbers in irrelevant benchmarks.

I have difficulty understanding how a reasonable person can come to such a conclusion. Lemire (the author linked here) has a long series of results showing nice speedups from AVX-512. I personally have seen gains in image compression, string processing, cryptography, linear algebra, integer coding, hash tables, databases, sorting, and compression.

[Opinions are my own.]


> I have difficulty understanding how a reasonable person can come to such a conclusion.

The applications are very niche. Compilers are usually not smart enough to utilize SIMD; it is hit or miss. And in order to implement properly efficient SIMD algorithms you need experts, which are rare. Furthermore, many algorithms that work great with SIMD work even better as a compute shader on your run-of-the-mill cheap iGPU.

The application in this article is the best example of how irrelevant SIMD really is: how many terabytes of UTF-8 are you converting to UTF-16 per day? Probably zero.


What leads you to think the list of applications I enumerated is 'niche'?

> in order to implement properly efficient SIMD algorithms you need experts that are rare

Some truth to this, but many algorithms can be implemented once and then reused, like a standard library.

> many algorithms that work great with SIMD work even better as compute shader on your run of the mill cheap iGPU

Also agree to some extent, except that you'd have more concerns about availability, vendor lock-in, and performance portability.

> best example how irrelevant SIMD really is: How many Terabytes of UTF8 are you converting to UTF16 per day? probably zero.

First, how does one example of a SIMD-enabled algorithm show that SIMD itself is irrelevant? Second, have you considered that some databases store UTF-16 and want to convert it for interoperability (or vice versa)? IBM apparently has dedicated instructions for this. Would they have been added if there was no demand?


I saw this once before, and both times, it's pretty shocking. Is this really something that needs to be inside the CPU itself? I don't want my CPU doing this. I would rather just take the performance hit and keep the CPU "dumb".


Are you saying you don't want your CPU to have SIMD/vector capabilities, or you want it to have a limited SIMD instruction set without the extra flexibility that AVX-512 brings, or have you wildly misinterpreted the headline (twice?) to assume that AVX-512 adds special-purpose instructions for Unicode conversions?


Any instructions beyond NAND are clearly bloat.


I think the parent must not understand what SIMD/vector operations are, and probably thinks there is native Unicode support in the CPU. At least, that’s my most favorable interpretation of his critique.


> keep the CPU "dumb".

a 240+ entry reorder buffer has entered the chat.


> keep the CPU "dumb"

Users started dumb, developers became dumb, now you want CPUs dumb too? Who will deal with the consequences??


x86 CPUs have been essentially black magic for a while now. If anything, what's described here is on the dumber side for things an x86 CPU from the past decade or two is doing behind the scenes to operate as fast as it does.


You can have "dumb" CPUs all you want, just stick with Z80, 6502, 6809. There's still plenty of fun using those and you can hold most of what they do in your head. Add 68000 and MIPS R2000 for some 32 bit fun if you wish. Anything past that level of tech uses brainy tricks to get the performance going.


This is a dumb feature for a modern CPU; the really smart part is out-of-order execution and the other optimizations that CPUs do behind the scenes.



