> OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware
One thing I really love about XLA is GSPMD, which effectively enables scalable distributed training in practice. However, I was quite curious how it relates to matrix multiplication, given that XLA focuses more on graph-level optimization and basically offloads matmuls to other libraries like Triton and cuBLAS.
I used to think this. And I think, in theory, it is true. But the fact of the matter is, modern ML just doesn't use that many kernels. Every framework uses the same libraries (BLAS) and every library uses the same basic idea (maximally saturate FMA-like units).
Large language models are being run natively on commodity hardware with code written from scratch within days of their release (e.g. llama.cpp).
From a conceptual standpoint, it's really easy to saturate hardware in this domain. It's been pretty easy since 2014, when convolutions were reinterpreted as matrix multiplications. Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific hardware target in a couple of months.
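For anyone who hasn't seen it, the conv-as-matmul trick (im2col) looks roughly like this; a minimal NumPy sketch of the 'valid', stride-1 case, with illustrative names, not production code:

```python
import numpy as np

def conv2d_as_matmul(x, w):
    """'Valid', stride-1 convolution expressed as a single matrix multiply via im2col."""
    C, H, W = x.shape            # input: channels, height, width
    K, _, R, S = w.shape         # filters: out-channels, in-channels, kernel height/width
    OH, OW = H - R + 1, W - S + 1
    # im2col: copy every receptive field into a column of one big matrix
    cols = np.empty((C * R * S, OH * OW), dtype=x.dtype)
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + R, j:j + S].ravel()
    # the convolution is now one GEMM: (K, C*R*S) @ (C*R*S, OH*OW)
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, OH, OW)
```

After that transformation, the heavy lifting is a single large matmul, which is exactly the kind of problem BLAS-style libraries already saturate well.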
Of course, the interesting problem is how to generalize kernel generation. I spent years working with folks trying to do just that. But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept :'(
We... do use kernels and kernel generation in the ML field. Every day. I'm confused by a few of the points made here, unless the first sentence means '...doesn't use that many hand-written kernels'.
All of the convolutions we have run on kernels, some pre-built and customized/chosen from a list based on performance, and some dynamically generated. PyTorch 2.0 for example decomposes and fuses operations, then uses OpenAI's Triton to dynamically generate a custom fused kernel that tends to be very efficient.
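For example, something like this (a minimal sketch; on a CUDA build, TorchInductor will typically fuse the add and GELU into one generated Triton kernel, which you can inspect with TORCH_LOGS=output_code):

```python
import torch

def bias_gelu(x, bias):
    # Two elementwise ops that eager mode would launch as separate kernels
    return torch.nn.functional.gelu(x + bias)

# torch.compile captures the graph, decomposes/fuses it, and (on GPU) hands
# the fused elementwise region to Triton for code generation.
compiled = torch.compile(bias_gelu)

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = compiled(x, bias)
```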
There are still hand-written kernels, too: the FlashAttention and Memory-Efficient Attention papers both caused huge leaps forward because they manually worked through a lot of the inefficiencies of naive attention matrix multiplies w.r.t. the hardware design and optimized them quite a lot.
I think generalized kernel generation though may have more life in it than you might suspect! It is a fascinating field and I do not know nearly enough about it. I hope someday to be able to write my own Triton kernels/get to know how it integrates as a dynamic compiler for PyTorch code. We certainly live in wild times. Crazy indeed.
My claim is pretty subjective, but the idea is that there aren't many distinct kernels used in machine learning. It's all tensor contractions and element-wise operations. I'd argue that this can be maintained by hand without need for automation or high level abstraction.
Generalized kernel generation (i.e. synthesis of optimal performance from non-expert user defined kernels and novel hardware) would be fantastic to have, but it just doesn't seem particularly necessary in the field.
> Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific hardware in a couple months.
I want to agree with you on this, but in practice, it's...
1. Hard to hire that engineer with deep expertise in hand-written kernels. CUDA engineers are still hard to come by, and their supply doesn't scale with productionized AI engineering demand.
2. "A few months" is a tough pill to swallow from an engineering roadmap POV, especially when models are deployed on a monthly basis. Most of the hand tuning efforts aren't scalable and will have to be done again on most iterations. This is especially true in reinforcement learning and robotics.
> But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept.
Yeah, I remain neutral on this. On one hand, I can see it, especially given the significant engineering effort required (see point 2 above). On the other hand, you won't really know until you start benchmarking these models (as you should).
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."
By committing it to a common library that a lot of people use? There are already multiple libraries with optimized matrix multiplication.
This is also exaggerating the expertise required. I'm not going to claim it's trivial, but you can genuinely google "intel avx-512 matrix multiplication", and find both papers and Intel samples.
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."
Naively, I wonder if this is the kind of problem that AI itself can solve, which is a rather singularity-approaching concept. Maybe there's too much logic involved and not enough training data on different configurations for that to work? The thought of self-bootstrapping AI is a bit spooky, though.
There has been work on using AI for this at various levels: at the neural architecture level (finding neural architectures with high throughput/low latency for given hardware), at the algorithm level (finding faster matrix multiplication routines), and at the hardware level (IIRC Google stated that the latest generation of TPUs was partially designed with AI).
My take: optimizing matrix multiplication is not hard on modern architecture if you have the right abstraction. The code itself could be fragmented across different programming models, which is true, but the underlying techniques are not hard for a 2nd/3rd year undergrad to understand. There are only a few important ones on GPU: loop tiling, pipelining, shared memory swizzle, memory coalescing. A properly designed compiler can allow developers to optimize matmuls within 100 lines of code.
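To make that concrete, here's roughly what I mean: a bare-bones tile-level matmul in Triton (close in spirit to Triton's own tutorial kernel; block sizes and names are illustrative). The tiling is explicit, while vectorization, coalescing, and tensor-core lowering for `tl.dot` are left to the compiler:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)              # which output tile this program owns
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):        # march along K, tile by tile
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)               # lowered to MMA/tensor-core instructions
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(rm[:, None] < M) & (rn[None, :] < N))
```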
Looking at the effort plunked into things like cutlass, which still doesn't reach cuBLAS perf (which very few can beat - in the places where cuBLAS shines! which is... not that many...), and even in cuDNN they're still eking out single-digit improvements regularly, I'd say this is probably harder than that. At least if you're reaching for >50% use of the 37 TFLOPS of an A40. If you're fine throwing more GPUs at the problem, sure.
Edit: I mean, when you still see papers every year with large improvements in perf, and things like 'we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things' (what?), I can attest it doesn't take 2 weeks to get that kind of result. And it's just getting started on tensor cores! And when someone on the nvidia forums says 'nah, probably no improvement from using tensor cores for FFT' and you get a link to a paper showing a significant perf improvement using tensor cores, I say we're just getting started.
This is definitely a great point! In the context of AI workloads, where the critical matmuls basically have regular, large shapes, are there many cases where cutlass/Triton are worse than cuBLAS and we need to throw more GPUs at the problem?
cuBLAS is very often too heavy (too much overhead, memory movement to fit the API, not optimized for small batches of small matrices), and you can get huge improvements by chaining cudnn/cutlass/autotuned kernels. Especially if you're still on GDDR6, every data movement is a killer, so if you can put it all together and never go back to global memory, you get amazing improvements. And this is without tensor cores. Programming them by hand is a pain, so here enters cutlass...
Yeah cuBLAS is definitely not perfect in many cases :-((
Speaking of the GEMM fusion you mentioned, flash attention is basically GEMM fusion with online softmax, right? This is something I believe is really cool and can be made really easy with a proper abstraction. Say, you move a chunk of computation under a certain loop and instruct the compiler to optimize data movement or cache intermediate tiles somewhere on chip.
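For context, the online-softmax part is just this recurrence; a NumPy sketch for a single query row with illustrative names (the real kernel does this per tile in on-chip memory, fused with the two GEMMs):

```python
import numpy as np

def attention_one_row(q, K, V, block=128):
    """softmax(K @ q) @ V computed in one streaming pass over K/V tiles."""
    m = -np.inf                       # running max of the logits seen so far
    l = 0.0                           # running sum of exp(logit - m)
    acc = np.zeros(V.shape[1])        # running (unnormalized) weighted sum of V rows
    for s0 in range(0, K.shape[0], block):
        s = K[s0:s0 + block] @ q                  # logits for this tile
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale everything accumulated so far
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s0:s0 + block]
        m = m_new
    return acc / l                    # identical to softmax(K @ q) @ V
```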
There's something of this in cutlass with prologues and epilogues, and in the 'backend mode' of cudnn, but overall breaking the 'cuBLAS takes your whole device and tries to saturate it for this one matmul' model is going to require an awful lot of abstraction work.
Cutlass is supposed to be the first step, and to anyone who struggles to understand WTF you're doing when using it: you are not alone. I've seen literally amazing, room-silencing stuff with it, but heavy template stuff is really not my thing.
I am personally a really huge fan of cutlass, and I've read almost every single file in their `include/cutlass/` folder (haven't followed up with the `cute` stuff yet).
Just like you said, I really appreciate that with cutlass we can actually understand what is going on inside the kernel, and customize it in ways that cuBLAS doesn't necessarily allow.
I have to agree with you that the template stuff is really annoying. Even with some template tricks, the error messages are still barely readable, and that is where I think a better abstraction could help. Imagine an abstraction that handles those threadblocks/warps/etc. in a unified way while generalizing to more backends (AMDGPU, Vulkan, AVX512-VNNI, etc.), and that provides friendlier error messages during compilation, given that the abstraction is almost certainly more structured than pure C++ code.
I know what you mean. I guess Triton might be one way people are actually trying that, and beyond that there are a lot of people putting years of work into MLIR-based tech, trying to separate, in a retargetable way, the algorithm from its 'scheduling'. Might be worth a look if you're into this :-)
Ah nice! I intentionally didn’t talk a lot about “scheduling” because 1) I’m personally heavily working on it, which potentially creates a conflict of interest, and 2) I don’t want to deviate too much from this thread’s topic of “optimizing matmuls”. Check out my Google Scholar for more details though!
I love MLIR and Modular, so please do share more about it! If it’s potential distraction from this thread, I’m also open to email communication if you are interested!
Oh btw, to clarify, I’m not saying Triton is an ideal abstraction. I love it, and it’s super popular because it’s the most user-friendly option for ML researchers to write performant kernels on certain GPUs, but from an MLSys researcher’s perspective, I’m personally more ambitious and want to target a broader range of hardware. Also, I really appreciate Philippe’s work making Triton really performant and easy to use.
> we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things
Hey, are you referring to 3xTF32 (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)? IMO this is a perfect example of where a proper abstraction could save engineers a non-trivial amount of time - imagine a compiler stack that allows 3xTF32 as a normal dtype and keeps subsequent analyses compatible with this special dtype :-)
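For anyone following along, here's a rough NumPy emulation of the numerics (TF32 keeps 10 mantissa bits, so each fp32 operand is split into a TF32-representable 'hi' part plus a residual 'lo', and you spend three tensor-core GEMMs instead of one). This is only a sketch of the idea, not how cutlass implements it:

```python
import numpy as np

def to_tf32(x):
    """Emulate TF32 by zeroing the low 13 mantissa bits of fp32 values."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

def matmul_3xtf32(A, B):
    """Approximate an fp32 GEMM with three TF32 GEMMs (the '3xTF32' trick)."""
    A_hi = to_tf32(A); A_lo = to_tf32(A - A_hi)
    B_hi = to_tf32(B); B_lo = to_tf32(B - B_hi)
    # a*b ~= a_hi*b_hi + a_hi*b_lo + a_lo*b_hi   (the lo*lo term is negligible)
    return A_hi @ B_hi + A_hi @ B_lo + A_lo @ B_hi
```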
> A properly designed compiler can allow developers to optimize matmuls within 100 lines of code.
man this is such a funny closing comment - what exactly do you think is involved in designing a compiler that enables devs to optimize matmuls if not 1000s of person hours/years/etc of very "fine-grained" perf research?
what the "abstraction" people don't understand (because they only deal in abstractions) is that achieving performance involves literally the antithesis of abstraction - you need to understand your hardware down to the gate level (sometimes).
have you ever applied any of these? the only way you could apply these as a generic (without consideration of your particular hardware) algo is using a tuner; this is of course widely the route taken but that's not an "understanding" of anything except guess and check.
Yes. I am the first author of the latest generation auto-tuner in an open source deep learning compiler :-)
My comment is based on my personal experience: I did lead a 2nd/3rd year undergrad to add software pipelining support, and it worked within 1 month; we did get cutlass-level performance within 100 lines of code specifying the design space.
yup exactly; it's like other comments on hn about nn frameworks:
"abstraction is the most important thing - look at pytorch it's the best framework because of the perfect/beautiful/brilliant abstractions" (re functorch or fx or dynamo).
ignoring entirely how much tedious and grueling bookkeeping/corner-casing/kernel-tuning (by a perpetual staff of hundreds of full-time engineers) presenting such an "abstract" interface to the user requires.
Hey I am the first author of one of the "abstractions", so I guess my words would more or less reflect my personal daily experience dealing with those lovely kernels. Well, I don't have 100 engineers working for me, unfortunately :-(
Let's instead constructively talk about techniques in concrete terms. If you look at OpenAI's Triton (which is also a small team of < 5 core contributors), what is its abstraction and its key to high performance? It's a tile-based programming model, where a tile can be conveniently lowered to vector instructions and coalesced memory accesses, and transformed into permuted layouts. Its `dot` on tiles can be directly lowered to TensorCore-specific instructions. With that design, and without a huge team painfully maintaining the system, critical kernels like FlashAttention can be developed quickly, within say 30 lines of code.
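As a small taste of that tile model, here's a row-wise softmax; each program owns one row-sized tile, and the compiler decides vector widths, coalescing, and scheduling (block size and names are illustrative, and BLOCK is assumed to be a power of two at least as large as the row length):

```python
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)                      # one program per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)                   # numerically stable softmax on a tile
    num = tl.exp(x)
    tl.store(out_ptr + row * n_cols + cols, num / tl.sum(num, axis=0), mask=mask)
```

A Triton FlashAttention kernel is essentially the same pattern with two `tl.dot`s and the online-softmax rescaling inside the loop over key/value tiles.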
>Hey I am the first author of one of the "abstractions"
I know who you are and you should probably be out in the open with the fact that you have a conflict of interest in working at octo, a company that sells a very specific type of ML compiler.
>Let's instead constructively talk about techniques in concrete items. If you look at OpenAI's Triton
Pretty ironic that you would call out Triton as being the right abstraction, because while it is true that Philippe did a very good thing by moving things from the warp level to the block level, there is absolutely no one (myself included) who thinks that Triton is an abstraction.
I’m using my real name, so it’s not hard to know who I am.
Unfortunately, I don’t know much about you, and actually I don’t really think there is a conflict of interest if you work at Modular, because Modular is also developing compiler abstractions, which is something I like and agree with, isn’t it? Let’s discuss techniques, and it doesn’t have to be that heated :-)
To clarify, my point is that matmuls can be solved with proper compiler abstractions, and it’s not that hard, and if you are working on a compiler, I believe you would more or less agree with that point, wouldn’t you?
Liking Triton or not is a personal preference, and I use it as an example only because it’s gaining a lot of momentum at the moment, not because it’s a perfect abstraction. If you personally don’t like it, I could also discuss exo-compilation or Tensor Comprehensions, but let’s always focus on concrete technical items :-)
The problem is already "solved" to almost everyone's satisfaction by being O(N), i.e. one optimized matrix math library per platform.
But if they can reduce that to O(1) by creating a tool that takes computing hardware characteristics (core/compute topology, instructions, memory hierarchy, ...) and outputs state-of-the-art optimized matrix multiply machine code, that would be a nice and useful result.
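It wouldn't be the full 'hardware description in, machine code out' tool, but autotuning one kernel source over a declared design space is a taste of that direction. For instance, with Triton's autotuner (a toy AXPY kernel, purely illustrative; the config list is the searched design space):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=2),
        triton.Config({"BLOCK": 1024}, num_warps=4),
        triton.Config({"BLOCK": 4096}, num_warps=8),
    ],
    key=["n"],                       # re-tune when the problem size changes
)
@triton.jit
def axpy_kernel(x_ptr, y_ptr, alpha, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, alpha * x + y, mask=mask)
```

The same source then gets timed with each configuration on whatever device is present, and the fastest one is cached.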
I really like the Neanderthal library because it does a pretty good job of abstracting over Nvidia, AMD, and Intel hardware to provide matrix operations in an extremely performant manner for each one with the same code. Dragan goes into a lot of detail about the hardware differences. His library also provides some of the fastest implementations for the given hardware; it's not a hand-wavy, half-baked performance abstraction, the code is really fast.
https://github.com/uncomplicate/neanderthal
Surely one solution is for each AI framework to understand the operating environment itself and choose the best implementation at run-time, much like they currently do.
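For example, PyTorch already exposes knobs along these lines (assuming a CUDA build):

```python
import torch

# Let cuDNN benchmark its available convolution algorithms for the shapes it
# actually sees at run-time and cache the fastest choice.
torch.backends.cudnn.benchmark = True

# Opt fp32 matmuls into the TF32 tensor-core path when the hardware has one.
torch.backends.cuda.matmul.allow_tf32 = True
```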
And they developed this fragmentation by... building good tools, good documentation, and comprehensively supporting them for 15 years in a way that makes people feel safe building on top of them.
And with their actual understanding of the hardware limitations of GPUs (memory bandwidth), the parallel work on things like cutlass (if there was ever an unportable thing :-), the coming *Dx libraries (the explosion of cuBLAS/Solver/FFT to allow kernel fusion and new in-kernel linear algebra shenanigans), and the slow but steady introduction of sparsity everywhere, I can't see how anyone can do anything but play catch-up.
It's not like other vendors have made meaningful efforts in alternatives. AMD still hasn't released RDNA3 support for ROCm, their open compute platform. Hell, I don't even think RDNA2 has proper support as of now.
There's also the issue of poor documentation and learning material in the wild.
yeah, when getting DL up and running on AMD requires a datacentre card, it's no wonder CUDA is more popular. AMD is enabling ROCm for consumer GPUs now, but it's still a pain to get up and running, because of the inertia that CUDA has.
they are the one vendor who had the insight ~20 years ago to invest long-term in GPUs and have continuously made impressive products while supporting a cross-platform developer base. For this, I reward them with my $$$ (both work and home).
> performance has become increasingly constrained by memory latency, which has grown much slower than processing speeds.
Sounds like they would oddly prefer memory latency to grow at least as fast as processing speeds, which would be terrible. Obviously, memory latency has actually decreased, just not enough.
So it seems likely they made a mistake and actually meant that memory latency has decreased more slowly than processing speeds have increased; in other words, that it is not memory latency but memory random-access throughput (which, to a rough approximation, is proportional to the inverse of memory latency) that has grown much more slowly than processing speeds.
https://opensource.googleblog.com/2023/03/openxla-is-ready-t...
> OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware