
Looking at the effort poured into things like cutlass and them still not reaching cuBLAS perf (which very few can beat - in the places where cuBLAS shines! which is... not that many...), and even in cuDNN they're still eking out single-digit improvements regularly, I'd say this is probably harder than that. At least if you're reaching for >50% use of the 37 TFLOPS of an A40. If you're fine throwing more GPUs at the problem, sure.

Edit: I mean, when you still see papers every year with large improvements in perf, and things like 'we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things' - what? - I can attest it doesn't take 2 weeks to get these kinds of results. And that's just getting started on tensor cores! And when someone on the nvidia forums says 'nah, probably no improvement from using tensor cores for FFT' and you get a link to a paper showing a significant perf improvement using tensor cores, I say we're just starting.




This is definitely a great point! In the context of AI workloads, where the critical matmuls basically have regular, large shapes, are there many cases where cutlass/Triton are worse than cuBLAS to the point that we need to throw more GPUs at it?


cuBLAS is very often too heavy (too much overhead, memory movement to fit the API, not optimized for small batches of small matrices) and you can get huge improvements by chaining cudnn/cutlass/autotuned kernels. Especially if you're still on GDDR6, every data movement is a killer, so if you can put it all together and never go back to global memory, you get amazing improvements. And this is without tensor cores. Programming them by hand is a pain, which is where cutlass comes in...
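For what it's worth, a toy numpy sketch (mine, not anyone's production code) of what "put it all together and never go back to global memory" means structurally: the bias + ReLU epilogue is applied per output tile while the tile is still "local", instead of doing extra full passes over the matrix. On a GPU the tile would live in registers/shared memory; numpy obviously can't show the perf win, only the shape of the fusion.

    import numpy as np

    def gemm_bias_relu_unfused(A, B, bias):
        # Library-style pipeline: each step writes out its full result
        # (on a GPU: a round trip through global memory between kernels).
        C = A @ B
        C = C + bias
        return np.maximum(C, 0.0)

    def gemm_bias_relu_fused(A, B, bias, tile=64):
        # Fused-style pipeline: produce one output tile at a time and apply
        # the bias + ReLU "epilogue" while the tile is still local, so the
        # pre-activation matrix is never materialized.
        M, K = A.shape
        N = B.shape[1]
        out = np.empty((M, N), dtype=A.dtype)
        for i in range(0, M, tile):
            for j in range(0, N, tile):
                acc = A[i:i+tile] @ B[:, j:j+tile]   # one tile of the matmul
                acc += bias[j:j+tile]                # epilogue, no extra pass
                out[i:i+tile, j:j+tile] = np.maximum(acc, 0.0)
        return out

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((256, 128)), rng.standard_normal((128, 192))
    bias = rng.standard_normal(192)
    print(np.allclose(gemm_bias_relu_fused(A, B, bias),
                      gemm_bias_relu_unfused(A, B, bias)))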


Yeah cuBLAS is definitely not perfect in many cases :-((

Speaking of the GEMM fusion you mentioned, flash attention is basically GEMM fusion with online softmax, right? This is something I believe is really cool and could be made really easy with a proper abstraction. Say, you could move a chunk of computation under a certain loop and instruct the compiler to optimize the data movement, or cache intermediate tiles somewhere on chip.
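For concreteness, here's a rough numpy sketch of the online-softmax loop behind flash attention (just the math; none of the tiling/shared-memory details that make the real kernel fast). The full (n x n) score matrix never exists; instead a running max and running normalizer per query row are maintained and the partial output is rescaled as each K/V tile streams through.

    import numpy as np

    def attention_reference(Q, K, V):
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        return P @ V

    def attention_fused(Q, K, V, tile=64):
        # Flash-attention-style loop: stream over K/V tiles, keep a running
        # max m and running normalizer l per query row, and rescale the
        # partial output O as new tiles arrive.
        n, d = Q.shape
        O = np.zeros_like(Q)
        m = np.full((n, 1), -np.inf)
        l = np.zeros((n, 1))
        for j in range(0, K.shape[0], tile):
            S = Q @ K[j:j+tile].T / np.sqrt(d)        # scores for this tile only
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)                     # unnormalized probabilities
            scale = np.exp(m - m_new)                 # correction for the old max
            l = l * scale + P.sum(axis=-1, keepdims=True)
            O = O * scale + P @ V[j:j+tile]
            m = m_new
        return O / l

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    print(np.allclose(attention_fused(Q, K, V), attention_reference(Q, K, V)))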


There's something of this in cutlass with prologues and epilogues, and in the 'backend mode' of cudnn, but overall breaking the 'cuBLAS takes your whole device and tries to saturate it for this one matmul' model is going to require a whole lot of abstraction work.

Cutlass is supposed to be the first step, and to anyone who struggles to understand WTF they're doing when using it: you are not alone. I've seen literally amazing, room-silencing stuff done with it, but heavy template stuff is really not my thing.


I am personally a really huge fan of cutlass, and I've read almost every single file in their `include/cutlass/` folder (haven't caught up with the `cute` stuff yet).

Just like you said, I really appreciate that with cutlass we can actually understand what is going on inside the kernel, and customize it in ways that cuBLAS doesn't necessarily provide.

Have to agree with you that the template stuff is really annoying. Even with some template tricks, the error messages are still barely readable, and that is where I think a better abstraction could help. Imagine an abstraction that models threadblocks/warps/etc. in a unified way while generalizing to more backends (AMDGPU, Vulkan, AVX512-VNNI, etc.), and provides friendlier error messages during compilation, given that the abstraction is almost certainly more structured than raw C++ code.


I know what you mean. I guess Triton might be one way people are actually trying that, and beyond that there are a lot of people putting years of work into MLIR-based tech, trying to separate, in a retargetable way, the algorithm from its 'scheduling'. Might be worth a look if you're into this :-)
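To give a flavour of what "separating the algorithm from its schedule" means in practice, here's roughly what it looks like in TVM's (older) te API rather than MLIR - written from memory, so treat it as a sketch, not gospel:

    import tvm
    from tvm import te

    # Algorithm: what to compute (a trivial elementwise op to keep it short).
    n = te.var("n")
    A = te.placeholder((n,), name="A", dtype="float32")
    B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

    # Schedule: how to compute it on a particular target, decided separately.
    s = te.create_schedule(B.op)
    bx, tx = s[B].split(B.op.axis[0], factor=64)   # tile the loop
    s[B].bind(bx, te.thread_axis("blockIdx.x"))    # map tiles to blocks
    s[B].bind(tx, te.thread_axis("threadIdx.x"))   # map elements to threads

    # Same algorithm, different schedule -> different generated code.
    print(tvm.lower(s, [A, B], simple_mode=True))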


Ah nice! I intentionally didn’t talk a lot about “scheduling” because 1) I’m personally working heavily on it, which potentially creates a conflict of interest, and 2) I don’t want to deviate too much from this thread’s topic of “optimizing matmuls”. Check out my google scholar for more details though!

I love MLIR and Modular, so please do share more about it! If it’s a potential distraction from this thread, I’m also open to email communication if you are interested!

Oh btw, to clarify, I’m not saying Triton is an ideal abstraction. I love it, and it’s super popular because it’s the most user-friendly option for ML researchers to write performant kernels on certain GPUs, but from an MLSys researcher’s perspective, I’m personally more ambitious and want to target a broader range of hardware. Also, I really appreciate Philippe’s work that makes Triton really performant and easy to use.


> we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things

Hey, are you referring to 3xTF32 (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)? IMO this is a perfect example of where a proper abstraction could save engineers a non-trivial amount of time - imagine a compiler stack that treats 3xTF32 as a normal dtype, with the subsequent analyses compatible with this special dtype :-)
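For anyone curious, the trick behind 3xTF32 is roughly: split each fp32 operand into a TF32 "big" part plus a TF32 remainder, and recover most of the fp32 product from three TF32 multiplies. A rough numpy emulation of the arithmetic (not the cutlass code; TF32 rounding approximated here by mantissa truncation):

    import numpy as np

    def tf32(x):
        # Crude TF32 emulation: keep 10 of float32's 23 mantissa bits
        # (plain truncation; real tensor cores round-to-nearest, but this
        # is close enough to show the idea).
        u = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (u & np.uint32(0xFFFFE000)).view(np.float32)

    def gemm_1xtf32(a, b):
        # One tensor-core-style pass: TF32 inputs, fp32 accumulation.
        return (tf32(a) @ tf32(b)).astype(np.float32)

    def gemm_3xtf32(a, b):
        # Split each fp32 operand into a TF32 "big" part and a TF32 remainder,
        # then recover most of the fp32 product from 3 TF32 GEMMs:
        #   a*b ~= big*big + big*small + small*big
        # (the small*small term is below fp32 precision and is dropped).
        a_big, b_big = tf32(a), tf32(b)
        a_small, b_small = tf32(a - a_big), tf32(b - b_big)
        return (gemm_1xtf32(a_big, b_big)
                + gemm_1xtf32(a_big, b_small)
                + gemm_1xtf32(a_small, b_big))

    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256), dtype=np.float32)
    b = rng.standard_normal((256, 256), dtype=np.float32)
    ref = a.astype(np.float64) @ b.astype(np.float64)
    print("1xTF32 max err:", np.abs(gemm_1xtf32(a, b) - ref).max())
    print("3xTF32 max err:", np.abs(gemm_3xtf32(a, b) - ref).max())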





