
Looking at the effort poured into things like cutlass and them still not reaching cuBLAS perf (which very few can beat - in the places where cuBLAS shines! which is... not that many...), and even in cuDNN they're still eking out single-digit improvements regularly, I'd say this is probably harder than that. At least if you're reaching for >50% use of the 37 TFLOPS of an A40. If you're fine throwing more GPUs at the problem, sure.

Edit: I mean, when you still see papers every year with large improvements in perf, and things like 'we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things' - what? - I can attest it doesn't take 2 weeks to get these kinds of results. And that's just getting started on tensor cores! And when someone on the nvidia forums says 'nah, probably no improvement from using tensor cores for FFT' and you get a link to a paper showing a significant perf improvement using tensor cores, I say we're just starting.




This is definitely a great point! In the context of AI workloads, where the critical matmuls basically have regular, large shapes, are there many cases where cutlass/Triton are worse than cuBLAS to the point that we need to throw more GPUs at it?


cuBLAS is very often too heavy (too much overhead, memory movement to fit the API, not optimized for small batches of small matrices) and you can get huge improvements by chaining cudnn/cutlass/autotuned kernels. Especially if you're still on GDDR6, every data movement is a killer, so if you can put it all together and never go back to global memory, you get amazing improvements. And this is without tensor cores. Programming them by hand is a pain, which is where cutlass comes in...
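For what it's worth, a toy numpy sketch (mine, not anyone's production code) of what "put it all together and never go back to global memory" means structurally: the bias + ReLU epilogue is applied per output tile while the tile is still "local", instead of doing extra full passes over the matrix. On a GPU the tile would live in registers/shared memory; numpy obviously can't show the perf win, only the shape of the fusion.

    import numpy as np

    def gemm_bias_relu_unfused(A, B, bias):
        # Library-style pipeline: each step writes out its full result
        # (on a GPU: a round trip through global memory between kernels).
        C = A @ B
        C = C + bias
        return np.maximum(C, 0.0)

    def gemm_bias_relu_fused(A, B, bias, tile=64):
        # Fused-style pipeline: produce one output tile at a time and apply
        # the bias + ReLU "epilogue" while the tile is still local, so the
        # pre-activation matrix is never materialized.
        M, K = A.shape
        N = B.shape[1]
        out = np.empty((M, N), dtype=A.dtype)
        for i in range(0, M, tile):
            for j in range(0, N, tile):
                acc = A[i:i+tile] @ B[:, j:j+tile]   # one tile of the matmul
                acc += bias[j:j+tile]                # epilogue, no extra pass
                out[i:i+tile, j:j+tile] = np.maximum(acc, 0.0)
        return out

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((256, 128)), rng.standard_normal((128, 192))
    bias = rng.standard_normal(192)
    print(np.allclose(gemm_bias_relu_fused(A, B, bias),
                      gemm_bias_relu_unfused(A, B, bias)))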


Yeah cuBLAS is definitely not perfect in many cases :-((

Speaking of the GEMM fusion you mentioned, flash attention is basically GEMM fusion with online softmax, right? This is something I believe is really cool and could be made really easy with a proper abstraction. Say, you could move a chunk of computation under a certain loop and instruct the compiler to optimize the data movement, or cache intermediate tiles somewhere on chip.
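For concreteness, here's a rough numpy sketch of the online-softmax loop behind flash attention (just the math; none of the tiling/shared-memory details that make the real kernel fast). The full (n x n) score matrix never exists; instead a running max and running normalizer per query row are maintained and the partial output is rescaled as each K/V tile streams through.

    import numpy as np

    def attention_reference(Q, K, V):
        S = Q @ K.T / np.sqrt(Q.shape[-1])
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        return P @ V

    def attention_fused(Q, K, V, tile=64):
        # Flash-attention-style loop: stream over K/V tiles, keep a running
        # max m and running normalizer l per query row, and rescale the
        # partial output O as new tiles arrive.
        n, d = Q.shape
        O = np.zeros_like(Q)
        m = np.full((n, 1), -np.inf)
        l = np.zeros((n, 1))
        for j in range(0, K.shape[0], tile):
            S = Q @ K[j:j+tile].T / np.sqrt(d)        # scores for this tile only
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)                     # unnormalized probabilities
            scale = np.exp(m - m_new)                 # correction for the old max
            l = l * scale + P.sum(axis=-1, keepdims=True)
            O = O * scale + P @ V[j:j+tile]
            m = m_new
        return O / l

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
    print(np.allclose(attention_fused(Q, K, V), attention_reference(Q, K, V)))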


There's something of this in cutlass with prologues and epilogues, and in the 'backend mode' of cudnn, but overall breaking the 'cuBLAS takes your whole device and tries to saturate it for this one matmul' model is going to require a whole lot of abstraction work.

Cutlass is supposed to be the first step, and to anyone who struggles to understand WTF they're doing when using it: you are not alone. I've seen literally amazing, room-silencing stuff done with it, but heavy template stuff is really not my thing.


I am personally a really huge fan of cutlass, and I've read almost every single file in their `include/cutlass/` folder (haven't caught up with the `cute` stuff yet).

Just like you said, I really appreciate that with cutlass we can actually understand what is going on inside the kernel, and customize it in ways that cuBLAS doesn't necessarily provide.

Have to agree with you that the template stuff is really annoying. Even with some template tricks, the error messages are still barely readable, and that is where I think a better abstraction could help. Imagine an abstraction that models threadblocks/warps/etc. in a unified way while generalizing to more backends (AMDGPU, Vulkan, AVX512-VNNI, etc.), and provides friendlier error messages during compilation, given that the abstraction is almost certainly more structured than raw C++ code.


I know what you mean. I guess Triton might be one way people are actually trying that, and beyond that there are a lot of people putting years of work into MLIR-based tech, trying to separate, in a retargetable way, the algorithm from its 'scheduling'. Might be worth a look if you're into this :-)
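To give a flavour of what "separating the algorithm from its schedule" means in practice, here's roughly what it looks like in TVM's (older) te API rather than MLIR - written from memory, so treat it as a sketch, not gospel:

    import tvm
    from tvm import te

    # Algorithm: what to compute (a trivial elementwise op to keep it short).
    n = te.var("n")
    A = te.placeholder((n,), name="A", dtype="float32")
    B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

    # Schedule: how to compute it on a particular target, decided separately.
    s = te.create_schedule(B.op)
    bx, tx = s[B].split(B.op.axis[0], factor=64)   # tile the loop
    s[B].bind(bx, te.thread_axis("blockIdx.x"))    # map tiles to blocks
    s[B].bind(tx, te.thread_axis("threadIdx.x"))   # map elements to threads

    # Same algorithm, different schedule -> different generated code.
    print(tvm.lower(s, [A, B], simple_mode=True))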


Ah nice! I intentionally didn’t talk a lot about “scheduling” because 1) I’m personally working heavily on it, which potentially creates a conflict of interest, and 2) I don’t want to deviate too much from this thread’s topic of “optimizing matmuls”. Check out my google scholar for more details though!

I love MLIR and Modular, so please do share more about it! If it’s a potential distraction from this thread, I’m also open to email communication if you are interested!

Oh btw, to clarify, I’m not saying Triton is an ideal abstraction. I love it, and it’s super popular because it’s the most user-friendly option for ML researchers to write performant kernels on certain GPUs, but from an MLSys researcher’s perspective, I’m personally more ambitious and want to target a broader range of hardware. Also, I really appreciate Philippe’s work that makes Triton really performant and easy to use.


> we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things

Hey, are you referring to 3xTF32 (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)? IMO this is a perfect example of where a proper abstraction could save engineers a non-trivial amount of time - imagine a compiler stack that treats 3xTF32 as a normal dtype, with the subsequent analyses compatible with this special dtype :-)
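For anyone curious, the trick behind 3xTF32 is roughly: split each fp32 operand into a TF32 "big" part plus a TF32 remainder, and recover most of the fp32 product from three TF32 multiplies. A rough numpy emulation of the arithmetic (not the cutlass code; TF32 rounding approximated here by mantissa truncation):

    import numpy as np

    def tf32(x):
        # Crude TF32 emulation: keep 10 of float32's 23 mantissa bits
        # (plain truncation; real tensor cores round-to-nearest, but this
        # is close enough to show the idea).
        u = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (u & np.uint32(0xFFFFE000)).view(np.float32)

    def gemm_1xtf32(a, b):
        # One tensor-core-style pass: TF32 inputs, fp32 accumulation.
        return (tf32(a) @ tf32(b)).astype(np.float32)

    def gemm_3xtf32(a, b):
        # Split each fp32 operand into a TF32 "big" part and a TF32 remainder,
        # then recover most of the fp32 product from 3 TF32 GEMMs:
        #   a*b ~= big*big + big*small + small*big
        # (the small*small term is below fp32 precision and is dropped).
        a_big, b_big = tf32(a), tf32(b)
        a_small, b_small = tf32(a - a_big), tf32(b - b_big)
        return (gemm_1xtf32(a_big, b_big)
                + gemm_1xtf32(a_big, b_small)
                + gemm_1xtf32(a_small, b_big))

    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256), dtype=np.float32)
    b = rng.standard_normal((256, 256), dtype=np.float32)
    ref = a.astype(np.float64) @ b.astype(np.float64)
    print("1xTF32 max err:", np.abs(gemm_1xtf32(a, b) - ref).max())
    print("3xTF32 max err:", np.abs(gemm_3xtf32(a, b) - ref).max())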





