A lot of people think AMD should support these translation layers, but I think it's a bad idea. CUDA is not designed to be vendor-agnostic, and Nvidia can make things arbitrarily difficult both technically and legally. For example, I think it would be against the license agreements of cuDNN or cuBLAS to run them on such a layer. So those and other Nvidia libraries would become part of the API boundary that AMD would need to reimplement and support.
Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.
Are you aware of HIP? It's officially supported and, for code that avoids obscure features of CUDA like inline PTX, it's pretty much a find-and-replace to get a working build:
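To make the find-and-replace point concrete, here's roughly what a port looks like for a toy kernel (a minimal sketch, not from any real project): the device code compiles unchanged under hipcc, and only the host-side cuda* runtime calls get renamed to their hip* counterparts.

    // saxpy.cu -- toy example; the kernel body needs no changes for HIP.
    #include <cuda_runtime.h>                       // HIP: <hip/hip_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));              // HIP: hipMalloc
        cudaMalloc(&y, n * sizeof(float));              // HIP: hipMalloc
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y); // launch syntax is identical
        cudaDeviceSynchronize();                        // HIP: hipDeviceSynchronize
        cudaFree(x);                                    // HIP: hipFree
        cudaFree(y);
        return 0;
    }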
if you're talking about building anything, that is already too hard for ML researchers.
you have to be able to pip install something and just have it work, reasonably fast, without crashing, and also it has to not interfere with 100 other weird poorly maintained ML library dependencies.
If your point is that HIP is not a zero-effort porting solution, that is correct. HIP is a low-effort solution, not a zero-effort solution. It targets users who already use and know CUDA, and minimizes the changes required to pre-existing CUDA code.
In the case of these abstraction layers, it would be the responsibility of the abstraction maintainers (or AMD) to port them. Obviously, someone who does not even use CUDA would not use HIP either.
To be honest, I have a hard time believing that a truly zero-effort solution exists, especially one that gets high performance. Once you start talking about the full stack, there are too many potholes and sharp edges to believe that it will really work. So I am highly skeptical of the original article. Not that I wouldn't want to be proved wrong. But what they're claiming to do is a big lift, even taking HIP as a starting point.
The easiest, fastest (for end users), highest-performance solution for ML will come when the ecosystem integrates it natively. HIP would be a way to get there faster, but it will take nonzero effort from CUDA-proficient engineers to get there.
As other commenters have pointed out, this is probably a good solution for HPC jobs where everyone is using C++ or Fortran anyway and you frequently write your own CUDA kernels.
From time to time I run into a decision maker who understandably wants to believe that AMD cards are now "ready" to be used for deep learning, and points to things like the fact that HIP mostly works pretty well. I was kind of reacting against that.
As someone doing a lot of work with CUDA in a big research organization: there are few of us. If you are working with CUDA, then you are not the type of person who waits for something that just works like you describe. CUDA itself is a battle with poorly documented stuff.
Don’t most orgs that are deep enough to run custom CUDA kernels have dedicated engineers for this stuff? I can’t imagine a person who can write raw CUDA not being able to handle things more difficult than pip install.
Engineers who are really, really good at CUDA are worth their weight in gold, so there are more projects for them than they have time for. Worth their weight in gold isn't figurative here – the one I know has a ski house more expensive than 180 lbs of gold (~$5,320,814).
The fact that "worth their weight in cold" typically means in the single-digit millions is fascinating to me (though I doubt I'll be able to get there myself, maybe someday). I looked it up though and I think this is undercounting the current value of gold per ounce/lb/etc.
It's worth noting that anyone with a ski house that expensive probably has a net worth well over twice the price of that ski house. I guess it's time to start learning CUDA!
> That growth for _gold_ of all things (up 71% in the last 5 years) is crazy to me.
For comparison: the S&P 500 grew about the same during that period (more than 100% from Jan 2019, about 70% from Dec 2019), so the rise in the price of gold did not outperform the growth of the general (financial) economy.
But that's still surprising performance, because the S&P generates income and pays dividends. Its increase reflects (at least, is supposed to!) expectations of future higher income. Gold doesn't even bear interest....
Gold is commonly seen as a hedge against inflation and a decently stable non-currency store of value. With many countries having/being perceived to have high inflation during this time, the price of gold is bound to rise as well. Pretty much any economic or sociopolitical tremor will bounce up the price of gold at least temporarily.
The S&P doesn't really pay much in the way of dividends does it? Last time I checked it was order-of-magnitude 1% which is a bit of a joke figure.
Anyway, there isn't a lot of evidence that the value of gold is going up. It seems to just be keeping pace with the M2. Both doubled-and-a-bit since 2010 (working in USD).
A working knowledge of C++, plus a bit of online reading about CUDA and the NVidia GPU architecture, plus studying the LCZero chess engine source code (the CUDA neural net part, I mean) seems like enough to get started. I did that and felt like I could contribute to that code, at least at a newbie level, given the hardware and build tools. At least in the pre-NNUE era, the code was pretty readable. I didn't pursue it though.
Of course becoming "really good" is a lot different and like anything else, it presumably takes a lot of callused fingertips (from typing) to get there.
Having dabbled in CUDA, but not worked on it professionally, it feels like a lot of the complexity isn't really in CUDA/C++, but in the algorithms you have to come up with to really take advantage of the hardware.
Optimizing something for SIMD execution often isn't straightforward, and it isn't something a lot of developers encounter outside a few small areas. There are also a lot of hardware architecture considerations you have to work with (memory transfer speed is a big one) to even come close to saturating the compute units.
The real challenge is probably getting your hands on a 4090 for a price you can pay before you are worth your weight in gold. Because an arm and a leg in gold is quite a lot.
You don't really need a 4090. An older board is plenty. The software is basically the same. I fooled around with what I think was a 1080 on Paperspace for something like 50 cents an hour, but it was mostly with some Pytorch models rather than CUDA directly.
Really old GPUs were different, but the 1080 is similar to later stuff with a few features missing. Half precision and "tensor cores" iirc. It could be that the very most recent stuff has changed more (I haven't paid attention), but I thought that the 4090 was just another evolutionary step.
Everyone, and I mean everyone, I know doing AI/ML work values VRAM above all. The absolute best bang for buck is buying used P40s, and if you actually want those cards to be usable for other stuff, used 3090s are the best deal around; they should be ~$700 right now.
Well, to give an example, 32GB of vram would be vastly more preferable to 24GB of higher bandwidth vram. You really need to be able to put the entire LLM in memory for best results, because otherwise you're bottlenecking on the speed of transfer between regular old system ram and the gpu.
You'll also note that M1/M2 Macs with large amounts of system memory are good at inference because the GPU has a very high speed interconnect between the soldered-on RAM modules and the on-die GPU. It's all about avoiding bottlenecks wherever possible.
Not really any paradigm shift since the introduction of Tensor Cores in NVIDIA archs. Anything Ampere or Lovelace will do to teach yourself CUDA, up to the crazy optimization techniques and the worst mind-warping libraries. You'll only miss out on HBM, which lets you cheat on memory bandwidth; amount of VRAM (teach yourself on smaller models...); double precision perf; and double precision tensor cores (go for an A30 then, and I'm not sure they'll keep those - either the x30 bin or DP tensor cores - ever since "DGEMM on Integer Matrix Multiplication Unit" - https://arxiv.org/html/2306.11975v4 ). FP4, DPX, TMA, GPUDirect are nice, but you must be pretty far out already for them to be mandatory...
I was looking into this recently and it seems like the cheapest AWS instance with a CUDA GPU is something on the order of $1/hr. It looks like an H100 instance might be $15/hr (although I’m not sure if I’m looking at a monthly price).
So yeah it’s not ideal if you’re on a budget, but it seems like there are some solutions that don’t involve massive capex.
Look on vast.ai instead of AWS; you can rent machines with older GPUs dirt cheap. I don't see how they even cover the electricity bills. A 4090 machine starts at about $.25/hour, though I didn't examine the configuration.
Thrashed? What type of damage could a mostly-solid-state device suffer? Fan problems? Worn PCIe connectors? Deteriorating thermal paste from repeated heat cycling?
Heat. A lot of components - and not just in computers but everything hardware - are spec'd for something called "duty cycles", basically how long a thing is active in a specific time frame.
Gaming cards/rigs, which many of the early miners were based on, rarely run at 100% all the time, the workload is burst-y (and distributed amongst different areas of the system). In comparison, a miner runs at 100% all the time.
On top of that, for silicon there is an effect called electromigration [1], where the literal movement of electrons erodes the material over time - made worse by ever shrinking feature sizes as well as, again, the chips being used in exactly the same way all the time.
When people were mining Ethereum (which was the last craze that GPUs were capable of playing in -- BTC has been off the GPU radar for a long time), profitable mining was fairly kind to cards compared to gaming.
Folks wanted their hardware to produce as much as possible, for as little as possible, before it became outdated.
The load was constant, so heat cycles weren't really a thing.
That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.
Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.
For Ethereum, RAM got hit hard. But RAM doesn't necessarily care about that; DRAM in general is more or less just an array of solid-state capacitors. And people needed that RAM to work reliably -- it's NFG to spend money producing bad blocks.
Power supplies tended to be stable, because good, cheap, stable, high-current, and stupidly-efficient are qualities that go hand-in-hand thanks to HP server PSUs being cheap as chips.
There were exceptions, of course: Some people did not mine smartly.
---
But this is broadly very different from how gamers treat hardware, wherein: heat cycles are real, overclocking everything to eke out an extra few FPS is real, pushing things a bit too far and producing glitches can be tolerated sometimes, fan speeds are whatever, and power supplies are picked based on what they look like instead of an actual price/performance comparison.
A card that was used for mining is not implicitly worse in any way than one that was used for gaming. Purchasing either thing involves non-zero risk.
> That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.
> Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.
In the ideal case, this is spot on. Annoyingly however, this hinges on the assumption of an awful lot of competence from top to bottom.
If I've learned anything in my considerable career, it's that reality is typically one of the first things tossed when situations and goals become complex.
The few successful crypto miners maybe did some of the optimizations you mention. The odds aren't good enough for me to want to purchase a Craigslist or FB marketplace card for only a 30% discount.
I kinda doubt it. Nobody paid me to do that though. I was just interested in LCZero. To get that $500k/year, I think you need up to date ML understanding and not just CUDA. CUDA is just another programming language while ML is a big area of active research. You could watch some of the fast.ai ML videos and then enter some Kaggle competitions if you want to go that route.
You're wrong. The people building the models don't write CUDA kernels. The people optimizing the models write CUDA kernels. And you don't need to know a bunch of ML bs to optimize kernels. Source: I optimize GPU kernels. I don't make 500k, but I'm not that far from it.
How much performance difference is there between writing a kernel in a high level language/framework like PyTorch (torch.compile) or Triton, and hand optimizing? Are you writing kernels in PTX?
What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?
Heh I'm in the wrong business then. Interesting. Used to be that game programmers spent lots of time optimizing non-ML CUDA code. They didn't make anything like 500k at that time. I wonder what the ML industry has done to game development, or for that matter to scientific programming. Wow.
That’s pretty funny. Good test of value across the millennia. I wonder if the best aqueduct engineers during the peak of Ancient Rome’s power had villas worth their body weight in gold.
Selection bias. I'm sure there are lots of people who are really good at CUDA and don't have those kind of assets. Not everyone knows how to sell their skills.
Right now, Nvidia's valuation has made a lot of people realize that their CUDA skills were being undervalued. Anyone with GPU or ML skills who hasn’t tried to get a pay raise in this market deserves exactly the life that they are living.
>> Don’t most orgs that are deep enough to run custom cuda kernels have dedicated engineers for this stuff. I can’t imagine a person who can write raw cuda not being able to handle things more difficult than pip install.
This seems to be a fairly common problem with software. The people who create software regularly deal with complex tool chains, dependency management, configuration files, and so on. As a result they think that if a solution "exists", everything is fine. Need to edit a config file for your particular setup? No problem. The thing is, I have been programming stuff for decades and I really hate having to do that stuff, and will avoid tools that make me do it. I have my own problems to solve, and don't want to deal with figuring out tools no matter how "simple" the author thinks that is to do.
A huge part of the reason commercial software exists today is probably because open source projects don't take things to this extreme. I look at some things that qualify as products and think they're really simplistic, but they take care of some minutiae that regular people are willing to pay for so they don't have to learn or deal with it. The same can be true for developers and ML researchers or whatever.
> if you're talking about building anything, that is already too hard for ML researchers.
I don't think so. I agree it is too hard for the ML researchers at the companies which will have their rear ends handed to them by the other companies whose ML researchers can be bothered to follow a blog post and prompt ChatGPT to resolve error messages.
I'm not really talking about companies here for the most part, I'm talking about academic ML researchers (or industry researchers whose role is primarily academic-style research). In companies there is more incentive for good software engineering practices.
I'm also speaking from personal experience: I once had to hand-write my own CUDA kernels (on official NVIDIA cards, not even this weird translation layer): it was useful and I figured it out, but everything was constantly breaking at first.
It was a drag on productivity and more importantly, it made it too difficult for other people to run my code (which means they are less likely to cite my work).
The target audience of interoperability technology is whoever is building, though. Ideally, interoperability technology can help software that supports only NVIDIA GPUs today go on to quickly add baseline support for Intel and AMD GPUs tomorrow.
(and for one data point, I believe Blender is actively using HIP for AMD GPU support in Cycles.)
There is a more glaring issue: ROCm doesn't even work well on most AMD devices nowadays, and HIP performance deteriorates on the same hardware compared to ROCm.
If you want to write a very efficient CUDA kernel for a modern datacenter NVIDIA GPU (read: H100), you need to write it with the hardware in mind (and preferably in hand; an H100 and an RTX 4090 behave very differently in practice). So I don't think the difference between AMD and NVIDIA is as big as everyone perceives.
Yeah, a lot of the newer accelerators are not even available without using inline PTX assembly. Even the ones that are have weird shapes that are not amenable to high-performance work.
For any compiler, "supporting" a certain CPU or GPU only means that they can generate correct translated code with that CPU or GPU as the execution target.
It does not mean that the compiler is able to generate code that has optimal performance, when that can be achieved by using certain instructions without a direct equivalent in a high-level language.
No compiler that supports the Intel-AMD ISA knows how to use all the instructions available in this ISA.
Sure, but I'm not sure if that is what the parent poster was saying (that nvcc generates poor quality PTX for newer devices).
It's been a while since I looked at CUDA, but it used to be that NVIDIA were continually extending cuDNN to add support for kernels needed by SOTA models, and I assume these kernels were all hand optimized.
I'm curious what kind of models people are writing where not only is there no optimized cuDNN support, but solutions like Triton or torch.compile, and even hand-optimized CUDA C kernels, are too slow. Are hand-written PTX kernels really that common?
Yes. Take a look at, say, CUTLASS: you'll see that they use PTX instructions because there are no intrinsics, much less automatic compiler lowering, for the accelerators they target.
Yes, but that's an NVIDIA project, so would be expected to be hand optimized, same as their cuDNN kernels.
I'm more curious about what types of model people in research or industry are developing, where NVIDIA support such as this is not enough, and they are developing their own PTX kernels.
Support this, reimplement that, support upstream efforts, don't really care. Any of those would cost a couple of million and be worth a trillion dollars to AMD shareholders.
Is it weird how the comments here are blaming AMD and not Nvidia? Sure, the obvious argument is that Nvidia has no practical motivation to build an open platform. But there are counterexamples that suggest otherwise (Android). And there is a compelling argument that long term, their proprietary firmware layer will become an insufficient moat to their hardware dominance.
Who’s the root cause? The company with the dominant platform that refuses to open it up, or the competitor who can’t catch up because they’re running so far behind? Even if AMD made their own version of CUDA that was better in every way, it still wouldn’t gain adoption because CUDA has become the standard. No matter what they do, they’ll need to have a compatibility layer. And in that case maybe it makes sense for them to invest in the best one that emerges from the community.
> Is it weird how the comments here are blaming AMD and not Nvidia?
Nvidia has put in the legwork and are reaping the rewards. They've worked closely with the people who are actually using their stuff, funding development and giving loads of support to researchers, teachers and so on, for probably a decade now. Why should they give all that away?
> But there are counterexamples that suggest otherwise (Android).
How is Android a counterexample? Google makes no money off of it, nor does anyone else. Google keeps Android open so that Apple can't move everyone onto their ad platform, so it's worth it for them as a strategic move, but Nvidia has no such motive.
> Even if AMD made their own version of CUDA that was better in every way, it still wouldn’t gain adoption because CUDA has become the standard.
Maybe. But again, that's because NVidia has been putting in the work to make something better for a decade or more. The best time for AMD to start actually trying was 10 years ago; the second-best time is today.
> Google makes no money off of it, nor does anyone else
Google makes no money off of Android? That seems like a really weird claim to make. Do you really think Google would be anywhere near as valuable of a company if iOS had all of the market share that the data vacuum that is Android has? I can't imagine that being the case.
Google makes a boatload off of Android, just like AMD would if they supported open GPGPU efforts aggressively.
Android is a complement to Google's business, which is when open source works. What would be the complement worth $1 Trillion to NVIDIA to build a truly open platform? There isn't one. That was his point.
There’s an entire derivative industry of GPUs, namely GenAI and LLM providers, that could be the “complement” to an open GPU platform. The exact design and interface between such a complement and platform is yet undefined, but I’m sure there are creative approaches to this problem.
And NVIDIA is playing in that game too. Why would they not play in higher level services as well? They already publish the source to their entire software stack. A comparison to Android is completely useless. Google is a multi-sided platform that does lots of things for free for some people (web users, Android users) so it can charge other people for their data (ad buyers). That isn't the chip business whatsoever. The original comment only makes sense if you know nothing about their respective business models.
Yes, so when the ground inevitably shifts below their feet (it might happen years from now, but it will happen – open platforms always emerge and eventually proliferate), wouldn’t it be better for them to own that platform?
On the other hand, they could always wait for the most viable threat to emerge and then pay a few billion dollars to acquire it and own its direction. Google didn’t invent Android, after all…
> Google is a multi-sided platform that does lots of things for free for some people… That isn't the chip business whatsoever.
This is a reductionist differentiation that overlooks the similarities between the platforms of “mobile” and “GPU” (and also mischaracterizes the business model of Google, who does in fact make money directly from Android sales, and even moved all the way down the stack to selling hardware). In fact there is even a potentially direct analogy between the two platforms: LLM is the top of the stack with GPU on the bottom, just like Advertising is the top of the stack with Mobile on the bottom.
Yes, Google’s top level money printer is advertising, and everything they do (including Android) is about controlling the maximum number of layers below that money printer. But that doesn’t mean there is no benefit to Nvidia doing the same. They might approach it differently, since they currently own the bottom layer whereas Google started from the top layer. But the end result of controlling the whole stack will lead to the same benefits.
And you even admit in your comment that Nvidia is investing in these higher levels. My argument is that they are jeopardizing the longevity of these high-level investments due to their reluctance to invest in an open platform at the bottom layer (not even the bottom, but one level above their hardware). This will leave them vulnerable to encroachment by a player that comes from a higher level, like OpenAI for example, who gets to define the open platform before Nvidia ever has a chance to own it.
> it might happen years from now, but it will happen – open platforms always emerge and eventually proliferate
30 years ago people were making the same argument that MS should have kept DirectX open or else they were going to lose to OpenGL. Look how that's worked out for them.
> Google, who does in fact make money directly from Android sales
They don't though. They have some amount of revenue from it, but it's a loss-making operation.
> In fact there is even a potentially direct analogy between the two platforms: LLM is the top of the stack with GPU on the bottom, just like Advertising is the top of the stack with Mobile on the bottom.
But which layer is the differentiator, and which layer is just commodity? Google gives away Android because it isn't better than iOS and isn't trying to be; "good enough" is fine for their business (if anything, being open is a way to stay relevant where they would otherwise fall behind). They don't give away the ad-tech, nor would they open up e.g. Maps data where they have a competitive advantage.
NVidia has no reason to open up CUDA; they have nothing to gain and a lot to lose by doing so. They make a lot of their money from hardware sales which they would open up to cannibalisation, and CUDA is already the industry standard that everyone builds on and stays compatible with. If there was ever a real competitive threat then that might change, but AMD has a long way to go to get there.
"Open up CUDA" - guys, its all open source. What do you want them to do? Do tech support to help their competitors compete against them? AMD is to blame for not building this project 10 years ago.
Google gave away the software platform - Android - to hardware vendors for free, vendors compete making the hardware into cheap, low-margin commodity items, and google makes boatloads of money from ads, tracking and the app store.
nvidia could give away the software platform - CUDA - to hardware vendors for free, making the hardware into cheap, low-margin commodity items. But how would they make boatloads of money when there's nowhere to put ads, tracking or an app store?
>Is it weird how the comments here are blaming AMD and not Nvidia?
It's not. Even as it is, I do not trust HIP or ROCm to be a viable alternative to CUDA. George Hotz did plenty of work trying to port various ML architectures to AMD and was met with countless driver bugs. The problem isn't that Nvidia won't build an open platform - the problem is that AMD won't invest in a competitive platform. 99% of ML engineers do not write CUDA. For the vast majority of workloads, there are probably 20 engineers at Meta who write the CUDA backend for PyTorch that every other engineer uses. Meta could hire another 20 engineers to support whatever AMD has (they did, and it's not as robust as CUDA).
Even if CUDA was open - do you expect nvidia to also write drivers for AMD? I don't believe 3rd parties will get anywhere writing "compatibility layers" because AMD's own GPU aren't optimized or tested for CUDA-like workloads.
Khronos, AMD and Intel have had 15 years to make something out of OpenCL that could rival CUDA.
Instead they managed 15 years of disappointment, with a standard stuck in C99 that adopted C++ and a polyglot bytecode too late to matter, and never produced an ecosystem of IDE tooling and GPU libraries.
Naturally CUDA became the standard, when NVIDIA provided what the GPU community cared about.
> Is it weird how the comments here are blaming AMD and not Nvidia?
Not even a little bit. It simply isn't Nvidia's job to provide competitive alternatives to Nvidia. Competing is something AMD must take responsibility for.
The only reason CUDA is such a big talking point is because AMD tripped over their own feet supporting accelerated BLAS on AMD GPUs. Realistically it probably is hard to implement (AMD have a lot of competent people on staff) but Nvidia hasn't done anything unfair apart from execute so well that they make all the alternatives look bad.
Huh? Why the sarcasm? You think it's a good thing that someone besides the person who owns the hardware has the final say on what the hardware is allowed to be used for?
That's not actually a thing? I specifically moved away from Nvidia because
1) they choose (chose?) not to support standard display protocols that Wayland compositors target with their drivers (annoying, but not the end of the world)
2) they cryptographically lock users out of writing their own drivers for their own graphics cards (which should be illegal and is exactly contradictory to "that's not actually a thing").
Again: look into why the Nouveau driver performance is limited.
This seems to be more about certain devices (consumer-grade GPUs) in certain settings (data centers), though I do question how enforceable it actually is. My guess is that it can only apply when you try to get discounts from bulk-ordering GPUs.
Also, was there any followup to this story? It seems a bit unnecessary because nVidia has already neutered consumer cards for many/most data center purposes by not using ECC and by providing so few FP64 units that double precision FLOPS is barely better than CPU SIMD.
it’s also not really a thing anymore because of the open kernel driver… at that point it’s just MIT licensed.
of course people continued to melt down about that for some reason too, in the customary “nothing is ever libre enough!” circular firing squad. Just like streamline etc.
There’s a really shitty strain of fanboy thought that wants libre software to be actively worsened (even stonewalled by the kernel team if necessary) so that they can continue to argue against nvidia as a bad actor that doesn’t play nicely with open source. You saw it with all these things but especially with the open kernel driver, people were really happy it didn’t get upstreamed. Shitty behavior all around.
You see it every time someone quotes Linus Torvalds on the issue. Some slight from 2006 is more important than users having good, open drivers upstreamed. Some petty brand preferences are legitimately far more important than working with and bringing that vendor into the fold long-term, for a large number of people. Most of whom don’t even consider themselves fanboys! They just say all the things a fanboy would say, and act all the ways a fanboy would act…
>Is it weird how the comments here are blaming AMD and not Nvidia?
Because it IS AMD/Apple/etcs fault for the position they're in right now. CUDA showed where the world was heading and where the gains in compute would be made well over a decade ago now.
They even had OpenCL, didn't put the right amount of effort into it, all the talent found CUDA easier to work with so built there. Then what did AMD, Apple do? Double down and try and make something better and compete? Nah they fragmented and went their own way, AMD with what feels like a fraction of the effort even Apple put in.
From the actions of the other teams in the game, it's not hard to imagine a world without CUDA being a world where this tech is running at a fraction of its potential.
It's always been on the straggler to catch up by cheating. That's just how the world works - even in open source. If AMD supported CUDA, it would have a bigger market share. That's a fact. Nvidia doesn't want that. That's a fact. But when Reddit started, it just scraped feeds from Digg, and when Facebook started, it let you link your MySpace credentials and scraped your MySpace account. Adversarial interoperability is nothing new.
Funnily, the one I blame the most for there not being real competition to CUDA is Apple. Of late, Apple has been really pushing vendor lock-in APIs rather than adopting open standards. The end result is you can get AMD and Intel on board with some standard, which is ultimately torpedoed by Apple. (See Apple departing from and rejecting everything that comes from the Khronos group.)
With the number of devs that use Apple silicon nowadays, I have to think that their support for Khronos initiatives like SYCL and OpenCL would have significantly accelerated progress and adoption in both.
We need an open standard that isn't just AMD specific to be successful in toppling CUDA.
Pretty much any modern NVIDIA GPU supports CUDA. You don't have to buy a datacenter-class unit to get your feet wet with CUDA programming. ROCm will count as "something" when the same is true for AMD GPUs.
ROCm supports current gen consumer GPUs officially and a decent chunk of recent gen consumer GPUs unofficially. Not all of them of course, but a decent chunk.
It's not ideal, but I'm pretty sure CUDA didn't support everything from day 1. And ROCm is AMD's vendor part of the Windows AI stack, so from the upcoming gen on out, basically anything that outputs video should support ROCm.
No, but CUDA at least supported the 8800 GT on release [1]. ROCm didn't support any consumer cards on release; looks like they didn't support any till last year? [2]
I don't think AMD needs to support 5+ year old GPUs personally. And all the recent generations are already practically supported.
AMD only claims support for a select few GPUs, but in my testing I find all the GPUs work fine if the architecture is supported. I've tested rx6600, rx6700xt for example and even though they aren't officially supported, they work fine on ROCm.
AMD had a big architecture switchover exactly 5 years ago, and the full launch wasn't over until 4.5 years ago. I think that generation should have full support. Especially because it's not like they're cutting support now. They didn't support it at launch, and they didn't support it after 1, 2, 3, 4 years either.
The other way to look at things, I'd say that for a mid to high tier GPU to be obsolete based on performance, the replacement model needs to be over twice as fast. 7700XT is just over 50% faster than 5700XT.
I'm on a 5+ year old GPU, because I don't trust AMD to offer a compelling GPU that actually works. An RX 570 is good enough for the little gaming I do. It mostly acts as an oversized iGPU that has good Linux drivers, but since AMD is not supporting ROCm on this GPU, there is no need to hurry on upgrading to a better GPU or to get my feet wet running things locally on the GPU like Stable Diffusion, LLMs, etc.
AMD's definition of "support" I think is different than what people expect, and pretty misleading - ROCm itself will run on almost anything, back as far as the RX 400/500 series:
There are out-of-bounds writes in the BLAS libraries for gfx803 GPUs (such as the RX 570). That hardware might work fine for your use case, but there's a lot of failures in the test suites.
I agree that the official support list is very conservative, but I wouldn't recommend pre-Vega GPUs for use with ROCm. Stick to gfx900 and newer, if you can.
The last time I checked, I was stuck with a pretty old kernel if I wanted to have the latest version of ROCm available for my RX 470. It was compatible at some point in time, but isn't kept compatible with recent kernels.
AMD should focus their efforts on competitive hardware offerings, because that is where the need and the money is. Sorry, I don't think the hobbyist should be a priority.
>Nvidia can make things arbitrarily difficult both technically and legally.
I disagree. AMD can simply not implement those APIs, similar to how game emulators implement the most-used APIs first and sometimes never bother implementing obscure ones. It would only matter that NVIDIA added e.g. patented APIs to CUDA if those APIs were useful, in which case AMD should have a way to do them anyway. Unless NVIDIA comes up with a new patented API which is both useful and impossible to implement in any other way, which would be bad for AMD in any event. On the other hand, if AMD starts supporting CUDA and people start using AMD cards, then developers will be hesitant to use APIs that only work on NVIDIA cards. Right now they are losing billions of dollars on this. Then again, they barely seem capable of supporting ROCm on their cards, much less CUDA.
You have a fair point in terms of cuDNN and cuBLAS but I don't know that that kind of ToS is actually binding.
You can patent the implementation. You can't patent the API name DecodeH265Video() but you can still sue someone for implementing that function correctly.
Agreed. Rather than making CUDA the standard, AMD should push/drive an open standard that can be run on any hardware.
We have seen this succeed multiple times: FreeSync vs GSync, DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal.
All of the big tech companies are obsessed with ring-fencing developers behind the thin veil of "innovation" - where really it's just good for business (I swear it should be regulated because it's really bad for consumers).
A CUDA translation layer is okay for now but it does risk CUDA becoming the standard API. Personally, I am comfortable with waiting on an open standard to take over - ROCm has serviced my needs pretty well so far.
Just wish GPU sharing with VMs was as easy as CPU sharing.
> AMD should push/drive an open standard that can be run on any hardware.
AMD has always been notoriously bad at the software side, and they frequently abandon their projects when they're almost usable, so I won't hold my breath.
we actually also saw this historically with openGL.
openGL comes from an ancient company whispered about by the elderly programmers (30+ years old) known as SGI. Originally it was CLOSED SOURCE and SGI called it "IRIS GL", after its IRIS workstations, which were cool looking with bright popping color plastic and faux granite keyboards. Good guy SGI opened up IRIS GL to become what we call "openGL" (get it, now it's open), and then it stuck.
That's all to say NVIDIA could pull an SGI and open their stuff, but they're going more Sony-style and trying to monopolize. Oh, and SGI also shipped another ancient lore library known as the SGI "STL" (Standard Template Library), which is like the original boost template metaprogramming granddaddy.
STL started even earlier, obviously without using the name "STL", as a library of generic algorithms for the programming language Ada (David R. Musser & Alexander A. Stepanov, 1987).
OpenCL was released in 2009. AMD has had plenty of time to push and drive that standard. But OpenCL had a worse experience than CUDA, and AMD wasn't up to the task in terms of hardware, so it made no real sense to go for OpenCL.
I agree with aspects of this take. In my original post I think that "should" is a strong word.
Realistically, companies only have an obligation to make themselves profitable, so really they "should" only strive for profitability within the boundaries of the law, above all else.
AMD have no obligation to drive an open standard, it's at their discretion to choose that approach - and it might actually come at the cost of profitability as it opens them up to competitors.
In this case - I believe that hardware & platform software companies that distribute a closed platform, one which cannot be genuinely justified as anything other than intending to prevent consumers from using competitor products, "should" be moderated by regulator intervention, as it results in a slower rate of innovation and poor outcomes for consumers.
That said, dreaming for the regulation of American tech giants is a pipe dream, haha.
> against the license agreement of cuDNN or cuBLAS to run them on this
They don’t run either of them, they instead implement an equivalent API on top of something else. Here’s a quote: “Open-source wrapper libraries providing the "CUDA-X" APIs by delegating to the corresponding ROCm libraries. This is how libraries such as cuBLAS and cuSOLVER are handled.”
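For anyone wondering what "delegating to the corresponding ROCm libraries" might look like, here's a rough sketch of the idea (not SCALE's actual code; rocblas_sgemm is the real rocBLAS routine, but the handle/enum/status types are simply assumed to be layout-compatible here, which a real wrapper would have to translate properly):

    // Hypothetical "CUDA-X" shim: export the cuBLAS v2 entry point and
    // forward it to rocBLAS. A real wrapper maps handle, enum, and status
    // types between the two APIs; this sketch pretends they line up.
    #include <rocblas/rocblas.h>

    extern "C" int cublasSgemm_v2(rocblas_handle handle,
                                  rocblas_operation transa, rocblas_operation transb,
                                  int m, int n, int k,
                                  const float* alpha,
                                  const float* A, int lda,
                                  const float* B, int ldb,
                                  const float* beta,
                                  float* C, int ldc) {
        return static_cast<int>(rocblas_sgemm(handle, transa, transb, m, n, k,
                                              alpha, A, lda, B, ldb, beta, C, ldc));
    }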
No, it's stranger than that: SCOTUS did not rule on copyrightability of APIs at all, but simply ruled that even if they are copyrightable, what Google did (completely reimplement Sun/Oracle's public API) was still fair use.
It would have been nice to get a clear SCOTUS precedent on this. On the other hand, I also value a SCOTUS which rules minimally and narrowly by default (I also appreciate SCOTUS' return to stricter constitutional grounding in the past decade).
Incredibly loud laughing from the lawyers whose study of law is being thrown around willy nilly because of all the unprecedented joke decisions they are making right now.
We are stuck between a rock and a hard place politically. The real decisions should be coming from Congress not the courts. However, Congress is too disorganized and disconnected to answer the important questions, leaving the courts to either muddle along or else become semi-dictatorial. In most countries, this would cause a constitutional crisis, but the modern U.S. system seems to be a little too resilient to such otherwise concerning signals.
We're far past a constitutional crisis, and the courts taking power nobody wanted to give to them (who wasn't interested in a unitary executive at least) isn't a good solution.
What constitutional crisis has occurred that hasn't been resolved?
Constitutional crises involve fundamental breaks in the working of government that bring two or more of its elements into direct conflict that can't be reconciled through the normal means. The last of these by my accounting was over desegregation, which was resolved with the President ordering the Army to force the recalcitrant states to comply. Before that was a showdown between the New Deal Congress and the Supreme Court, which the former won by credibly threatening to pack the latter (which is IMO a much less severe crisis but still more substantial than anything happening today). However, that was almost a century ago, and Congress has not been that coherent lately.
I would think the latest one where SCOTUS ruled that the president was a king except in matters where the SCOTUS decides they aren't counts as a constitutional crisis.
Constitutional crises are not a matter of opinion but of occurrence, arising from an actual power conflict between arms of the government that is caused by a conflicted reading of the constitutional text. Basically, if the system just ticks on, it's not a constitutional crisis.
If "I think this is a very bad decision" was cause for a constitutional crisis, any state with more than three digit population would be in constitutional crisis perpetually.
> Constitutional crises are not a matter of opinion but of occurrence, arising from an actual power conflict between arms of the government that is caused by a conflicted reading of the constitutional text. Basically, if the system just ticks on, it's not a constitutional crisis.
This happened as recently as 2021-01-06; strong evidence that the military subverted the president to call the National Guard into Washington DC and secure the electoral count.
That's close. Both the excessively long lame duck period (2 months for Congress and 2.5 months for the President) and disunity between the President and the rest of the executive branch have also been fodder for crises in the past (Marbury v Madison, Andrew Johnson's impeachment).
That is how the SC used to work: they would decide cases on the narrowest possible grounds. If they don't have to decide a tough question, but they can finesse it with something simpler, good enough. More recently they have been willing to tear up decades of established law on a regular basis.
Yes, "used to". Now, in 2024, the same supreme court has decided that presidents have immunity in all official acts, from stealing documents, up to and including assassination attempts on their opponents. This is a radical shift in how the court operates.
This "opponent assassination" hypothetical gets bandied about a lot but I have not seen any evidence that any court considers that to be an "official act". Official acts are constrained to legitimate exercises of constitutional authority and are not merely anything a President (or especially, an ex-President) does.
the only thing radical is the opinions of people you are listening to if you believe SCOTUS enabled legally sanctioned assassinations. It was political hyperbole based on nothing, and it worked (with you). Think for yourself.
> CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult [...] technically.
(Let's put the legal questions aside for a moment.)
nVidia changes GPU architectures every generation / few generations, right? How does CUDA work across those—and how can it have forwards compatibility in the future—if it's not designed to be technologically agnostic?
PTX is meant to be portable across GPU microarchitectures. That said, Nvidia owns the entire spec, so they can just keep adding new instructions that their GPUs now support but AMD GPUs don't.
One way is to make sure the hardware team does certain things to support easy transition to new architectures, we have seen this with Apple Silicon for example!
Well, they kinda have it with their hipify tool, although this is for porting CUDA code to AMD's HIP, which supports both AMD and NVIDIA. This supports CUDA C code and libraries with AMD equivalents like cuDNN, cuBLAS, cuRAND, but doesn't support porting of CUDA C inline PTX assembler. AMD have their own inline GCN assembler, but seem to discourage its use.
There are also versions of PyTorch, TensorFlow and JAX with AMD support.
PyTorch's torch.compile can generate Triton (OpenAI's GPU compiler) kernels, with Triton also supporting AMD.
CUDA is the juice that built Nvidia in the AI space and allowed them to charge crazy money for their hardware. Being able to run CUDA on cost-effective AMD hardware would be a big leap forward, allowing more people to do research and breaking away from Nvidia's stranglehold over VRAM. Nvidia will never open source their own platform unless their hand is forced. I think we all should support this endeavor and contribute where possible.
Before starting, AMD signed an agreement with Intel that gave them an explicit license to x86. And x86 was a whole lot smaller and simpler back then in 1982. A completely different and incomparable situation.
Technically it was after starting - AMD was founded in 1969 as a second-sourcer for Fairchild and National Semiconductor, and had reverse-engineered the 8080 by 1975 and acquired a formal license to it by 1976.
The 1982 deal you speak of was actually pretty interesting: as a condition of the x86's use in the IBM PC, IBM requested a second source for x86 chips. AMD was that source, and so they cross-licensed the x86 in 1982 to allow the IBM PC project to proceed forward. This makes the Intel/AMD deal even more important for both companies: the PC market would never have developed without the cross-licensing, which would've been bad for all companies involved. This gave Intel an ongoing stake in AMD's success at least until the PC market consolidated on the x86 standard.
Was there a large entity steering the x86 spec alone, with a huge feature lead over their competition, free to steer the spec in any way they chose? Also, hardware is not open-source software: get big players on board and they will be able to implement the spec they want every gen; software has more moving parts and unaligned parties involved.
I hadn't considered that angle. Is your point that Intel was the creator of x86, but software chose to support it, so AMD had nothing else to do but play catch-up in x86 support to be part of the software target market? If so and factual (I've no idea), fair point, I didn't know.
And Intel named its licensed implementation of AMD64 "IA-32e", just to make it clear to everyone that it was based on the 32-bit Intel architecture with an extension.
Luckily they dropped that name a few years later.
Isn't cuDNN a much better case for reimplementing than CUDA? It has much more choice in how things actually happen and cuDNN itself chooses different implementations at runtime + does fusing. It seems way more generic and the reimplementation would allow using the best AMD-targeted kernel rather than one the original has.
And that thing has been left unreleased on Windows for almost a whole year for unknown reasons, even though there is activity on GitHub and builds are fixed frequently. There is just no .exe or .msi for you to download. In fact, ROCm for Linux is on a major 6 release (which includes MIOpen), but somehow Windows is still on major 5 (without MIOpen), for almost a whole year.
It almost makes me wonder: is there a shady deal somewhere asking AMD to never release the SDK for Windows, to keep the price of Nvidia cards higher? Why do they keep developing these without releasing them at all?
Since they cancelled the work on ZLUDA and absolutely failed to do anything about other options, I really believe there's some "don't do it or you'll get sued to hell and back" agreement. They can't be so dumb that they just missed it by accident.
I really hope they will do what you suggested. With some innovative product placement, GPUs with a lot of memory for example, they could dethrone nvidia if it doesn't change strategy.
That said, easier said than done. You need very specialized developers to build a CUDA equivalent and have people start using it. AMD could do it with a more open development process leveraging the open source community. I believe this will happen at some point anyway by AMD or someone else. The market just gets more attractive by the day and at some point the high entry barrier will not matter much.
So why should AMD skimp on their ambitions here? This would be a most sensible investment, few risks and high gains if successful.
That is why an open standard should be created that isn't locked to a particular piece of hardware, and that allows modular support for different hardware to interface through supported drivers.
Given AMD's prior lack of interest, I'll take whatever options there are. My daily driver has a Vega 10 GPU and it's been quite frustrating not to be able to easily leverage it for doing basic ML tasks, to the point that I've been looking at buying an external Nvidia GPU instead just to try out some of the popular Python libraries.
The US law is highly dependent on precedents. The Google-Oracle case has set one fortunately, so anything following it won't start from scratch. Fortunately we may not need a closer judge.
Google-Oracle side stepped the issue of API copyrightability by saying Google's particular implementation would fall under fair use. Whether APIs are copyrightable remains an open question.
Ya, honestly better to leave that to third parties who can dedicate themselves to it and maybe offer support or whatever. Let AMD work on good first party support first.
I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...
Presumably that stuff doesn't "just work" but they don't want to mention it?
A lot of our hw-aware bits are parameterized, where we fill in constants based on the available hw. Doable to port, same as we do whenever new Nvidia architectures come out.
But yeah, we have tricky bits that inline PTX, and... that will be more annoying to redo.
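On the parameterized-constants point, for anyone unfamiliar: that usually just means querying the device at startup instead of hard-coding per-architecture numbers. A minimal sketch (standard CUDA runtime attribute query; the halving heuristic is made up), and HIP exposes the same thing via hipDeviceGetAttribute:

    #include <cuda_runtime.h>

    // Pick a shared-memory tile budget from what the device actually offers,
    // rather than baking in a per-architecture constant.
    static size_t pick_tile_bytes(int device) {
        int smem_per_block = 0;
        cudaDeviceGetAttribute(&smem_per_block,
                               cudaDevAttrMaxSharedMemoryPerBlock, device);
        // Leave headroom for the kernel's other shared-memory usage.
        return static_cast<size_t>(smem_per_block) / 2;
    }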
Oh, that will be interesting to understand, as PTX gets more into trickier hw-arch-specific phenomena that different brands disagree on, like memory models. Neat!
Looks like the PTX translation is via another project ZLUDA, though how they bridge the differences in memory/consistency/etc models safely remains unclear to me...
SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm blocks to LLVM IR.
To put in a less compiler-engineer-ey way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.
Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).
Differing consistency _between architectures_ is the AMDGPU backend's problem.
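As a toy illustration of the "hypothetical equivalent C++" idea (not taken from SCALE, which lowers to LLVM IR rather than source): a common inline-PTX idiom like reading the lane ID has a plain CUDA equivalent a frontend could emit instead.

    // What a lot of hand-written CUDA does today: read %laneid via inline PTX.
    __device__ unsigned lane_id_ptx() {
        unsigned lane;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
        return lane;
    }

    // A portable equivalent (for 1-D thread blocks) that needs no PTX at all;
    // the warpSize builtin hides the fact that it's 32 on NVIDIA and can be
    // 64 on AMD hardware.
    __device__ unsigned lane_id_portable() {
        return threadIdx.x % warpSize;
    }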
Again, I'm guessing there might be an equiv simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for safely doing so.
===
FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, etc. libs), minimal CUDA, or minimal OpenCL, with only small traces of inline PTX. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable.
It's supposed to be waiting for all threads to vote
I'm not familiar with AMD enough to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD commands.
Yeah I think, after this snippet was written, cuda added __all_sync as an intrinsic. The divergent code before this was plain-ish cuda, and this snippet ensures they wait on the comparison vote before recurring.
So in the AMD version, the compiler correctly realized the synchronization was on the comparison, so adds the AMD version right before it. That seems like a straightforward transform here.
It'd be interesting to understand the comparison of what Nvidia primitives map vs what doesn't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, I'd expect most of our PTX to port for similar reasons. The story is a bit diff for libraries we call. E.g., cudf probably has little compute-tier ptx directly, but will call nvidia libs, and use weird IO bits like cufile / gpu direct storage.
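For readers who haven't met the warp-vote intrinsics, the pattern being described is roughly this (a generic sketch, not the snippet under discussion; refine() is a made-up stand-in for the divergent per-lane work, and HIP exposes a similar __all() wavefront-wide vote):

    // Each lane does its own work, then the whole warp votes on whether
    // every lane has converged before anyone moves on.
    __device__ float refine(float* value) {
        float old = *value;
        *value = 0.5f * (*value + 1.0f / *value);   // toy Newton step
        return fabsf(*value - old);
    }

    __device__ void iterate_until_converged(float* value, float tol) {
        bool done = false;
        while (!done) {
            float delta = refine(value);
            // Warp-wide vote: true only once every lane's delta is below tol.
            done = __all_sync(0xffffffffu, delta < tol);
        }
    }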
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?
Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.
But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!
Or we add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend, if an instruction exists and the backend knows about it.
If there's no instruction, either, you can write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (depends on the situation).
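To make the "write a C++ function to replicate the behaviour" point concrete, here's a deliberately naive sketch of such a fallback (the name and the 16x16x16 shape are made up, and it ignores how real wmma/mfma fragments are distributed across the lanes of a warp; it only shows the arithmetic being replicated):

    // Naive emulation of a matrix-multiply-accumulate: D = A*B + C,
    // with all operands as flat 16x16 row-major tiles. A real lowering
    // would also have to reproduce the per-lane fragment layout.
    __device__ void emulate_mma_16x16x16(const float* A, const float* B,
                                         const float* C, float* D) {
        for (int i = 0; i < 16; ++i)
            for (int j = 0; j < 16; ++j) {
                float acc = C[i * 16 + j];
                for (int k = 0; k < 16; ++k)
                    acc += A[i * 16 + k] * B[k * 16 + j];
                D[i * 16 + j] = acc;
            }
    }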
It's speculation, but I think it's similar with processors: nobody guarantees the code will run exactly the way you set it up. You may want to use some specific register, but if the processor thinks it has another register that can fulfill the task, it'll use that while telling you your code executed as expected. Maybe the internal GPU processor of AMD can sufficiently simulate the behavior of Nvidia hardware so that higher abstractions are unaware that something different is happening under the hood.
Pretty much. Compilers can do a lot more than people give them credit for. At least AMD document their hardware, so it is actually possible to know low-level details. PTX can obfuscate that surprisingly badly for Nvidia targets.
Makes sense to expect this kind of thing to be open source. The whole point of providing improved compatibility is to make people’s lives easier, and open source is usually an important feature to ensure wide compatibility. It also means projects can live on after the creators move to other things, people can submit patches for important features or bug fixes, and generally makes the system much more useful.
I don't find it wrong for someone to attempt to make money back on their time and experience of doing the work. I don't mind people that offer that back as open source either. However, I do have a problem of people expecting everything to be open/free, especially those that then go on a crusade chastising those that do try to make money.
I'm really trying to keep this about the engineering features of a system rather than moral judgments. Open source systems are simply more flexible and adaptable than proprietary systems, which have their own benefits. In today's world, the engineering value of open source systems is becoming so important that people are looking for other ways to provide for the developers creating these systems. It can be surprising when a project creator builds something in an area that is usually all open source, but they choose a proprietary path. Just look at the problems created by NVIDIA for their use of proprietary software in CUDA and their GPUs. This software is an attempt to fix issues created by proprietary software with another piece of proprietary software, which is if nothing else an interesting decision.
UNIX wasn't free. Windows wasn't free. It wasn't until some knucklehead came along and did something abnormal and gave away their thing. Bakers don't give away their goods. Mechanics don't typically repair things for free. Builders don't build things for free. Gas stations don't give away gas.
Why do we think all software should be free, and then think that those that don't give it away are the abnormal ones?
Because software is information. It is closer to a scientific paper than a loaf of bread, and I do expect those to be free. I do not expect scientists to work for free, but the marginal cost of copying their output is 0 and the social benefit is huge.
Free software, like open science, clearly has something going for it pragmatically. The developer hours put into it have paid for themselves many times over. Megacorps hire people to work on free software. If you can't see the value, that's a you problem.
Free software is so important to society that I believe the most reasonable solution is to provide for all people without their needing to work for survival. Automate as much as possible so that work is not compulsory, and enough people will simply want something to do (and possibly additional pay, depending on how the system is arranged) that everything that needs to get done by people does get done.
For now that is fiction, but so is "if all software was free". I do think though that both would lead to a faster rate of innovation in society versus one where critical information is withheld from society to pay someone's rent and food bills.
Most software is free and makes no money - and that has always been the case. There are some very popular and widely-used non-free systems, but most software isn't that, and its developers still pay the bills.
This is somewhat analogous to music or books/literature. Most composers and performers and authors make no money from people copying and sharing their works. Some pay the bills working professionally for entities who want their product enough to pay for it; some do other things in life. Some indeed give up their work on music because they can't afford to not do more gainful work. And still, neither music nor books go away as copying them gets closer to being free.
If my current employer can't make any money from the code we write, then it would collapse faster than a soufflé taken out of the oven too early, and I would be out of a job
It will be interesting to see if this is the case in the long run, assuming "huge" has a positive connotation in your post, of course.
If AGI comes to pass and it winds up being a net negative for humanity, then the ethics of any practice which involves freely distributing information that can be endlessly copied for very little cost must be reevaluated.
Increasingly, I am not putting much weight in any predictions about whether this will happen in the way we think it will, or what it could possibly mean. We might as well be talking about the rapture.
Why do people return Windows laptops when they have to pay for a Windows License Activation? Because every single OEM pays for it; you don't expect to buy Windows because it is a failed B2C business model. Nobody wants it. Same goes for proprietary UNIX, and people wish it was the case for Nvidia drivers. I own CUDA hardware and lament the fact that cross-industry GPGPU died so FAANG could sell licensed AI SDKs. The only thing stopping AI from being "free" is the limitations OEMs impose on their hardware.
> that those that don't give it away are the abnormal ones?
They are. Admit it; the internet is the new normal, if your software isn't as "free" as opening a website, you're weird. If I have to pay to access your little forum, I won't use it. If I have to buy your app to see what it's like, I'll never know what you're offering. Part of what makes Nvidia's business model so successful is that they do "give away" CUDA to anyone that owns their hardware. There is no developer fee or mandatory licensing cost, it is plug-and-play with the hardware. Same goes for OpenAI, they'd have never succeeded if you had to buy "the ChatGPT App" from your App Store.
> Why do people return Windows laptops when they have to pay for a Windows License Activation?
The internet echo chamber strikes again. Exactly how many people are actually doing this? Not many, and those that do all hang out together. The rest of the world just blindly goes about their day using Windows while surfing the web using Chrome. Sometimes it's a good thing to get outside your bubble. It's a big world out there, and not everybody sees the world as you do.
> The rest of the world just blindly goes about their day using Windows while surfing the web using Chrome.
Paying for Windows? I think you missed my point. If your computer doesn't ship with an OS, paid or otherwise, people think it's a glitch. The average consumer will sooner return their laptop than buy a Windows license, create install media from their old device, and flash the new hardware with a purchased license. They'll get a Chromebook instead; people don't buy Windows today.
The internet has conditioned the majority of modern technology users to reject and habitually avoid non-free experiences. Ad-enabled free platforms and their pervasive success is all the evidence you need. Commercial software as it existed 20 or 30 years ago is a dead business. Free reigns supreme.
That is kind of his point: you don't pay for it, Windows is bundled with the laptop. Not that I agree with his points. Windows, for example, isn't open source in the remotest sense.
Dell offers laptops with a version of Linux preinstalled and supports them. System76, Lenovo, Purism as well to name a few. Apple also sells laptops without Windows on them. There are actually quite a few options that do this. If you don't want Windows, we have options now. Yes, historically, it was Windows or Apple's OS, but that's no longer true and not recognizing that just makes you look like you're pushing a false narrative on the situation for what purpose only you know.
> Commercial software as it existed 20 or 30 years ago is a dead business. Free reigns supreme.
What nonsense. Go into any business and you will find every single piece of software they use is bought and paid for with bells on. The 'Free World' you speak of is only there to get you, an individual, used to using the software so that businesses are made to purchase it. In the old days we called this 'demo' or 'shareware'. Now it's a 'free' or 'personal' tier subscription.
Go and ask any designer if their copy of Adobe Creative Cloud, 3ds Max, or AutoCAD is free. Ask any office worker about Microsoft Office (including Teams, SharePoint, etc.) or even Google Docs for business. The majority of developers are running paid versions of JetBrains tools. Running an online shop? Chances are you are paying for Shopify, or something like Zoho to manage your customers and orders.
'Free' as you put it is very much only in the online individual consumer world, a very small part of the software world.
The commercial software market is more alive and expensive than it has ever been.
> Bakers don't give away their goods. Mechanics don't typically repair things for free. Builders don't build things for free. Gas stations don't give away gas.
These all share the property of being scarce physical goods or services. Software is not scarce (though of course the labor to create it is), so this is a really bad comparison.
And again I did not say it should or should not be free, I said there are engineering benefits to open source software and more and more people recognize those benefits and choose to make things free because they see the value and are willing to recognize the tradeoffs. I never said what "should" be done. "Should" is kind of a nonsense term when used in this way as it hides a lot of assumptions, so I generally do not use it, and notably did not use it in my comment. I want to point out the peculiarity in your rather strong response to a word and concept I never used. I think you are having an argument with imagined people, not a discussion with me.
And for what it is worth, I am a robotics engineer and I am designing a completely open source solar powered farming robot designed to be made in a small shop in any city in the world (see my profile), funded by a wealthy robotics entrepreneur who recognizes the value in making this technology available to people all over the world.
So I am one of those engineers making this choice, not someone asking for things without offering the same from my own work. Everything I produce is open source, including personal projects and even my personal writing.
On the other hand, recipes and drawings are commonly available for free. So if you can do the work yourself, the cake or the engine repair is free. But if you need support, you can pay someone to bake or build for you.
> Makes sense to expect this kind of thing to be open source. The whole point of providing improved compatibility is to make people’s lives easier, and open source is usually an important feature to ensure wide compatibility. It also means projects can live on after the creator
AMD just bought a company working on similar things for more than $600M.
We're going to be publishing more details on later blog posts and documentation about how this works and how we've built it.
Yes, we're not open source, however our license is very permissive. It's both in the software distribution and viewable online at https://docs.scale-lang.com/licensing/
> I'm curious, for what reasons are you interested in the source code yourself?
I am the founder/editor of PLDB. So I try to do my best to help people "build the next great programming language".
We clone the git repos of over 1,000 compilers and interpreters and use cloc to determine which languages the people who build languages are using. The people who build languages are obviously the experts, so as they go, so goes the world.
We call this measurement "Foundation Score". A Foundation Score of 100 means 100 other languages use this language somehow in their primary implementation.
It is utterly dominated by open source languages, and the disparity is only getting more extreme.
Some that might have become irrelevant have gained a second wind after going open source.
But some keep falling further behind.
I look at Mathematica, a very powerful and amazing language, and it makes me sad to see so few other language designers using it, and the reason is that it's closed source. So they are not doing so hot, and that's a language from one of the world's smartest and most prolific thinkers that's been around for decades.
I don't see a way for a new language to catch on nowadays that is not open source.
We do believe in open source software and we do want to move the GPGPU market away from fully closed languages. The future is open for discussion but regardless, the status-quo at the moment is a proprietary and dominant implementation which only supports a single vendor.
> I don't see a way for a new language to catch on nowadays that is not open source.
I do note that CUDA is itself closed source -- while there's an open source implementation in the LLVM project, it is not as bleeding edge as NVIDIA's own.
And this is a good point. However, it also has a 17 year head start, and many of those years were spent developing before people realized what a huge market there was.
All it will take is one committed genius to create an open source alternative to CUDA to dethrone it.
But they would have to have some Mojo (hint hint) to pull that off.
I'm not the person you replied to, and I can't speak for them. But I can say that for myself, and a not small number of other people, it's an ideological issue. I simply do not use software that isn't F/OSS - to the greatest extent that that is possible. For me, I might use a VERY small amount of non F/OSS stuff, but it's very hard to get me to adopt something new if it isn't.
Now should you make business decisions based on that? Probably not. But while I don't claim to be a representative sample, I am pretty sure the number of people who share my beliefs in this regard is substantially "non zero". shrug
I'm a big fan of open source for most things, but if what you've got actually works, you could probably earn big money selling it. The biggest companies in the world are building / using this sort of thing.
Imagine the shift of capital if for example, Intel GPUS suddenly had the same ML software compatibility as Nvidia
This. Since Intel and AMD haven't been able to produce a good solution to Nvidia's moat yet, this should be worth serious money to them. No need to give it away for free.
On the other hand, if they want better adoption (which would drive sales of their hardware), then Intel / AMD should make a deal to release it as open source. Closed source will make some profit, but not that much. If this thing really means that everything can run on AMD GPU cards today, then this is a game changer and is worth a lot.
> I don't see a "buy now" button or a PCIe version anywhere here
"Buy now" buttons and online shopping carts are not generally how organizations looking to spend serious money on AI buy their hardware.
They have a long list of server hardware partners, and odds are you'd already have an existing relationship with one or more of them, and they'd provide a quote.
They even go one step further and show off some of their partners' solutions:
FWIW I believe Supermicro and Exxact actually do have web-based shopping carts these days, so maybe you could skip the quotation and buy directly if you were so motivated? Seems kind of weird at this price point.
They could break the trend and offer a "buy now" button instead of offering quotes and coffee chats. It's very likely that will kickstart the software snowball with early adopters.
Nobody is going to drop millions on an unproven platform.
> Seems kind of weird at this price point.
Yeah that $234K server is too much for people to do a trial. It has 8xMI300X GPUs along with a bunch of other shit.
Give me a single MI300X GPU in PCIe form factor for $20K and I'd very seriously consider. I'm sure there are many people who would help adapt the ecosystem if they were truly available.
> Give me a single MI300X GPU in PCIe form factor for $20K and I'd very seriously consider. I'm sure there are many people who would help adapt the ecosystem if they were truly available.
I know this isn't what you're looking for entirely, but my business, Hot Aisle, is working on making MI300x available for rental. Our pricing isn't too crazy given that the GPU has 192GB and one week minimum isn't too bad. We will add on-demand hourly pricing as soon as we technically can.
I'm also pushing hard on Dell and AMD to pre-purchase developer credits on our hardware, that we can then give away to people who want to "kick the tires".
Why would you be looking to dip your toe into the AMD ecosystem for the first time using an MI300X? It doesn't make any sense. It's not entry level hardware.
I'm not looking to enter the ecosystem, I'm already deep in it and want to fix the AMD problem so that I can build big projects around it and undercut everyone who's using Nvidia.
You can purchase H100 and A100 PCIe cards over the counter. They're great for compiling CUDA code, testing code before you launch a multi-node job into a cluster, and for running evaluations.
AMD has nothing of the sort, and it's hurting them.
I cannot blow 250K on an SMCI server, nor do I have the electricity setup for it. I can blow 20K on a PCIe GPU and start contributing to the ecosystem, or maybe prove out an idea on one GPU before trying to raise millions from a VC to build a more cost-effective datacenter that actually works.
The main cause of Nvidia's crazy valuation is AMD's unwillingness to invest in making its GPUs as useful as Nvidia's for ML.
Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.
I like to watch YouTube retrospectives on old failed tech companies - LGR has some good ones.
When I think of AMD ignoring machine learning, I can't help imagine a future YouTuber's voiceover explaining how this caused their downfall.
There's a tendency sometimes to think "they know what they're doing, they must have good reasons". And sometimes that's right, and sometimes that's wrong. Perhaps there's some great technical, legal, or economic reason I'm just not aware of. But when you actually look into these things, it's surprising how often the answer is indeed just shortsightedness.
They could end up like BlackBerry, Blockbuster, Nokia, and Kodak. I guess it's not quite as severe, since they will still have a market in games and therefore may well continue to exist, but it will still be looked back on as a colossal mistake.
Same with Toyota ignoring electric cars.
I'm not an investor, but I still have stakes in the sense that Nvidia has no significant competition in the machine learning space, and that sucks. GPU prices are sky high and there's nobody else to turn to if there's something about Nvidia you just don't like or if they decide to screw us.
In fairness to AMD, they bet on crypto, and nvidia bet on AI. Crypto was the right short term bet.
Also, ignoring is a strong word: I’m staring at a little << $1000, silent 53 watt mini-PC with an AMD SoC. It has an NPU comparable to an M1. In a few months, with the ryzen 9000 series, NPUs for devices of its class will bump from 16 tops to 50 tops.
I’m pretty sure the linux taint bit is off, and everything just worked out of the box.
Toyota is extremely strong in the hybrid car market, and with ravenous competition for electric cars and slowing demand Toyota may have made the right decision after all
There's also just the idea of endeavour - Nvidia tried something, and it worked. Businesses (or rather their shareholders) take risks with their capital sometimes, and it doesn't always work. But in this case it did.
> I think this could be cultural differences, AMD's software department is underfunded and doing poorly for a long time now.
Rumor is that ML engineers (which AMD really needs) are expensive, and AMD doesn't want to give them more money than the rest of the SWEs they have (for fear of pissing off the existing SWEs). So AMD is caught in a bind: it can't pay to get top MLE talent and can't just sit by and watch NVDA eat its lunch.
> So AMD is caught in a bind: can't pay to get top MLE talent and can't just sit by and watch NVDA eat its lunch.
This isn't being caught in a bind. This is, if true, just making a poor decision. Nothing is really preventing them from paying more for specialized work.
I find this strange to believe. Every big company has levels, unless your existing L7+ IC is below market, you can just pull L7+ salaried ML engineers with some secret signing bonus like literally everyone else.
The dirty secret in the tech industry is that AMD, Intel, IBM, and historically Nvidia/Oracle (this changed post-2022) were the 2nd- or 3rd-tier tech companies. Staffed heavily by FAANG rejects, their people were still happy to have their 100-200K in their MCOL areas, but with no free food and a much more boring work culture. Intel's "great place to work" corporate propaganda was known as "great place to leetcode" while I worked there, as Intel was always seen as a stepping stone before you "made it" at a FAANG.
Culturally, none of these companies were happy to pay anyone except the tip, top "distinguished" engineers more than 300K. AMD seems to be stuck in this mentality, just as IBM is.
> AMD seems to be stuck in this mentality, just as IBM is.
And that's why creative destruction is essential for technological progress. It's common for organizations to get stuck in stable-but-suboptimal social equilibria: everyone knows there's a problem but nobody can fix it. The only way out is to make a new organization and let the old one die.
AMD fears anti-collusion action. Remember, the CEOs of the two companies are just barely distant enough in kinship not to be automatically considered colluding with each other.
The companies' CEO's are related. My conspiracy theory is that they don't want to step on each other's toes. Not sure if that works with fiduciary duty, though.
It does not conflict. Fiduciary duty for a for-profit organisation is not "profit at all costs", it's "you have to care about the company (care), you have to do good business (good faith) and you can't actively waste investors' and shareholders' money to intentionally lose out (loyalty)".
If they are found colluding due to nepotism, both will get a very swift revocation of business licence and a huge prison term. Remember they are just one step of kinship away from presumed collusion.
I worked for spectral compute a few years ago. Very smart and capable technical team.
At the time, not only did they target AMD (with less compatibility than they have now), but also outperformed the default LLVM ptx backend, and even NVCC, when compiling for Nvidia GPUs!
I don't understand how AMD has messed up so badly that I feel like celebrating a project like this. Features of my laptop are just physically there but not usable, particularly in Linux. So frustrating.
AMD hardware works fine, the problem is that the major research projects everyone copies are all developed specifically for Nvidia.
Now AMD is spinning up CUDA compatibility layer after CUDA compatibility layer. It's like trying to beat Windows by building another ReactOS/Wine. It's an approach doomed to fail unless AMD somehow manages to gain vastly more resources than the competition.
Apple's NPU may not be very powerful, but many models have been altered specifically to run on them, making their NPUs vastly more useful than most equivalently powerful iGPUs. AMD doesn't have that just yet, they're always catching up.
It'll be interesting to see what Qualcomm will do to get developers to make use of their NPUs on the new laptop chips.
I don't know if I would call it a mess up. AMD still has massive market in server chips, and their ARM stuff is on the horizon. We all assume that graphics cards are the way forward for ML, which may not be the case in the future.
Nvidia were just ahead in this particular category due to CUDA, so AMD may have just let them run with it for now.
Same boat, AMD CPU but nothing else. I feel like a moderate improvement of their FOSS support, drivers would open new hardware revenue - to say nothing about the AI channel.
It’s great that there is a page about current limitations [1], but I am afraid that what most people describe as “CUDA” is a small subset of the real CUDA functionality. Would be great to have a comparison table for advanced features like warp shuffles, atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.
You're right that most people only use a small subset of CUDA: we prioritized support for features based on what was needed for various open-source projects, as a way to try to capture the most common things first.
A complete API comparison table is coming soon, I believe. :D
In a nutshell:
- DPX: Yes.
- Shuffles: Yes. Including the PTX versions, with all their weird/wacky/insane arguments.
- Atomics: yes, except the 128-bit atomics nvidia added very recently.
- MMA: in development, though of course we can't fix the fact that nvidia's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.
- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.
> mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.
We plan to publish a compatibility table of which instructions are supported, but a list of the instructions used to produce each PTX instruction is not in general meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis due to the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns that this implies!
We're documenting which ones are expected to perform worse than on NVIDIA, though!
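As a small, concrete example of the kind of inline PTX block this applies to (this is the standard lane-id pattern from NVIDIA's inline-PTX documentation, not anything SCALE-specific):

```cuda
// Inline PTX reading the %laneid special register.
__device__ unsigned lane_id() {
    unsigned id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}
// Once the asm block is converted to IR at the start of compilation, this is
// no different from having written, say, threadIdx.x & 31 for a 1-D block:
// it inlines, folds, and participates in optimisation like any other code.
```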
> You're right that most people only use a small subset of cuda
This is true first and foremost for the host-side API. From my StackOverflow and NVIDIA forums experience - I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, with issues which nobody seems to have stumbled onto before; or at least - not stumbled and wrote anything in public about it.
Oh yes, we found all kinds of bugs in Nvidia's cuda implementation during this project :D.
There's a bunch of pretty obscure functions in the device side apis too: some esoteric math functions, old simd "intrinsics" that are mostly irrelevant with modern compilers, etc.
But I can't help but think: if something like this can be done to this extent, I wonder what went wrong / why it's such a struggle for OpenCL to unify the two fragmented communities. While this is very practical and has a significant impact for people who develop GPGPU/AI applications, for the heterogeneous computing community as a whole, relying on/promoting a proprietary interface/API/language to become THE interface to work with different GPUs sounds like bad news.
Can someone educate me on why OpenCL seems to be out of scene in the comments/any of the recent discussions related to this topic?
Opencl gives you the subset of capability that a lot of different companies were confident they could implement. That subset turns out to be intensely annoying to program in - it's just the compiler saying no over and over again.
Or you can compile as freestanding c++ with clang extensions and it works much like a CPU does. Or you can compile as cuda or openmp and most stuff you write actually turns into code, not a semantic error.
Currently cuda holds the lead position, but it should lose that place because it's horrible to work in (and, to a lesser extent, because more than one company knows how to make a GPU). Openmp is an interesting alternative - you need to be a little careful to get fast code out, but lots of things work somewhat intuitively.
Personally, I think raw C++ is going to win out and the many heterogeneous languages will ultimately be dropped as basically a bad idea. But time will tell. Opencl looks very DoA.
ZLUDA is pretty good, except that it lacks cuDNN which makes most PyTorch projects just not work. Not sure if this project does cover that? That could be a game changer, otherwise yeah ZLUDA is the better open-source option.
This is technically feasible so might be the real thing. Parsing inline ptx and mapping that onto amdgpu would be a huge pain.
Working from cuda source that doesn't use inline ptx to target amdgpu is roughly regex find and replace to get hip, which has implemented pretty much the same functionality.
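For readers who haven't seen it, the find-and-replace really is that mechanical for plain runtime-API code. A tiny hedged example (error handling omitted), with the hipify-style renames noted in comments:

```cuda
#include <cuda_runtime.h>   // HIP build would use <hip/hip_runtime.h>

__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                    // device code: unchanged
}

void scale_on_gpu(float* host, int n) {
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));             // -> hipMalloc
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);              // -> hipMemcpy, hipMemcpyHostToDevice
    scale_kernel<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);  // launch syntax: unchanged
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);              // -> hipMemcpy, hipMemcpyDeviceToHost
    cudaFree(dev);                                   // -> hipFree
}
```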
Some of the details would be dubious, e.g. the atomic models probably don't match, and volta has a different instruction pointer model, but it could all be done correctly.
Amd won't do this. Cuda isn't a very nice thing in general and the legal team would have kittens. But other people totally could.
Mapping inline ptx to AMD machine code would indeed suck. Converting it to LLVM IR right at the start of compilation (when the initial IR is being generated) is much simpler, since it then gets "compiled forward" with the rest of the code. It's as if you wrote C++/intrinsics/whatever instead.
Note that nvcc accepts a different dialect of C++ from clang (and hence hipcc), so there is in fact more that separates CUDA from hip (at the language level) than just find/replace. We discuss this a little in [the manual](https://docs.scale-lang.com/manual/dialects/)
Handling differences between the atomic models is, indeed, "fun". But since CUDA is a programming language with documented semantics for its memory consistency (and so is PTX) it is entirely possible to arrange for the compiler to "play by NVIDIA's rules".
Huh. Inline assembly is strongly associated in my mind with writing things that can't be represented in LLVM IR, but in the specific case of PTX - you can only write things that ptxas understands, and that probably rules out wide classes of horrendous behaviour. Raw bytes being used for instructions and for data, ad hoc self modifying code and so forth.
I believe nvcc is roughly an antique clang build hacked out of all recognition. I remember it rejecting templates with 'I' as the type name and working when changing to 'T', nonsense like that. The HIP language probably corresponds pretty closely to clang's cuda implementation in terms of semantics (a lot of the control flow in clang treats them identically), but I don't believe an exact match to nvcc was considered particularly necessary for the clang -x cuda work.
The ptx to llvm IR approach is clever. I think upstream would be game for that, feel free to tag me on reviews if you want to get that divergence out of your local codebase.
I certainly would not attempt this feat with x86 `asm` blocks :D. PTX is indeed very pedestrian: it's more like IR than machine code, really. All the usual "machine-level craziness" that would otherwise make this impossible is just unrepresentable in PTX (though you do run into cases of "oopsie, AMD don't have hardware for this so we have to do something insane").
It's a beautiful answer to a deeply annoying language feature. I absolutely love it. Yes, inline asm containing PTX definitely should be burned off at the compiler front end, regardless of whether it ultimately codegens as PTX or something else.
I've spawned a thread on the LLVM Discourse asking if anyone else wants that as a feature https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-feature... in the upstream. That doesn't feel great - you've done something clever in a proprietary compiler and I'm suggesting upstream reimplement it - so I hope that doesn't cause you any distress. AMD is relatively unlikely to greenlight me writing it, so it's probably just more marketing unless other people are keen to parse asm in string literals.
nvcc is nowhere near that bad these days, it supports most C++ code directly (for example, I've written kernels that include headers like <span> or <algorithm> and they work just fine).
NVCC is doing much better than before in terms of "broken C++". There was indeed a time when lots of modern C++ just didn't work.
Nowadays the issues are more subtle and nasty. Subtle differences in overload resolution. Subtle differences in lambda handling. Enough to break code in "spicy" ways when you try to port it over.
What do you think the source of this is? My understanding was that Nvidia is basically adopting the clang frontend wholesale now so I'm curious where it differs.
At my workplace, we were reluctant to choose between writing OpenCL and being AMD-compatible but missing out on CUDA features and tooling, and writing CUDA and being vendor-locked.
Our jerry-rigged solution for now is writing kernels that are the same source for both OpenCL and CUDA, with a few macros doing a bit of adaptation (e.g. the syntax for constructing a struct). This requires no special library or complicated runtime work - but it does have the downside of forcing our code to be C'ish rather than C++'ish, which is quite annoying if you want to write anything that's templated.
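A minimal sketch of what such shared-source adaptation macros can look like (the macro names here are invented for illustration, not the actual ones):

```cuda
// One kernel source that compiles as either CUDA or OpenCL C.
#ifdef __CUDACC__
  #define KERNEL      extern "C" __global__
  #define GLOBAL_MEM  /* CUDA pointers are global by default */
  #define GET_GID()   (blockIdx.x * blockDim.x + threadIdx.x)
#else   /* OpenCL C */
  #define KERNEL      __kernel
  #define GLOBAL_MEM  __global
  #define GET_GID()   ((int)get_global_id(0))
#endif

KERNEL void axpy(GLOBAL_MEM float* y, GLOBAL_MEM const float* x, float a, int n) {
    int i = GET_GID();
    if (i < n) y[i] = a * x[i] + y[i];   // C-style body works in both dialects
}
```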
Note that all of this regards device-side, not host-side, code. For the host-side, I would like, at some point, to take the modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-api-wrappers/) and derive from them something which supports CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the free time to do this on my own, so if anyone is interested in collaborating on something like that, please drop me a line.
-----
You can find the OpenCL-that-is-also-CUDA mechanism at:
the real question here is whether anybody has gotten cheap, easily available AMD GPUs to run their AI workloads, and if we can predict more people will do so
I ported Karpathy's llm.c repo to AMD devices [1], and have trained GPT2 from scratch with 10B tokens of fineweb-edu on a 4x 7900XTX machine in just a few hours (about $2 worth of electricity) [2].
I've also trained the larger GPT2-XL model from scratch on bigger CDNA machines.
That's not important if the goal is to run existing CUDA code on AMD GPUs. All you have to do is write portable CUDA code in the future regardless of what Nvidia does if you want to keep writing CUDA.
I don't know the economics here, but if the AMD provides a significant cost saving, companies are going to make it work.
> Nvidia can always add things to make it difficult
Sounds like Microsoft embedding the browser in the OS. It's hard to see how doing something like that wouldn't trigger an antitrust case.
HIP works very similarly. Install rocm from your Linux distribution or from amd's repo, or build it from github.com/rocm. Has the nice feature of being pure userspace if you use the driver version that's already in your kernel.
How turn-key / happy an experience that is depends on how closely your system correlates with one of the documented/tested distro versions and what GPU you have. If it's one that doesn't have binary versions of rocblas etc in the binary blob, either build rocm from source or don't bother with rocblas.
OK, so I just stumbled on this problem: I tried out Whisper (from OpenAI), but only on my CPU, because I have no CUDA and the workarounds seem hacky. So the headline sounds good!
But can this help me directly? Or would OpenAI have to use this tool for me to benefit?
It is not immediately clear to me (but I am a beginner in this space).
It doesn't matter though. NVIDIA distributes tons of libraries built atop CUDA that you cannot distribute or use on AMD chips legally. Cutlass, CuBLAS, NCCL, etc.
SCALE doesn't use cuBlas and friends. For those APIs, it uses either its own implementations of the functions, or delegates to an existing AMD library (such as rocblas).
It wouldn't even be technically possible for SCALE to distribute and use cuBlas, since the source code is not available. I suppose you could maybe distribute cuBlas and run it through ZLUDA, but that would likely become legally troublesome.
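To illustrate the delegation idea (this is a hypothetical sketch, not SCALE's actual code), a cuBLAS-shaped sgemm entry point can simply forward to rocBLAS, since both expose the same column-major GEMM semantics. The header path and the assumption that the handle type maps directly to a rocblas_handle are mine:

```cuda
#include <rocblas/rocblas.h>   // header path varies across ROCm versions

// Assume the translation layer defines the cuBLAS handle in terms of a
// rocblas_handle; the forwarding is then essentially 1:1.
extern "C" int cublasSgemm_v2(rocblas_handle handle,
                              rocblas_operation transa, rocblas_operation transb,
                              int m, int n, int k,
                              const float* alpha,
                              const float* A, int lda,
                              const float* B, int ldb,
                              const float* beta,
                              float* C, int ldc) {
    rocblas_status s = rocblas_sgemm(handle, transa, transb, m, n, k,
                                     alpha, A, lda, B, ldb, beta, C, ldc);
    return (s == rocblas_status_success) ? 0 : 1;   // 0 == CUBLAS_STATUS_SUCCESS
}
```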
> SCALE doesn't use cuBlas and friends. For those APIs, it uses either its own implementations of the functions, or delegates to an existing AMD library (such as rocblas).
And this is the problem. I guarantee you NVIDIA has more engineers working on cuBLAS et al than AMD does.
The NVIDIA moat is not CUDA the language or CUDA the library. It's CUDA the ecosystem. That means things like all the high performance libraries; all the high performance libraries with clustering support (does AMD even have a clustering solution like NVLink -- everyone forgets that NVIDIA also does high speed networking); all the high perf appliances (everyone also forgets that NVIDIA sells entire systems, not GPUS); all the high perf servers (Triton inference server, etc). We can go on.
I commend the project volunteers for what they've done, but I would recommend getting VC money and competing directly with NVIDIA.
One question I always have about these sorts of translation layers is how they deal with the different warp sizes. I'd imagine a lot of CUDA code relies on 32-wide warps, while as far as I know AMD tends to have 64-wide warps. Is there some sort of emulation that needs to happen?
The older AMD GCN architecture had a 64-wide wavefront, but the newer "RDNA" GPUs support both 64- and 32-wide wavefronts, and this is configurable at runtime. It appears the narrower wavefronts are better suited to games in general.
Not sure what the situation is with "CDNA", the compute-oriented evolution of "GCN", i.e. whether CDNA is 64-wavefront only or dual like RDNA.
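To make the hazard concrete, here is a minimal sketch of the kind of code that bakes in the 32-lane assumption (note that the mask is also only 32 bits wide):

```cuda
// A warp reduction written against a hardcoded 32-lane warp. On a 64-wide
// wavefront, neither the loop bound nor the 32-bit full mask describes the
// whole "warp", so a translation layer has to decide how to map this.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);   // assumes width 32
    return v;
}
```

Using the built-in warpSize variable instead of the literal 32 helps, but plenty of existing CUDA hardcodes 32, which is presumably one reason a 32-wide wavefront mode (where the hardware offers it) is attractive for compatibility layers.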
The future is inference. Many inference stacks already support AMD although the kernels are less optimized. This will of course change over time, but if AMD can crack the inference demand, it will put NVDA under huge pressure.
It appears we implemented `--threads` but not `-t` for the compiler flag. Oops. In either case, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.
That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.
NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available
bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementation just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.
Hi, why do you believe that bfloat16 is not supported? Can you please provide some references (specifically the part about the hardware "doesn't do it")?
For the hardware you are focussing on (gfx11), the reference manual [2] and the list of LLVM gfx11 instructions supported [1] describe the bfloat16 vdot & WMMA operations, and these are in fact implemented and working in various software such as composable kernels and rocBLAS, which I have used (and can guarantee they are not simply being run as float). I've also used these in the AMD fork of llm.c [3]
Outside of gfx11, I have also used bfloat16 in CDNA2 & 3 devices, and they are working and being supported.
Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?
> Hi, why do you believe that bfloat16 is not supported?
Apologies, I appear to be talking nonsense. I conflated bfloat16 with nvidia's other wacky floating point formats. This is probably my cue to stop answering reddit/HN comments and go to bed. :D
So: ahem: bfloat16 support is basically just missing the fairly boring header.
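For context, the core of such a header is just bit manipulation, since bfloat16 is the top 16 bits of an IEEE float; a hedged host-side sketch of the conversion helpers (round-to-nearest-even, no special NaN handling; the real header would also need the arithmetic operators and device intrinsics):

```cuda
#include <cstdint>
#include <cstring>

static inline std::uint16_t float_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::uint32_t lsb = (bits >> 16) & 1u;   // ties-to-even adjustment
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}

static inline float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```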
> Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?
Pretty much that, yes. Not much point reimplementing all the math libraries when AMD is doing that part of the legwork already.
OK, so in the case of llm.c, if you're just including the HIP headers, using hipblasLt, etc, what would be the benefit of using scale instead of hipify?
Really, really, really curious as to how they managed to pull this off, if their project works as well as they claim it does. If stuff as complex as paged/flash attention can "just work", this is really cool.
My understanding from chatting with them is that tensor core operations aren't supported yet, so FlashAttention likely won't work. I think it's on their to-do list though!
Nvidia actually has more and more capable matrix multiplication units, so even with a translation layer I wouldn't expect the same performance until AMD produces better ML cards.
Additionally, these kernels usually have high sensitivity to cache and smem sizes, so they might need to be retuned.
So the only part that anyone actually cares about, as usual, is not supported. Same story as it was in 2012 with AMD vs Nvidia (and likely much before that too!). The more things change, the more they stay the same.
Cuda is a programming language. You implement it like any other. The docs are a bit sparse but not awful. Targeting amdgpu is probably about as difficult as targeting x64, mostly changes the compiler runtime.
The inline ptx implementation is notable for being even more annoying to deal with than the cuda, but it's just bytes in / different bytes out. No magic.
CUDA has a couple of extra problems beyond just any other programming language:
- CUDA is more than a language: it's a giant library (for both CPU and GPU) for interacting with the GPU, and for writing the GPU code. This needed reimplementing. At least for the device-side stuff we can implement it in CUDA, so when we add support for other GPU vendors the code can (mostly) just be recompiled and work there :D.
- CUDA (the language) is not actually specified. It is, informally, "whatever nvcc does". This differs significantly from what Clang's CUDA support does (which is ultimately what the HIP compiler is derived from).
The openmp device runtime library was originally written in cuda. I ported that to hip for amdgpu, discovered the upstream hip compiler wasn't quite as solid as advertised, then ported it to openmp with some compiler intrinsics. The languages are all essentially C++ syntax with some spurious noise obfuscating llvm IR. The libc effort has gone with freestanding c++ based on that experience, and we've now mostly fixed the ways that goes wrong.
You might also find raw c++ for device libraries saner to deal with than cuda. In particular you don't need to jury rig the thing to not spuriously embed the GPU code in x64 elf objects and/or pull the binaries apart. Though if you're feeding the same device libraries to nvcc with #ifdef around the divergence your hands are tied.
> You might also find raw c++ for device libraries saner to deal with than cuda.
Actually, we just compile all the device libraries to LLVM bitcode and be done with it. Then we can write them using all the clang-dialect, not-nvcc-emulating, C++23 we feel like, and it'll still work when someone imports them into their c++98 CUDA project from hell. :D
Wondering if there's an ongoing effort to do the same with MPS/Metal as a backend. If anything given how many developers are on macs I think it could get immense traction.
Very clearly the business motive makes sense: go after Nvidia's GPU monopoly. Can someone help a layperson understand the pitfalls here that prevent this from being an intelligent venture?
It's technically non-trivial and deeply irritating to implement in places as people expect bugward compatibility with cuda.
Also nvidia might savage you with lawyers for threatening their revenue stream. Big companies can kill small ones by strangling them in the courts then paying the fine when they lose a decade later.
Heh, true. On the other hand, I bet companies are eager to challenge the wrath of a $3T company for a promise of "maybe it'll work, not all of it but at least it'll run worse, at least for now".
I don't think the terms of the Nvidia SDK can restrict running software without said SDK. Nvidia's libraries don't seem to be involved here. Their hardware isn't involved either. It's just some ascii in a bunch of text files being hacked around with before running on someone else's hardware.
People can be wildly hostile to changing their programs. The people who wrote it aren't here any more, the program was validated as-is, changing it tends to stop the magic thing working and so forth.
That changing the compiler is strongly equivalent to changing the source doesn't necessarily influence this pattern of thinking. Customer requests to keep the performance gains from a new compiler but not change the UB they were relying on with the old are definitely a thing.
Whatever IP related issues they’d want to sue over. Sorry I don’t know specifics about what this would specifically infringe but I’m sure expensive legal brains could come up with something
It works exactly as well as other AMDGPU-related software (HIP etc.) works inside Docker.
There are some delightful AMD driver issues that make certain models of GPU intermittently freeze the kernel when used from docker. That was great fun when building SCALE's CI system :D.