The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for all Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].
The problem is that the specs of AMD consumer-grade GPUs do not translate to compute performance when you try to chain more than one together.
I have 7 NVidia 4090s under my desk happily chugging along on week long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.
The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable released afterward by either AMD or Nvidia crippled realistic FP64 throughput to below what an AVX-512 many-core CPU can do.
On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to a 1:2 vs. 1:64 FP64:FP32 ratio).
Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will have about the same speed, regardless of what kind of computations are performed. Being limited by memory bandwidth is said to happen frequently in ML/AI inferencing.
Even the single precision given by the previous poster is seldom used for inference or training.
Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the data with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same; otherwise people may believe that progress in GPUs over five years has been much greater than it really is.
Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.
You might find the journey of Tinycorp's Tinybox interesting. It's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter, plus other info in George Hotz's livestreams.
EPYC + Supermicro + C-Payne retimers/cabling. 208-240V power typically mandatory for the most affordable power supplies (chain a server/crypto PSU for the GPUs from ParallelMiner to an ATX PSU for general use).
The ASRock Rack ROMED8-2T has seven PCIe x16 slots. They're too close together to directly put seven 4090s on the board, but you'd just need some riser cables to mount the cards on a frame.
It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.
I have come across quite a few startups who are trying a similar idea: break the Nvidia monopoly by utilizing AMD GPUs (for inference at least): Felafax, Lamini, tensorwave (partially), SlashML. Even saw optimistic claims from some of them that the CUDA moat is only 18 months deep [1]. Let's see.
AMD GPUs are becoming a serious contender for LLM inference. vLLM is already showing impressive performance on AMD [1], even with consumer-grade Radeon cards (with GGUF support, even) [2]. This could be a game-changer for folks who want to run LLMs without shelling out for expensive NVIDIA hardware.
AMD decided not to release a high-end GPU this cycle, so any investment in 7x00 or 6x00 cards is going to be wasted: the Nvidia 5x00 series is likely going to destroy any ROI from the older cards, and AMD won't have an answer for at least two years, possibly never, given their absence from high-end consumer GPUs usable for compute.
Peculiar business model, at a glance. It seems like they're doing work that AMD ought to be doing, and is probably doing behind the scenes. Who is the customer for a third-party GPU driver shim?
I've recently been poking around with Intel oneAPI and IPEX-LLM. While there are things that I find refreshing (like their ability to actually respond to bug reports in a timely manner, or at all), on the whole, support/maturity actually doesn't match the current state of ROCm.
PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work - both the source and the Docker image failed to build/run for me. The IPEX-LLM whisper support is completely borked, etc, etc.
I've recently been trying to get IPEX working as well, apparently picking Ubuntu 24.04 was a mistake, because while things compile, everything fails at runtime. I've tried native, docker, different oneAPI versions, threw away a solid week of afternoons for nothing.
SYCL with llama.cpp is great though, at least at FP16 (since it supports nothing else); even Arc iGPUs easily give 2-4x the performance of CPU inference.
Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.
My testing has been w/ a Lunar Lake Core 258V chip (Xe2 - Arc 140V) on Arch Linux. It sounds like you've tried a lot of things already, but in case it helps, my notes for installing llama.cpp and PyTorch: https://llm-tracker.info/howto/Intel-GPUs
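One quick sanity check I run before anything else (a sketch assuming IPEX is installed and registering the xpu device; exact API names can shift a bit between releases):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

    if torch.xpu.is_available():
        print("Device:", torch.xpu.get_device_name(0))
        x = torch.randn(1024, 1024, device="xpu", dtype=torch.float16)
        print("Matmul OK:", (x @ x).shape)  # a failure here usually means a runtime/driver mismatch
    else:
        print("No xpu device visible - check oneAPI runtime / driver versions")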
I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128), so it's worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than the Vulkan and CPU backends, though.
As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if that's not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Well that's funny, I think we already spoke on Reddit. I'm the guy who was testing the 125H recently. I guess there's like 5 of us who have intel hardware in total and we keep running into each other :P
Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.
I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn ollama API or llama-cpp-python these days, and I'm a fan of neither since it's just another layer of headaches to get those compiled with SYCL.
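For what it's worth, when a project does accept an OpenAI-style base URL, pointing it straight at llama-server has worked fine for me. Roughly (assuming the SYCL build of llama-server is already running on its default port 8080):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # llama-server ignores the key
    resp = client.chat.completions.create(
        model="local",  # llama-server serves whatever model it was started with
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)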
> I found a lot of stuff in that repo that depend on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Yeah well the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it for months makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.
Fwiw, on 22.04 I can use the current kernel but otherwise follow Intel's instructions, and the stuff works (old as it is now). I'm currently trying to figure out the best way to finetune Qwen 2.5 3B; the old axolotl ain't up to it. Not sure if I'm gonna work on a fork of axolotl or try something else at this point.
Big agree on Intel working on SYCL. I've run millions of tasks through SYCL llama.cpp at this point, and though SYCL reliably does 5-6x the prompt processing speed of the Vulkan builds, current Vulkan builds are now up to 50% faster at token generation than SYCL on my Intel GPU.
More cynical take: this would be a bad strategy, because Intel hasn't shown much competence in its leadership for a long time, especially in regards to GPUs.
The B580 being a "success" is purely a business decision as a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, and they are selling it at a lower price.
That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.
It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.
I was reading this whole thread as about technical accomplishment and non-nvidia GPU capabilities, not business. So I think you're talking about different definitions of "Success". Definitely counts, but not what I was reading.
Is it a loss leader? I looked up the price of 16Gbit GDDR6 ICs the other day at dramexchange and the cost of 12GB is $48. Using the gamer nexus die measurements, we can calculate that they get at least 214 dies per wafer. At $12095 per wafer, which is reportedly the price at TSMC for 5nm wafers in 2025, that is $57 per die.
While defects ordinarily reduce yields, Intel put plenty of redundant transistors into the silicon. This is ordinarily not possible to estimate, but Tom Petersen reported in his interview with hardware unboxed that they did not count those when reporting the transistor count. Given that the density based on reported transistors is about 40% less than the density others get from the same process and the silicon in GPUs is already fairly redundant, they likely have a backup component for just about everything on the die. The consequence is that they should be able to use at least 99% of those dies even after tossing unusable dies, such that the $57 per die figure is likely correct.
As for the rest of the card, there is not much in it that would not be part of the price of an $80 Asrock motherboard. The main thing would be the bundled game, which they likely can get in bulk at around $5 per copy. This seems reasonable given how much Epic games pays for their giveaways:
That brings the total cost to $190. If we assume Asrock and the retailer both have a 10% margin on the $80 motherboard used as a substitute for the costs of the rest of the things, then it is $174. Then we need to add margins for board partners and the retailers. If we assume they both get 10% of the $250, then that leaves a $26 profit for Intel, provided that they have economies of scale such that the $80 motherboard approximation for the rest of the cost of the graphics card is accurate.
That is about a 10% margin for Intel. That is not a huge margin, but provided enough sales volume (to match the sales volume Asrock gets on their $80 motherboards), Intel should turn a profit on these versus not selling these at all. Interestingly, their board partners are not able/willing to hit the $250 MSRP and the closest they come to it is $260 so Intel is likely not sharing very much with them.
It should be noted that Tom Petersen claimed during his hardware unboxed interview that they were not making money on these. However, that predated the B580 being a hit and likely relied on expected low production volumes due to low sales projections. Since the B580 is a hit and napkin math says it is profitable as long as they build enough of them, I imagine that they are ramping production to meet demand and reach profitability.
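For anyone who wants to poke at the napkin math themselves, roughly (same assumptions as above; every input is an estimate):

    wafer_price = 12095                      # reported TSMC 5nm wafer price for 2025
    dies_per_wafer = 214                     # lower bound from the Gamers Nexus die measurements
    die_cost = wafer_price / dies_per_wafer  # ~$57

    gddr6_cost = 48                          # 12 GB of 16 Gbit GDDR6 at dramexchange pricing
    rest_of_card = 80 - 2 * 0.10 * 80        # $80 motherboard proxy minus Asrock's and the retailer's 10% margins
    game_bundle = 5                          # bundled game, bulk pricing guess

    cost = die_cost + gddr6_cost + rest_of_card + game_bundle   # ~$174
    msrp = 250
    partner_and_retail_cut = 2 * 0.10 * msrp                    # 10% each for board partner and retailer
    intel_profit = msrp - partner_and_retail_cut - cost         # ~$26
    print(f"cost ~${cost:.0f}, Intel profit ~${intel_profit:.0f} ({intel_profit / msrp:.0%} of MSRP)")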
That's just BOM. When you factor in R&D they are clearly still losing money on B580. There's no way they can recoup R&D this generation with a 10% gross margin.
Still, that's to be expected considering this is still only the second generation of Arc. If they can break even on the next gen, that would be an accomplishment.
To be fair, the R&D is shared with Intel’s integrated graphics as they use the same IP blocks, so they really only need to recoup the R&D that was needed to turn that into a discrete GPU. I do not know how much that is to make any definitive statements, but I can speculate that if it is $50 million and they sell 10 million of these, they more than recoup it. Even if they fail to recoup their R&D funds, they would be losing more money by not selling these at all, since no sales means 0 dollars of R&D would be recouped.
I don’t know if this matters but while the B580 has a die comparable in size to a 4070 (~280mm^2), it has about half the transistors (~17-18 billion), iirc.
Tom Petersen said in a hardware unboxed video that they only reported “active” transistors, such that there are more transistors in the B580 than what they reported. I do not think this is the correct way to report them since one, TSMC counts all transistors when reporting the density of their process and two, Intel is unlikely to reduce the reported transistor count for the B570, which will certainly have fewer active transistors.
That said, the 4070 die is 294mm^2 while the B580 die is 272mm^2.
Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.
I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.
MLID on Intel is starting to become the same as UserBenchmark on AMD (except for the generally reputable sources)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.
The margins might be describable as razor thin, but they are there. Whether it can recoup the R&D that they spent designing it is hard to say definitively since I do not have numbers for their R&D costs. However, their iGPUs share the same IP blocks, so the iGPUs should be able to recoup the R&D costs that they have in common with the discrete version. Presumably, Intel can recoup the costs specific to the discrete version if they sell enough discrete cards.
While this is not a great picture, it is not terrible either. As long as Intel keeps improving its graphics technology with each generation, profitability should gradually improve. Although I have no insider knowledge, I noticed a few things that they could change to improve their profitability in the next generation:
* Tom Petersen made a big deal about 16-lane SIMD in Battlemage being what games want rather than the 8-lane SIMD in Alchemist. However, that is not quite true, since both Nvidia and AMD graphics use 32-lane SIMD. If the number of lanes really matters, and I certainly can see how it would if game shaders have horizontal operations, then a switch to 32-lane SIMD should yield further improvements.
* Tom Petersen said in his interview with Hardware Unboxed that Intel reported the active transistor count for the B580 rather than the total transistor count. This is contrary to others, who report the total transistor count (as evidenced by their density figures being close to what TSMC claims the process can do). Tom Petersen also stated that they would not necessarily be forced by defects to turn dies into B570 cards. This suggests to me that they have substantial redundant logic in the GPU to prevent defects from rendering chips unusable, and that logic is intended to be disabled in production. GPUs are already highly redundant. They could drop much of the planned dark silicon and let defects force a larger percentage of the dies to be usable only as cut-down models.
I could have read too much into things that Tom Petersen said. Then again, he did say that their design team is conservative and the doubling rather than quadrupling of the SIMD lane count and the sheer amount of dark silicon (>40% of the die by my calculation) spent on what should be redundant components strike me as conservative design choices. Hopefully the next generation addresses these things.
Also, they really do have >40% dark silicon when doing density comparisons:
They have 41% less density than Nvidia and 48% less density than what TSMC claims the process can obtain. We also know from Tom Petersen's comments that there are additional transistors on the die that are not active. Presumably, they are for redundancy; otherwise, there really is no sane explanation that I can see for so much dark silicon. If they are using transistors that are twice the size, as the density figure might be interpreted to suggest, they might as well have used TSMC's 7nm process: a smaller process can etch larger features, but doing so is a waste of money.
Note that we can rule out the cache lowering the density. The L1 + L2 cache on the 4070 Ti is 79872 KB while it is 59392 KB on the B580. We can also rule out IO logic as lowering the density, as the 4070 Ti has a 256-bit memory bus while the B580 has a 192-bit memory bus.
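For reference, here is the density arithmetic, using the inputs that reproduce the percentages above (these are my assumptions: ~19.6 billion reported transistors on the 272 mm^2 B580 die, ~35.8 billion on the ~294 mm^2 AD104, and ~138 MTr/mm^2 as the process density figure implied by the 48% claim):

    b580_transistors, b580_area = 19.6e9, 272        # reported count, die area in mm^2
    ad104_transistors, ad104_area = 35.8e9, 294      # 4070 Ti die
    process_density = 138e6                          # transistors/mm^2 implied by the 48% figure

    b580_density = b580_transistors / b580_area      # ~72 MTr/mm^2
    ad104_density = ad104_transistors / ad104_area   # ~122 MTr/mm^2

    print(f"vs Nvidia:  {1 - b580_density / ad104_density:.0%} lower")   # ~41%
    print(f"vs process: {1 - b580_density / process_density:.0%} lower") # ~48%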
> Tom Petersen made a big deal about 16-lane SIMD in Battlemage [...]
Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.
And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.
For context, Alchemist was SIMD8 in Intel’s terminology. They made a big deal out of this at the alchemist launch if I recall correctly since they thought it would be more efficient. Unfortunately, it turned out to not be more efficient.
Anyway, Tom Petersen did a bunch of interviews before the Intel B580 launch. In the hardware unboxed interview, he mentioned it, but accidentally misspoke. I must have interpreted his misspeak as meaning games want SIMD16 and noted it that way in my mind, as what he says elsewhere seems to suggest that games want SIMD16. It was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:
> We also have native SIMD support, SIMD16 native support, which is going to say that you don't have to like recode your compute shader to match a particular topology. You can use the one that you use for everyone else, and it'll just run well on ARC. So I'm pretty excited about that.
In an interview with gamers nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:
At the start of the Gamers Nexus video, Steve mentions that Tom's slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.
As for 32-lane SIMD being an improvement over 16 lanes: while I do not write shaders, I have written CUDA kernels, and in CUDA kernels you sometimes need to do what Nvidia calls a parallel reduction across lanes (Intel's CPU division calls them horizontal operations). For example, you might need to sum across all lanes in order to calculate an average. When you have native 32-lane SIMD, you can do this without going to shared memory, which is extremely fast. If you need to emulate a higher lane width, you need to do a trip through shared memory, which is not as fast. If game shaders are written with the assumption that 32-lane SIMD is used, then having 32-lane SIMD is going to be more performant for them. Intel's slide attributes a 0.3 ms reduction in render time to 16-lane SIMD, and they likely would see a further reduction with 32-lane SIMD, since that is what games should actually target, as that is what both AMD (since RDNA 1) and Nvidia (since Turing) use.
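If it helps, here is a toy sketch (in Python, not a real shader) of the structural difference; the values are made up and it only illustrates why emulating a wider reduction needs an extra round trip:

    import numpy as np

    values = np.arange(32, dtype=np.float32)       # one wavefront's worth of lanes

    # 32-lane SIMD: the whole reduction happens lane-to-lane in registers
    total_native = values.sum()

    # 16-lane SIMD emulating a 32-wide reduction: each half reduces in registers,
    # then the partial sums have to meet through shared memory (modeled as an array here)
    partials = values.reshape(2, 16).sum(axis=1)   # two independent 16-lane reductions
    shared_memory = partials.copy()                # the extra store/load the emulation needs
    total_emulated = shared_memory.sum()

    assert total_native == total_emulated          # same answer, but the emulated path touches memory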
The die size of the B580 is 272 mm2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.
272 mm2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.
Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be. Availability issues, lack of competition, other more lucrative avenues etc.
Intel has none of those, or at least not to the same extent.
At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.
The only way this would be at a loss is if they refuse to raise production to meet demand. That said, I believe their margins on these are unusually low for the industry. They might even fall into razor thin territory.
Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.
That being said, the IP blocks are shared by their iGPUs, so the discrete GPUs do not need to recoup the costs of most of the R&D, as it would have been done anyway for the iGPUs.
That guy’s reasoning is faulty. To start, he has made math mistakes in every video that he has posted recently involving math. To give 3 recent examples:
At 10m3s in the following video, he claims to add a 60% margin by multiplying by 1.6, but in reality that adds a 37.5% margin; he needed to multiply by 2.5 to add a 60% margin. This can be calculated via Cost Scaling Factor = 1 / (1 - Normalized Profit Margin):
At 48m13s in the following video, he claims that Intel’s B580 is 80% worse than Nvidia’s hardware. He took the 4070 Ti as being 82% better than the 2080 SUPER, assumed based on leaks from his reviewer friends that the B580 was about at the performance of the 2080 SUPER and then claimed that the B580 would be around 80% worse than the 4070 Ti. Unfortunately for him, that is 45% worse, not 80% worse. His chart is from Techpowerup and if he had taken the time to do some math (1 - 1/(1 + 0.82) ~ 0.45), or clicked to the 2080 SUPER page, he would have seen it has 55% of the performance of the 4070 Ti, which is 45% worse:
At 1m2s in the following video, he makes a similar math mistake by saying that the B580 has 8% better price/performance than the RTX 3060 when in fact it is 9% better. He mistakenly equated the RTX 3060 being 8% worse than the B580 with the B580 being 8% better, but math does not work that way. Luckily for him, the math error is small here, but he still failed to do the math correctly, and his reasoning grows increasingly faulty with the scale of his math errors. What he should have done to get the correct normalized factor is:
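The arithmetic, for anyone who wants to check it:

    # 1) To add a 60% margin (margin as a fraction of the selling price),
    #    divide cost by (1 - margin); multiplying by 1.6 only yields a 37.5% margin.
    cost = 100.0
    print(cost / (1 - 0.60))        # 250.0 -> the 2.5x factor
    print(1 - cost / (cost * 1.6))  # 0.375 -> what multiplying by 1.6 actually gives

    # 2) If the 4070 Ti is 82% faster than the 2080 SUPER (~B580 per the leaks he used),
    #    the B580 is ~45% slower than the 4070 Ti, not 80%.
    print(1 - 1 / 1.82)             # ~0.45

    # 3) If the RTX 3060 has 8% worse price/performance than the B580,
    #    the B580 is ~9% better, not 8%.
    print(1 / (1 - 0.08) - 1)       # ~0.087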
He not only fails at mathematical reasoning, but also lacks a basic understanding of how hardware manufacturing works. He said that if Intel loses $20 per card at low production volumes, then making 10 million cards will result in a $200 million loss. In reality, things become cheaper due to economies of scale, and simple napkin math shows that they can turn a profit on these cards:
His behavior is consistent with being on a vendetta rather than being a technology journalist. For example, at 55m13s in the following video, he puts words in Tom Petersen's mouth and then, with a malicious smile on his face, cheers while claiming that Tom Petersen declared discrete ARC cards to be dead, when Tom Petersen said nothing of the kind. Earlier in the same video, at around 44m14s, he calls Tom Petersen a professional liar. However, he sees no problem expecting people to believe words he shoved into the "liar's" mouth:
If you scrutinize his replies to criticism in his comments section, you would see he is dodging criticism of the actual issues with his coverage while saying “I was right about <insert thing completely unrelated to the complaint here>” or “facts don’t care about your feelings”. You would also notice that he is copy and pasting the same statements rather than writing replies addressing the details of the complaints. To be clear, I am paraphrasing in those two quotes.
He also shows contempt for his viewers that object to his behavior in the following video around 18m53s where he calls them “corporate cheerleaders”:
In short, Tom at MLID is unable to do mathematical reasoning, does not understand how hardware manufacturing works, has a clear vendetta against Intel’s discrete graphics, is unable to take constructive criticism and lashes out at those who try to tell him when he is wrong. I suggest being skeptical of anything he says about Intel’s graphics division.
Interesting. I wonder if focusing on GPUs and CPUs is something that requires two companies instead of one, whether the concentration of resources just leads to one arm of your company being much better than the other.
> Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
From their announcement on 20241219[^0]:
"We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."
From 20241211[^1]:
"We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.
We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."
Is there no hope for AMD anymore? After George Hotz/Tinygrad gave up on AMD I feel there’s no realistic chance of using their chips to break the CUDA dominance.
Maybe from Modular (the company Chris Lattner is working for). In this recent announcement they said they had achieved competitive ML performance… on NVIDIA GPUs, but with their own custom stack completely replacing CUDA. And they’re targeting AMD next.
Quite frankly, I have difficulty reconciling a lot of comments here with that, and my own experience as an AMD GPU user (although not for compute, and not on Windows).
tl;dr there's a not-insubstantial number of people who learn a lot from geohot. I'd say about 3% of people here would be confused if you thought of him as less than a top technical expert across many comp sci fields.
And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.
He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.
So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you'd read it as the CEO breezily admitting in public that geohot was right about how there was malfeasance, followed by a cover up, implying extreme dereliction of duty, because she either helped or didn't realize till now.
I'd argue this is partially due to stonk-ification of discussions, there was a vague, yet often communicated, sense there was something illegal happening. Idea was it was financial dereliction of duty to shareholders.
IMO the hope shouldn't be that AMD specifically wins, rather it's best for consumers that hardware becomes commoditized and prices come down.
And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.
In CPUs, AMD has made many innovations that were copied by Intel only after many years, and this delay contributed significantly to Intel's downfall.
The most important has been that AMD correctly predicted that big monolithic CPUs would no longer be feasible in future CMOS fabrication technologies, so they designed the Zen family from the beginning with a chiplet-based architecture. Intel attempted to ridicule them, but after losing many billions they were forced to copy this strategy.
Also, in the microarchitecture of their CPUs, AMD has made the right choices since the beginning and then improved it constantly with each generation. The result is that the latest Intel big core, Lion Cove, now has a microarchitecture that is much more similar to AMD's Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.
In the distant past, AMD also introduced a lot of innovations long before they were copied by Intel. It is true that those had not been invented by AMD, but had themselves been copied by AMD from more expensive CPUs, like the DEC Alpha, Cray, or IBM POWER; still, Intel copied them only after being forced to by the competition with AMD.
Everything is comparative. AMD isn't perfect. As an ex-shareholder, I have argued they did well partly because of Intel's downfall. In terms of execution they are far from perfect.
But Nvidia is a different beast. It is a bit like Apple in the late 00s: business, forecasting, marketing, operations, software, hardware, sales, etc. Take any part of it and they are all industry leading. And having industry-leading capability is only part of the game; having it all work together is completely another thing. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.
Most people underestimate the magnitude of the task required (I like to tell the story of an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of an organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why most people's estimates are often off by an order of magnitude.
We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
By comparison if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.
They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.
Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.
What effect did the DOJ have on MS in the 90s? Didn't all of that get rolled back before they had to pay a dime, and all it amounted to was that browser choice screen that was around for a while? Hardly a crippling blow. If anything that showed the weakness of regulators in fights against big tech, just outlast them and you're fine.
>I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.
Also, I'd take HN as being an amazing platform for the overall consistency and quality of moderation. Anything beyond that depends more on who you're talking to than where.
Oh, there's basically no chance of getting that on the Internet.
The Internet is a machine that highly simplifies the otherwise complex technical challenge of wide-casting ignorance. It wide-casts wisdom too, but it's an exercise for the reader to distinguish them.
Everyone who's dug deep into what AMD is doing has left in disgust if they are lucky, and bankruptcy if they are not.
If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
> If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
This seems like unuseful advice if you've already given up on them.
You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.
Meanwhile the people who attempt it apparently seem to get acquired by Nvidia, for some strange reason. Which implies it should be a worthwhile thing to do. If they've fixed it by now which you wouldn't know if you've stopped looking, or they fix it in the near future, you have a competitive advantage because you have access to lower cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you get a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.
I've seen people try it every six months for two decades now.
At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.
Here's hoping China and RISC-V save us.
>Meanwhile the people who attempt it apparently seem to get acquired by Nvidia
Everyone I've seen base jumping has gotten a sponsorship from Red Bull; ergo, everyone should base jump.
> At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
AMD has always punched above their weight. Historically their problem was that they were the much smaller company and under heavy resource constraints.
Around the turn of the century the Athlon was faster than the Pentium III and then they made x86 64-bit when Intel was trying to screw everyone with Itanic. But the Pentium 4 was a marketing-optimized design that maximized clock speed at the expense of heat and performance per clock. Intel was outselling them even though the Athlon 64 was at least as good if not better. The Pentium 4 was rubbish for laptops because of the heat problems, so Intel eventually had to design a separate chip for that, but they also had the resources to do it.
That was the point at which AMD made their biggest mistake. When they set out to design their next chip, the competition was the Pentium 4, so they made a power-hungry monster designed to hit high clock speeds at the expense of performance per clock. But the reason more people didn't buy the Athlon 64 wasn't that they couldn't figure out that a 2.4GHz CPU could be faster than a 2.8GHz CPU; it was all the anti-competitive shenanigans Intel was doing behind closed doors to e.g. keep PC OEMs from featuring systems with AMD CPUs. Meanwhile, by then Intel had figured out that the Pentium 4 was, in fact, a bad design, when their own Pentium M laptops started outperforming the Pentium 4 desktops. So the Pentium 4 line got canceled, and Bulldozer eventually had to go up against the Pentium M-derived Core line, which nearly bankrupted AMD and compromised their ability to fund the R&D needed to sustain state-of-the-art fabs.
Since then they've been climbing back out of the hole but it wasn't until Ryzen in 2017 that you could safely conclude they weren't on the verge of bankruptcy, and even then they were saddled with a lot of debt and contracts requiring them to use the uncompetitive Global Foundries fabs for several years. It wasn't until Zen4 in 2022 that they finally got to switch the whole package to TSMC.
So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.
Have you tried compute shaders instead of that weird HPC-only stuff?
Compute shaders are widely used by millions of gamers every day. GPU vendors have a huge incentive to make them reliable and efficient: modern game engines use them for lots of things; e.g., UE5 can even render triangle meshes with GPU compute instead of the graphics pipeline (the tech is called Nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml
I'd be very concerned if somebody makes a $100K decision based on a comment where the author couldn't even differentiate between the words "constitutionally" and "institutionally", while providing as much substance as any other random techbro on any random forum and being overwhelmingly oblivious to it.
I have been playing around with Phi-4 Q6 on my 7950x and 7900XT (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU alone - in practical terms it beats hosted models due to the roundtrip time. Obviously perf is more important if you're hosting this stuff, but we've definitely reached AMD usability at home.
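For the PyTorch route, the trick looks roughly like this (just a sketch; the override value shown is the one commonly used for RDNA3 cards, adjust for your GPU):

    import os
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")  # illustrative value; set before torch loads the ROCm runtime

    import torch  # ROCm builds of PyTorch reuse the torch.cuda namespace

    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))          # should report the Radeon card
        x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
        print((x @ x).norm())                         # smoke test that kernels actually run
    else:
        print("ROCm device not visible - check the override / ROCm install")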
It’s not terribly hard to port ML inference to alternative GPU APIs. I did it for D3D11 and the performance is pretty good too: https://github.com/Const-me/Cgml
The only catch is, for some reason developers of ML libraries like PyTorch aren’t interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones i.e. CUDA and to lesser extent ROCm. I don’t know why that is.
D3D-based videogames have been heavily using GPU compute for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.
Isn't part of it because the first-party libraries like cuDNN are only available through CUDA? Nvidia has poured a ton of effort into tuning those libraries so it's hard to justify not using them.
Unlike training, ML inference is almost always bound by memory bandwidth as opposed to computations. For this reason, tensor cores, cuDNN, and other advanced shenanigans make very little sense for the use case.
OTOH, general-purpose compute instead of fixed-function blocks used by cuDNN enables custom compression algorithms for these weights which does help, by saving memory bandwidth. For example, I did custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL codes: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
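To illustrate the idea (this is not the BCML1 codec from the repo above, just a toy absmax block-quantization sketch; a real kernel would bit-pack the values and dequantize on the fly):

    import numpy as np

    def quantize_block(w, bits=5):
        """Quantize a 1-D block of weights to signed integers plus one FP16 scale."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)  # int8 for clarity only
        return q, np.float16(scale)

    def dequantize_block(q, scale):
        return q.astype(np.float32) * np.float32(scale)

    w = np.random.randn(64).astype(np.float32)      # one 64-weight block
    q, scale = quantize_block(w, bits=5)
    print("max abs error:", np.abs(w - dequantize_block(q, scale)).max())
    # Stored size: 64 * 5 bits + a 16-bit scale ~= 336 bits vs 64 * 16 = 1024 bits for FP16,
    # i.e. roughly 3x less memory traffic per block.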
Only local (read: batch size 1) ML inference is memory bound; production loads are pretty much compute bound. The prefill phase is very compute bound, and with continuous batching the generation phase gets mixed with prefill, which makes the whole process compute bound too. So no, tensor cores and all the other shenanigans are absolutely critical for a performant inference infrastructure.
PyTorch is a project by Linux foundation. The about page with the mission of the foundation contains phrases like “empowering generations of open source innovators”, “democratize code”, and “removing barriers to adoption”.
I would argue running local inference with batch size=1 is more useful for empowering innovators compared to running production loads on shared servers owned by companies. Local inference increases count of potential innovators by orders of magnitude.
BTW, in the long run it may also benefit these companies because in theory, an easy migration path from CUDA puts a downward pressure on nVidia’s prices.
Most people running local inference do so through quants with llama.cpp (which runs on everything) or awq/exl2/mlx with vLLM/tabbyAPI/lmstudio, which are much faster than using PyTorch directly.
llama.cpp has a much bigger supported model list, as does vLLM and of course PyTorch/HF transformers covers everything else, all of which work w/ ROCm on RDNA3 w/o too much fuss these days.
For inference, the biggest caveat is that Flash Attention is only an aotriton implementation, which besides sometimes being less performant, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.
(Note the recent SemiAccurate piece was on training, which I'd agree is in a much worse state - I have personal experience with it being often broken for even the simplest distributed training runs. Funnily enough, if you're running simple fine tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090; 4090s blow both of those away, but then you'll probably want more cards and VRAM, or to just move to H100s.)
Great. I have yet to understand why the ML community doesn't really push to move away from CUDA. To me, it feels like a dinosaur move to build on top of CUDA, which screams proprietary; nothing about it is open source or cross-platform.
The reason I say it's a dinosaur move is: imagine if we as a dev community had continued to build on top of Flash or Microsoft Silverlight...
LLMs and ML have been around for quite a while; with the pace of AI/LLM advancement, the transition to cross-platform should have happened much more quickly. But it hasn't yet, and it's not clear when it will.
Building a translation layer on top of CUDA is not the answer to this problem either.
For me personally, hacking together projects as a hobbyist, two reasons:
1. It just works. When I tried to build things on Intel Arcs, I spent way more hours fighting with IPEX and driver issues than developing.
2. LLMs seem to have more CUDA code in their training data. I can leverage Claude and 4o to help me build things with CUDA, but trying to get them to help me do the same things on IPEX just doesn't work.
I'd very much love a translation layer for CUDA, like a dxvk or Wine equivalent.
Would save a lot of money since Arc gpus are in the bargain bin and nvidia cloud servers are double the price of AMD.
As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.
If CUDA counts as “just works”, I dread to see the dark, unholy rituals you need to invoke to get ROCm to work. I have spent too many hours browsing the Nvidia forums for obscure error codes and driver messages to ever consider updating my CUDA install and every time I reboot my desktop for an update I dread having to do it all over again.
Except I never hear complaints about CUDA from a quality perspective. The complaints are always about lock in to the best GPUs on the market. The desire to shift away is to make cheaper hardware with inferior software quality more usable. Flash was an abomination, CUDA is not.
Flash was popular because it was an attractive platform for the developer. Back then there was no HTML5 and browsers didn't otherwise support a lot of the things Flash did. Flash Player was an abomination, it was crashy and full of security vulnerabilities, but that was a problem for the user rather than the developer and it was the developer choosing what to use to make the site.
This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?
There are far more people running llama.cpp, various image generators, etc. than there are people developing that code. Even when the "users" are corporate entities, they're not necessarily doing any development in excess of integrating the existing code with their other systems.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of the users that can actually do this for a mass market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to be pushing that hard and despite the price they do a lot of volume, and then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about, the cost of which isn't actually that high, e.g. 40GB of GDDR6 is less than $100.
The cuda situation is definitely better. The nvidia struggles are now with the higher-level software they’re pushing (triton, tensor-llm, riva, etc), tools that are the most performant option when they work, but a garbage developer experience when you step outside the golden path
I want to double down on this statement, and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on Arm hardware. One might presume Nvidia would give attention to an architecture they develop for, but the way forward is not easy. For some versions of Ubuntu, you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.
I think you've mixed up your Tritons; I'm talking about Triton Inference Server from NVIDIA, while you're talking about Triton, the CUDA replacement from OpenAI.
I believe these efforts are very important. If we want this stuff to be practical we are going to have to work on efficiency. Price efficiency is good. Power and compute efficiency would be better.
I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions, but it's interesting. I need to look at llamafile next.
Reality check for anyone considering this: I just got a used 3090 for $900 last month. It works great.
I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.
I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.
All that said, this advice will expire in a month or two when the 5090 comes out.
I've bought 5 used and they're all perfect. But that's what buyer protection on ebay is for. Had to send back an Epyc mobo with bent pins and ebay handled it fine.
I bought a used 3090 last year for ML, and while it works fine and has the correct DRAM and stuff, when I tried gaming on it I noticed that it is significantly slower than my 3080. I'm not sure if the seller pulled some shenanigans on me or the card actually degraded during whatever mining they did.
Just beware, the card might be "working fine" on a first glance, but actually be damaged.
Modular claims that it achieves 93% GPU utilization on AMD GPUs [1]; the official preview release is coming early next year, so we'll see. I must say I'm bullish because of the feedback I've seen people give about the performance on Nvidia GPUs.
Just an FYI, this is a writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.
That being said, I did some very recent inference testing on a W7900 (using the same testing methodology used by Embedded LLM's recent post to compare to vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed about 35% faster than llama.cpp w/ Q4_K_M on their ROCm/HIP backend (4.30GB weights, a 2% size difference).
That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also llama.cpp's (recently added to llama-server) speculative decoding can also give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
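For intuition on why the draft model helps, the generic speculative-decoding math looks roughly like this (the timings below are made up for illustration, not measurements, and llama.cpp's actual scheduler differs in detail):

    def spec_decode_speedup(accept_rate, k, t_target, t_draft):
        # expected tokens emitted per verification round, assuming i.i.d. acceptance:
        # the accepted draft prefix plus one token from the target's verify pass
        tokens_per_round = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
        time_per_round = k * t_draft + t_target   # k draft steps + one verify pass
        baseline = t_target                       # plain decoding: one target pass per token
        return tokens_per_round * baseline / time_per_round

    # Illustrative numbers: 70B target at ~25 ms/token, 1B draft at ~3 ms/token, 5 drafted tokens
    for a in (0.5, 0.7, 0.8):
        print(f"acceptance {a:.0%}: ~{spec_decode_speedup(a, k=5, t_target=25, t_draft=3):.2f}x")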
Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.
At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency gap can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s, even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.
There is an actively (solo-dev) maintained fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use llama.cpp's FA (still performs worse than with FA disabled) and in certain long-context inference generations (on llama-bench and w/ this ShareGPT serving test you won't see much difference): https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc.) is disappointing ... but sadly on brand for AMD GPUs.
Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs
Intriguing. I thought AMD GPUs didn't have tensor cores (or matrix multiplication units) like Nvidia; I believe they only have dot product / fused multiply-accumulate instructions.
Are these LLMs just absurdly memory bound so it doesn't matter?
They absolutely do have cores similar to tensor cores; they're called matrix cores. And they have particular instructions to utilize them (MFMA).
Note I'm talking about DC compute chips, like MI300.
LLMs aren't memory bound in production loads; they are pretty much compute bound, at least in the prefill phase, but in practice in general too.
They don’t, but GPUs were designed for doing matrix multiplications even without the special hardware instructions for doing matrix multiplication tiles. Also, the forward pass for transformers is memory bound, and that is what does token generation.
> If RAM is the main bottleneck then CPUs should be on the table
That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for many simultaneous reads (spread across several different buses/channels) at the cost of generality (only portions of memory may be available on each bus) and latency (the extra complexity means individual reads are slower). This makes them fast at doing simple operations on a large amount of data.
CPU memory has far fewer channels, so far fewer reads can be in flight at once (each a cache-line read), but each one completes relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).
Memory bandwidth is the bottleneck for both when running GEMV, which is the main operation used by token generation in inference. It has always been this way.
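Rough illustration (batch size 1, ignoring KV-cache traffic and overheads; the bandwidth numbers are just ballpark): each generated token has to stream essentially all of the weights through the chip once for the GEMV-heavy forward pass, so bandwidth sets an upper bound on tokens/s.

    model_bytes = 8e9 * 0.5   # e.g. an 8B-parameter model at ~4 bits/weight
    for name, bw in [("PCIe 5.0 x16", 64e9),
                     ("RTX 3090 / 7900 XTX", 0.95e12),
                     ("datacenter-class HBM", 3e12)]:
        print(f"{name:22s} ~{bw / model_bytes:5.0f} tokens/s upper bound")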
Gigabytes per second? What is this, bandwidth for ants?
My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
Yes, CXL will soon benefit from PCIe Gen 7 x16, with 64GB/s expected in 2025, and non-HBM bandwidth I/O alternatives are improving rapidly. For most near-real-time LLM inference it will be feasible. For the majority of SME companies and other DIY users (humans or ants) with their localized LLMs, it should not be an issue [1],[2]. In addition, new techniques for more efficient LLMs are being discovered to reduce memory consumption [3].
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
The smaller LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2 year old pleb GPU (1TB/s), which is 10x less than a state of the art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
That seems unlikely given that the full HBM supply for the next year has been earmarked for enterprise GPUs. That said, it would be definitely nice if HBM became available for consumer GPUs.
I got a "gaming" PC for LLM inference with an RTX 3060. I could have gotten more VRAM for my buck with AMD, but didn't because at the time a lot of inference needed CUDA.
As soon as AMD is as good as Nvidia for inference, I'll switch over.
But I've read on here that their hardware engineers aren't even given enough hardware to test with...
[1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/