The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for all Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3].
The problem is that the specs of AMD consumer-grade GPUs do not translate to compute performance when you try to chain more than one together.
I have 7 NVidia 4090s under my desk happily chugging along on week long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.
The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4) instead of (1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable released afterward by either AMD or Nvidia crippled realistic FP64 throughput to below what an AVX-512 many-core CPU can do.
On the other hand, for double precision a Radeon Pro VII is many times faster than an RTX 4090 (due to a 1:2 vs. 1:64 FP64:FP32 ratio).
Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will have about the same speed, regardless of what kind of computations are performed. Being limited by memory bandwidth is said to happen frequently in ML/AI inferencing.
Even the single precision given by the previous poster is seldom used for inference or training.
Because the previous poster had mentioned only single precision, where the RTX 4090 is better, I had to complete the data with double precision, where the RTX 4090 is worse, and memory bandwidth, where the RTX 4090 is about the same; otherwise people may believe that progress in GPUs over five years has been much greater than it really is.
Moreover, memory bandwidth is very relevant for inference, much more relevant than FP32 throughput.
You might find the journey of Tinycorp's Tinybox interesting. It's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter, plus other info in George Hotz's livestreams.
EPYC + Supermicro + C-Payne retimers/cabling. 208-240V power typically mandatory for the most affordable power supplies (chain a server/crypto PSU for the GPUs from ParallelMiner to an ATX PSU for general use).
The ASRock Rack ROMED8-2T has seven PCIe x16 slots. They're too close together to directly put seven 4090s on the board, but you'd just need some riser cables to mount the cards on a frame.
It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.
I have come across quite a few startups who are trying a similar idea: break the Nvidia monopoly by utilizing AMD GPUs (for inference at least): Felafax, Lamini, tensorwave (partially), SlashML. Even saw optimistic claims from some of them that the CUDA moat is only 18 months deep [1]. Let's see.
AMD GPUs are becoming a serious contender for LLM inference. vLLM is already showing impressive performance on AMD [1], even with consumer-grade Radeon cards (with GGUF support, even) [2]. This could be a game-changer for folks who want to run LLMs without shelling out for expensive NVIDIA hardware.
AMD decided not to release a high-end GPU this cycle, so any investment in 7x00 or 6x00 cards is going to be wasted: the Nvidia 5x00 series is likely going to destroy any ROI from the older cards, and AMD won't have an answer for at least two years, possibly never, given their absence from high-end consumer GPUs usable for compute.
Peculiar business model, at a glance. It seems like they're doing work that AMD ought to be doing, and is probably doing behind the scenes. Who is the customer for a third-party GPU driver shim?
I've recently been poking around with Intel oneAPI and IPEX-LLM. While there are things that I find refreshing (like their ability to actually respond to bug reports in a timely manner, or at all), on the whole, support/maturity actually doesn't match the current state of ROCm.
PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work - both the source and the Docker image failed to build/run for me. The IPEX-LLM whisper support is completely borked, etc, etc.
I've recently been trying to get IPEX working as well, apparently picking Ubuntu 24.04 was a mistake, because while things compile, everything fails at runtime. I've tried native, docker, different oneAPI versions, threw away a solid week of afternoons for nothing.
SYCL with llama.cpp is great though, at least at FP16 (since it supports nothing else); even Arc iGPUs easily give 2-4x the performance of CPU inference.
Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.
My testing has been w/ a Lunar Lake Core 258V chip (Xe2 - Arc 140V) on Arch Linux. It sounds like you've tried a lot of things already, but in case it helps, my notes for installing llama.cpp and PyTorch: https://llm-tracker.info/howto/Intel-GPUs
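One quick sanity check I run before anything else (a sketch assuming IPEX is installed and registering the xpu device; exact API names can shift a bit between releases):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch

    if torch.xpu.is_available():
        print("Device:", torch.xpu.get_device_name(0))
        x = torch.randn(1024, 1024, device="xpu", dtype=torch.float16)
        print("Matmul OK:", (x @ x).shape)  # a failure here usually means a runtime/driver mismatch
    else:
        print("No xpu device visible - check oneAPI runtime / driver versions")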
I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128), so it's worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than the Vulkan and CPU backends, though.
As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if that's not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Well that's funny, I think we already spoke on Reddit. I'm the guy who was testing the 125H recently. I guess there's like 5 of us who have intel hardware in total and we keep running into each other :P
Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.
I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn ollama API or llama-cpp-python these days, and I'm a fan of neither since it's just another layer of headaches to get those compiled with SYCL.
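For what it's worth, when a project does accept an OpenAI-style base URL, pointing it straight at llama-server has worked fine for me. Roughly (assuming the SYCL build of llama-server is already running on its default port 8080):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # llama-server ignores the key
    resp = client.chat.completions.create(
        model="local",  # llama-server serves whatever model it was started with
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)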
> I found a lot of stuff in that repo that depend on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."
Yeah well the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it for months makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.
Fwiw, on 22.04 I can use the current kernel but otherwise follow Intel's instructions, and the stuff works (old as it is now). I'm currently trying to figure out the best way to finetune Qwen 2.5 3B; the old axolotl ain't up to it. Not sure if I'm gonna work on a fork of axolotl or try something else at this point.
Big agree on Intel working on SYCL. I've run millions of tasks through SYCL llama.cpp at this point, and though SYCL reliably does 5-6x the prompt processing speed of the Vulkan builds, current Vulkan builds are now up to 50% faster at token generation than SYCL on my Intel GPU.
More cynical take: this would be a bad strategy, because Intel hasn't shown much competence in its leadership for a long time, especially in regards to GPUs.
The B580 being a "success" is purely a business decision as a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, and they are selling it at a lower price.
That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.
It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.
I was reading this whole thread as about technical accomplishment and non-nvidia GPU capabilities, not business. So I think you're talking about different definitions of "Success". Definitely counts, but not what I was reading.
Is it a loss leader? I looked up the price of 16Gbit GDDR6 ICs the other day at dramexchange and the cost of 12GB is $48. Using the gamer nexus die measurements, we can calculate that they get at least 214 dies per wafer. At $12095 per wafer, which is reportedly the price at TSMC for 5nm wafers in 2025, that is $57 per die.
While defects ordinarily reduce yields, Intel put plenty of redundant transistors into the silicon. This is ordinarily not possible to estimate, but Tom Petersen reported in his interview with hardware unboxed that they did not count those when reporting the transistor count. Given that the density based on reported transistors is about 40% less than the density others get from the same process and the silicon in GPUs is already fairly redundant, they likely have a backup component for just about everything on the die. The consequence is that they should be able to use at least 99% of those dies even after tossing unusable dies, such that the $57 per die figure is likely correct.
As for the rest of the card, there is not much in it that would not be part of the price of an $80 Asrock motherboard. The main thing would be the bundled game, which they likely can get in bulk at around $5 per copy. This seems reasonable given how much Epic games pays for their giveaways:
That brings the total cost to $190. If we assume Asrock and the retailer both have a 10% margin on the $80 motherboard used as a substitute for the costs of the rest of the things, then it is $174. Then we need to add margins for board partners and the retailers. If we assume they both get 10% of the $250, then that leaves a $26 profit for Intel, provided that they have economies of scale such that the $80 motherboard approximation for the rest of the cost of the graphics card is accurate.
That is about a 10% margin for Intel. That is not a huge margin, but provided enough sales volume (to match the sales volume Asrock gets on their $80 motherboards), Intel should turn a profit on these versus not selling these at all. Interestingly, their board partners are not able/willing to hit the $250 MSRP and the closest they come to it is $260 so Intel is likely not sharing very much with them.
It should be noted that Tom Petersen claimed during his hardware unboxed interview that they were not making money on these. However, that predated the B580 being a hit and likely relied on expected low production volumes due to low sales projections. Since the B580 is a hit and napkin math says it is profitable as long as they build enough of them, I imagine that they are ramping production to meet demand and reach profitability.
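For anyone who wants to poke at the napkin math themselves, roughly (same assumptions as above; every input is an estimate):

    wafer_price = 12095                      # reported TSMC 5nm wafer price for 2025
    dies_per_wafer = 214                     # lower bound from the Gamers Nexus die measurements
    die_cost = wafer_price / dies_per_wafer  # ~$57

    gddr6_cost = 48                          # 12 GB of 16 Gbit GDDR6 at dramexchange pricing
    rest_of_card = 80 - 2 * 0.10 * 80        # $80 motherboard proxy minus Asrock's and the retailer's 10% margins
    game_bundle = 5                          # bundled game, bulk pricing guess

    cost = die_cost + gddr6_cost + rest_of_card + game_bundle   # ~$174
    msrp = 250
    partner_and_retail_cut = 2 * 0.10 * msrp                    # 10% each for board partner and retailer
    intel_profit = msrp - partner_and_retail_cut - cost         # ~$26
    print(f"cost ~${cost:.0f}, Intel profit ~${intel_profit:.0f} ({intel_profit / msrp:.0%} of MSRP)")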
That's just BOM. When you factor in R&D they are clearly still losing money on B580. There's no way they can recoup R&D this generation with a 10% gross margin.
Still, that's to be expected considering this is still only the second generation of Arc. If they can break even on the next gen, that would be an accomplishment.
To be fair, the R&D is shared with Intel’s integrated graphics as they use the same IP blocks, so they really only need to recoup the R&D that was needed to turn that into a discrete GPU. I do not know how much that is to make any definitive statements, but I can speculate that if it is $50 million and they sell 10 million of these, they more than recoup it. Even if they fail to recoup their R&D funds, they would be losing more money by not selling these at all, since no sales means 0 dollars of R&D would be recouped.
I don’t know if this matters but while the B580 has a die comparable in size to a 4070 (~280mm^2), it has about half the transistors (~17-18 billion), iirc.
Tom Petersen said in a hardware unboxed video that they only reported “active” transistors, such that there are more transistors in the B580 than what they reported. I do not think this is the correct way to report them since one, TSMC counts all transistors when reporting the density of their process and two, Intel is unlikely to reduce the reported transistor count for the B570, which will certainly have fewer active transistors.
That said, the 4070 die is 294mm^2 while the B580 die is 272mm^2.
Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.
I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.
MLID on Intel is starting to become the same as UserBenchmark on AMD (except for the generally reputable sources)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.
The margins might be describable as razor thin, but they are there. Whether it can recoup the R&D that they spent designing it is hard to say definitively since I do not have numbers for their R&D costs. However, their iGPUs share the same IP blocks, so the iGPUs should be able to recoup the R&D costs that they have in common with the discrete version. Presumably, Intel can recoup the costs specific to the discrete version if they sell enough discrete cards.
While this is not a great picture, it is not terrible either. As long as Intel keeps improving its graphics technology with each generation, profitability should gradually improve. Although I have no insider knowledge, I noticed a few things that they could change to improve their profitability in the next generation:
* Tom Petersen made a big deal about 16-lane SIMD in Battlemage being what games want rather than the 8-lane SIMD in Alchemist. However, that is not quite true, since both Nvidia and AMD graphics use 32-lane SIMD. If the number of lanes really matters, and I certainly can see how it would if game shaders have horizontal operations, then a switch to 32-lane SIMD should yield further improvements.
* Tom Petersen said in his interview with Hardware Unboxed that Intel reported the active transistor count for the B580 rather than the total transistor count. This is contrary to others, who report the total transistor count (as evidenced by their density figures being close to what TSMC claims the process can do). Tom Petersen also stated that they would not necessarily be forced by defects to turn dies into B570 cards. This suggests to me that they have substantial redundant logic in the GPU to prevent defects from rendering chips unusable, and that logic is intended to be disabled in production. GPUs are already highly redundant. They could drop much of the planned dark silicon and let defects force a larger percentage of the dies to be usable only as cut-down models.
I could have read too much into things that Tom Petersen said. Then again, he did say that their design team is conservative and the doubling rather than quadrupling of the SIMD lane count and the sheer amount of dark silicon (>40% of the die by my calculation) spent on what should be redundant components strike me as conservative design choices. Hopefully the next generation addresses these things.
Also, they really do have >40% dark silicon when doing density comparisons:
They have 41% less density than Nvidia and 48% less density than what TSMC claims the process can obtain. We also know from Tom Petersen's comments that there are additional transistors on the die that are not active. Presumably, they are for redundancy; otherwise, there really is no sane explanation that I can see for so much dark silicon. If they are using transistors that are twice the size, as the density figure might be interpreted to suggest, they might as well have used TSMC's 7nm process: a smaller process can etch larger features, but doing so is a waste of money.
Note that we can rule out the cache lowering the density. The L1 + L2 cache on the 4070 Ti is 79872 KB while it is 59392 KB on the B580. We can also rule out IO logic as lowering the density, as the 4070 Ti has a 256-bit memory bus while the B580 has a 192-bit memory bus.
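For reference, here is the density arithmetic, using the inputs that reproduce the percentages above (these are my assumptions: ~19.6 billion reported transistors on the 272 mm^2 B580 die, ~35.8 billion on the ~294 mm^2 AD104, and ~138 MTr/mm^2 as the process density figure implied by the 48% claim):

    b580_transistors, b580_area = 19.6e9, 272        # reported count, die area in mm^2
    ad104_transistors, ad104_area = 35.8e9, 294      # 4070 Ti die
    process_density = 138e6                          # transistors/mm^2 implied by the 48% figure

    b580_density = b580_transistors / b580_area      # ~72 MTr/mm^2
    ad104_density = ad104_transistors / ad104_area   # ~122 MTr/mm^2

    print(f"vs Nvidia:  {1 - b580_density / ad104_density:.0%} lower")   # ~41%
    print(f"vs process: {1 - b580_density / process_density:.0%} lower") # ~48%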
> Tom Petersen made a big deal about 16-lane SIMD in Battlemage [...]
Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.
And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.
For context, Alchemist was SIMD8 in Intel’s terminology. They made a big deal out of this at the alchemist launch if I recall correctly since they thought it would be more efficient. Unfortunately, it turned out to not be more efficient.
Anyway, Tom Petersen did a bunch of interviews before the Intel B580 launch. In the hardware unboxed interview, he mentioned it, but accidentally misspoke. I must have interpreted his misspeak as meaning games want SIMD16 and noted it that way in my mind, as what he says elsewhere seems to suggest that games want SIMD16. It was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:
> We also have native SIMD support, SIMD16 native support, which is going to say that you don't have to like recode your compute shader to match a particular topology. You can use the one that you use for everyone else, and it'll just run well on ARC. So I'm pretty excited about that.
In an interview with gamers nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:
At the start of the Gamers Nexus video, Steve mentions that Tom's slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.
As for 32-lane SIMD being an improvement over 16 lanes: while I do not write shaders, I have written CUDA kernels, and in CUDA kernels you sometimes need to do what Nvidia calls a parallel reduction across lanes (Intel's CPU division calls them horizontal operations). For example, you might need to sum across all lanes in order to calculate an average. When you have native 32-lane SIMD, you can do this without going to shared memory, which is extremely fast. If you need to emulate a higher lane width, you need to do a trip through shared memory, which is not as fast. If game shaders are written with the assumption that 32-lane SIMD is used, then having 32-lane SIMD is going to be more performant for them. Intel's slide attributes a 0.3 ms reduction in render time to 16-lane SIMD, and they likely would see a further reduction with 32-lane SIMD, since that is what games should actually target, as that is what both AMD (since RDNA 1) and Nvidia (since Turing) use.
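If it helps, here is a toy sketch (in Python, not a real shader) of the structural difference; the values are made up and it only illustrates why emulating a wider reduction needs an extra round trip:

    import numpy as np

    values = np.arange(32, dtype=np.float32)       # one wavefront's worth of lanes

    # 32-lane SIMD: the whole reduction happens lane-to-lane in registers
    total_native = values.sum()

    # 16-lane SIMD emulating a 32-wide reduction: each half reduces in registers,
    # then the partial sums have to meet through shared memory (modeled as an array here)
    partials = values.reshape(2, 16).sum(axis=1)   # two independent 16-lane reductions
    shared_memory = partials.copy()                # the extra store/load the emulation needs
    total_emulated = shared_memory.sum()

    assert total_native == total_emulated          # same answer, but the emulated path touches memory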
The die size of the B580 is 272 mm2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.
272 mm2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.
Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be. Availability issues, lack of competition, other more lucrative avenues etc.
Intel has none of those, or at least not to the same extent.
At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.
The only way this would be at a loss is if they refuse to raise production to meet demand. That said, I believe their margins on these are unusually low for the industry. They might even fall into razor thin territory.
Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.
That being said, the IP blocks are shared by their iGPUs, so the discrete GPUs do not need to recoup the costs of most of the R&D, as it would have been done anyway for the iGPUs.
That guy’s reasoning is faulty. To start, he has made math mistakes in every video that he has posted recently involving math. To give 3 recent examples:
At 10m3s in the following video, he claims to add a 60% margin by multiplying by 1.6, but in reality that adds a 37.5% margin; he needed to multiply by 2.5 to add a 60% margin. This can be calculated via Cost Scaling Factor = 1 / (1 - Normalized Profit Margin):
At 48m13s in the following video, he claims that Intel’s B580 is 80% worse than Nvidia’s hardware. He took the 4070 Ti as being 82% better than the 2080 SUPER, assumed based on leaks from his reviewer friends that the B580 was about at the performance of the 2080 SUPER and then claimed that the B580 would be around 80% worse than the 4070 Ti. Unfortunately for him, that is 45% worse, not 80% worse. His chart is from Techpowerup and if he had taken the time to do some math (1 - 1/(1 + 0.82) ~ 0.45), or clicked to the 2080 SUPER page, he would have seen it has 55% of the performance of the 4070 Ti, which is 45% worse:
At 1m2s in the following video, he makes a similar math mistake by saying that the B580 has 8% better price/performance than the RTX 3060 when in fact it is 9% better. He mistakenly equated the RTX 3060 being 8% worse than the B580 with the B580 being 8% better, but math does not work that way. Luckily for him, the math error is small here, but he still failed to do the math correctly, and his reasoning grows increasingly faulty with the scale of his math errors. What he should have done to get the correct normalized factor is:
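The arithmetic, for anyone who wants to check it:

    # 1) To add a 60% margin (margin as a fraction of the selling price),
    #    divide cost by (1 - margin); multiplying by 1.6 only yields a 37.5% margin.
    cost = 100.0
    print(cost / (1 - 0.60))        # 250.0 -> the 2.5x factor
    print(1 - cost / (cost * 1.6))  # 0.375 -> what multiplying by 1.6 actually gives

    # 2) If the 4070 Ti is 82% faster than the 2080 SUPER (~B580 per the leaks he used),
    #    the B580 is ~45% slower than the 4070 Ti, not 80%.
    print(1 - 1 / 1.82)             # ~0.45

    # 3) If the RTX 3060 has 8% worse price/performance than the B580,
    #    the B580 is ~9% better, not 8%.
    print(1 / (1 - 0.08) - 1)       # ~0.087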
He not only fails at mathematical reasoning, but also lacks a basic understanding of how hardware manufacturing works. He said that if Intel loses $20 per card at low production volumes, then making 10 million cards will result in a $200 million loss. In reality, things become cheaper due to economies of scale, and simple napkin math shows that they can turn a profit on these cards:
His behavior is consistent with being on a vendetta rather than being a technology journalist. For example, at 55m13s in the following video, he puts words in Tom Petersen's mouth and then, with a malicious smile on his face, cheers while claiming that Tom Petersen declared discrete ARC cards to be dead, when Tom Petersen said nothing of the kind. Earlier in the same video, at around 44m14s, he calls Tom Petersen a professional liar. However, he sees no problem expecting people to believe words he shoved into the "liar's" mouth:
If you scrutinize his replies to criticism in his comments section, you would see he is dodging criticism of the actual issues with his coverage while saying “I was right about <insert thing completely unrelated to the complaint here>” or “facts don’t care about your feelings”. You would also notice that he is copy and pasting the same statements rather than writing replies addressing the details of the complaints. To be clear, I am paraphrasing in those two quotes.
He also shows contempt for his viewers that object to his behavior in the following video around 18m53s where he calls them “corporate cheerleaders”:
In short, Tom at MLID is unable to do mathematical reasoning, does not understand how hardware manufacturing works, has a clear vendetta against Intel’s discrete graphics, is unable to take constructive criticism and lashes out at those who try to tell him when he is wrong. I suggest being skeptical of anything he says about Intel’s graphics division.
Interesting. I wonder if focusing on GPUs and CPUs is something that requires two companies instead of one, whether the concentration of resources just leads to one arm of your company being much better than the other.
> Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
From their announcement on 20241219[^0]:
"We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."
From 20241211[^1]:
"We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.
We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."
Is there no hope for AMD anymore? After George Hotz/Tinygrad gave up on AMD I feel there’s no realistic chance of using their chips to break the CUDA dominance.
Maybe from Modular (the company Chris Lattner is working for). In this recent announcement they said they had achieved competitive ML performance… on NVIDIA GPUs, but with their own custom stack completely replacing CUDA. And they’re targeting AMD next.
Quite frankly, I have difficulty reconciling a lot of comments here with that, and my own experience as an AMD GPU user (although not for compute, and not on Windows).
tl;dr there's a not-insubstantial number of people who learn a lot from geohot. I'd say about 3% of people here would be confused if you thought of him as less than a top technical expert across many comp sci fields.
And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.
He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.
So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you'd read it as the CEO breezily admitting in public that geohot was right about how there was malfeasance, followed by a cover up, implying extreme dereliction of duty, because she either helped or didn't realize till now.
I'd argue this is partially due to stonk-ification of discussions, there was a vague, yet often communicated, sense there was something illegal happening. Idea was it was financial dereliction of duty to shareholders.
IMO the hope shouldn't be that AMD specifically wins, rather it's best for consumers that hardware becomes commoditized and prices come down.
And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.
In CPUs, AMD has made many innovations that were copied by Intel only after many years, and this delay contributed significantly to Intel's downfall.
The most important has been that AMD correctly predicted that big monolithic CPUs would no longer be feasible in future CMOS fabrication technologies, so they designed the Zen family from the beginning with a chiplet-based architecture. Intel attempted to ridicule them, but after losing many billions they were forced to copy this strategy.
Also, in the microarchitecture of their CPUs, AMD has made the right choices since the beginning and then improved it constantly with each generation. The result is that the latest Intel big core, Lion Cove, now has a microarchitecture that is much more similar to AMD's Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.
In the distant past, AMD also introduced a lot of innovations long before they were copied by Intel. It is true that those had not been invented by AMD, but had themselves been copied by AMD from more expensive CPUs, like the DEC Alpha, Cray, or IBM POWER; still, Intel copied them only after being forced to by the competition with AMD.
Everything is comparative. AMD isn't perfect. As an ex-shareholder, I have argued they did well partly because of Intel's downfall. In terms of execution they are far from perfect.
But Nvidia is a different beast. It is a bit like Apple in the late 00s: business, forecasting, marketing, operations, software, hardware, sales, etc. Take any part of it and they are all industry leading. And having industry-leading capability is only part of the game; having it all work together is completely another thing. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.
Most people underestimate the magnitude of the task required (I like to tell the story of an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of an organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why most people's estimates are often off by an order of magnitude.
We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
By comparison if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.
They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.
Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.
What effect did the DOJ have on MS in the 90s? Didn't all of that get rolled back before they had to pay a dime, and all it amounted to was that browser choice screen that was around for a while? Hardly a crippling blow. If anything that showed the weakness of regulators in fights against big tech, just outlast them and you're fine.
>I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.
It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.
Also, I'd take HN as being an amazing platform for the overall consistency and quality of moderation. Anything beyond that depends more on who you're talking to than where.
Oh, there's basically no chance of getting that on the Internet.
The Internet is a machine that highly simplifies the otherwise complex technical challenge of wide-casting ignorance. It wide-casts wisdom too, but it's an exercise for the reader to distinguish them.
Everyone who's dug deep into what AMD is doing has left in disgust if they are lucky, and bankruptcy if they are not.
If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
> If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.
This seems like unuseful advice if you've already given up on them.
You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.
Meanwhile the people who attempt it apparently seem to get acquired by Nvidia, for some strange reason. Which implies it should be a worthwhile thing to do. If they've fixed it by now which you wouldn't know if you've stopped looking, or they fix it in the near future, you have a competitive advantage because you have access to lower cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you get a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.
I've seen people try it every six months for two decades now.
At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.
Here's hoping China and RISC-V save us.
>Meanwhile the people who attempt it apparently seem to get acquired by Nvidia
Everyone I've seen base jumping has gotten a sponsorship from Red Bull; ergo, everyone should base jump.
> At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.
AMD has always punched above their weight. Historically their problem was that they were the much smaller company and under heavy resource constraints.
Around the turn of the century the Athlon was faster than the Pentium III and then they made x86 64-bit when Intel was trying to screw everyone with Itanic. But the Pentium 4 was a marketing-optimized design that maximized clock speed at the expense of heat and performance per clock. Intel was outselling them even though the Athlon 64 was at least as good if not better. The Pentium 4 was rubbish for laptops because of the heat problems, so Intel eventually had to design a separate chip for that, but they also had the resources to do it.
That was the point at which AMD made their biggest mistake. When they set out to design their next chip, the competition was the Pentium 4, so they made a power-hungry monster designed to hit high clock speeds at the expense of performance per clock. But the reason more people didn't buy the Athlon 64 wasn't that they couldn't figure out that a 2.4GHz CPU could be faster than a 2.8GHz CPU; it was all the anti-competitive shenanigans Intel was doing behind closed doors to e.g. keep PC OEMs from featuring systems with AMD CPUs. Meanwhile, by then Intel had figured out that the Pentium 4 was, in fact, a bad design, when their own Pentium M laptops started outperforming the Pentium 4 desktops. So the Pentium 4 line got canceled, and Bulldozer eventually had to go up against the Pentium M-derived Core line, which nearly bankrupted AMD and compromised their ability to fund the R&D needed to sustain state-of-the-art fabs.
Since then they've been climbing back out of the hole but it wasn't until Ryzen in 2017 that you could safely conclude they weren't on the verge of bankruptcy, and even then they were saddled with a lot of debt and contracts requiring them to use the uncompetitive Global Foundries fabs for several years. It wasn't until Zen4 in 2022 that they finally got to switch the whole package to TSMC.
So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.
Have you tried compute shaders instead of that weird HPC-only stuff?
Compute shaders are widely used by millions of gamers every day. GPU vendors have a huge incentive to make them reliable and efficient: modern game engines use them for lots of things; e.g., UE5 can even render triangle meshes with GPU compute instead of the graphics pipeline (the tech is called Nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml
I'd be very concerned if somebody makes a $100K decision based on a comment where the author couldn't even differentiate between the words "constitutionally" and "institutionally", while providing as much substance as any other random techbro on any random forum and being overwhelmingly oblivious to it.
I have been playing around with Phi-4 Q6 on my 7950x and 7900XT (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU alone - in practical terms it beats hosted models due to the roundtrip time. Obviously perf is more important if you're hosting this stuff, but we've definitely reached AMD usability at home.
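For the PyTorch route, the trick looks roughly like this (just a sketch; the override value shown is the one commonly used for RDNA3 cards, adjust for your GPU):

    import os
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")  # illustrative value; set before torch loads the ROCm runtime

    import torch  # ROCm builds of PyTorch reuse the torch.cuda namespace

    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))          # should report the Radeon card
        x = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
        print((x @ x).norm())                         # smoke test that kernels actually run
    else:
        print("ROCm device not visible - check the override / ROCm install")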
It’s not terribly hard to port ML inference to alternative GPU APIs. I did it for D3D11 and the performance is pretty good too: https://github.com/Const-me/Cgml
The only catch is, for some reason developers of ML libraries like PyTorch aren’t interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones i.e. CUDA and to lesser extent ROCm. I don’t know why that is.
D3D-based videogames have been heavily using GPU compute for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.
Isn't part of it because the first-party libraries like cuDNN are only available through CUDA? Nvidia has poured a ton of effort into tuning those libraries so it's hard to justify not using them.
Unlike training, ML inference is almost always bound by memory bandwidth as opposed to computations. For this reason, tensor cores, cuDNN, and other advanced shenanigans make very little sense for the use case.
OTOH, general-purpose compute instead of fixed-function blocks used by cuDNN enables custom compression algorithms for these weights which does help, by saving memory bandwidth. For example, I did custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL codes: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
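To illustrate the idea (this is not the BCML1 codec from the repo above, just a toy absmax block-quantization sketch; a real kernel would bit-pack the values and dequantize on the fly):

    import numpy as np

    def quantize_block(w, bits=5):
        """Quantize a 1-D block of weights to signed integers plus one FP16 scale."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)  # int8 for clarity only
        return q, np.float16(scale)

    def dequantize_block(q, scale):
        return q.astype(np.float32) * np.float32(scale)

    w = np.random.randn(64).astype(np.float32)      # one 64-weight block
    q, scale = quantize_block(w, bits=5)
    print("max abs error:", np.abs(w - dequantize_block(q, scale)).max())
    # Stored size: 64 * 5 bits + a 16-bit scale ~= 336 bits vs 64 * 16 = 1024 bits for FP16,
    # i.e. roughly 3x less memory traffic per block.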
Only local (read: batch size 1) ML inference is memory bound; production loads are pretty much compute bound. The prefill phase is very compute bound, and with continuous batching the generation phase gets mixed with prefill, which makes the whole process compute bound too. So no, tensor cores and all the other shenanigans are absolutely critical for a performant inference infrastructure.
PyTorch is a project by Linux foundation. The about page with the mission of the foundation contains phrases like “empowering generations of open source innovators”, “democratize code”, and “removing barriers to adoption”.
I would argue running local inference with batch size=1 is more useful for empowering innovators compared to running production loads on shared servers owned by companies. Local inference increases count of potential innovators by orders of magnitude.
BTW, in the long run it may also benefit these companies because in theory, an easy migration path from CUDA puts a downward pressure on nVidia’s prices.
Most people running local inference do so through quants with llama.cpp (which runs on everything) or awq/exl2/mlx with vLLM/tabbyAPI/lmstudio, which are much faster than using PyTorch directly.
llama.cpp has a much bigger supported model list, as does vLLM and of course PyTorch/HF transformers covers everything else, all of which work w/ ROCm on RDNA3 w/o too much fuss these days.
For inference, the biggest caveat is that Flash Attention is only an aotriton implementation, which besides sometimes being less performant, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.
(Note the recent SemiAccurate piece was on training, which I'd agree is in a much worse state - I have personal experience with it being often broken for even the simplest distributed training runs. Funnily enough, if you're running simple fine tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090; 4090s blow both of those away, but then you'll probably want more cards and VRAM, or to just move to H100s.)
Great. I have yet to understand why the ML community doesn't really push to move away from CUDA. To me, it feels like a dinosaur move to build on top of CUDA, which screams proprietary; nothing about it is open source or cross-platform.
The reason I say it's a dinosaur move is: imagine if we as a dev community had continued to build on top of Flash or Microsoft Silverlight...
LLMs and ML have been around for quite a while; with the pace of AI/LLM advancement, the transition to cross-platform should have happened much more quickly. But it hasn't yet, and it's not clear when it will.
Building a translation layer on top of CUDA is not the answer to this problem either.
For me personally, hacking together projects as a hobbyist, two reasons:
1. It just works. When I tried to build things on Intel Arcs, I spent way more hours fighting with IPEX and driver issues than developing.
2. LLMs seem to have more CUDA code in their training data. I can leverage Claude and 4o to help me build things with CUDA, but trying to get them to help me do the same things on IPEX just doesn't work.
I'd very much love a translation layer for CUDA, like a dxvk or Wine equivalent.
Would save a lot of money since Arc gpus are in the bargain bin and nvidia cloud servers are double the price of AMD.
As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.
If CUDA counts as “just works”, I dread to see the dark, unholy rituals you need to invoke to get ROCm to work. I have spent too many hours browsing the Nvidia forums for obscure error codes and driver messages to ever consider updating my CUDA install and every time I reboot my desktop for an update I dread having to do it all over again.
Except I never hear complaints about CUDA from a quality perspective. The complaints are always about lock in to the best GPUs on the market. The desire to shift away is to make cheaper hardware with inferior software quality more usable. Flash was an abomination, CUDA is not.
Flash was popular because it was an attractive platform for the developer. Back then there was no HTML5 and browsers didn't otherwise support a lot of the things Flash did. Flash Player was an abomination, it was crashy and full of security vulnerabilities, but that was a problem for the user rather than the developer and it was the developer choosing what to use to make the site.
This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?
There are far more people running llama.cpp, various image generators, etc. than there are people developing that code. Even when the "users" are corporate entities, they're not necessarily doing any development in excess of integrating the existing code with their other systems.
We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.
What this is waiting for is hardware in the hands of the users that can actually do this for a mass market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to be pushing that hard and despite the price they do a lot of volume, and then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about, the cost of which isn't actually that high, e.g. 40GB of GDDR6 is less than $100.
The cuda situation is definitely better. The nvidia struggles are now with the higher-level software they’re pushing (triton, tensor-llm, riva, etc), tools that are the most performant option when they work, but a garbage developer experience when you step outside the golden path
I want to double down on this statement, and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on Arm hardware. One might presume Nvidia would give attention to an architecture they develop for, but the way forward is not easy. For some versions of Ubuntu, you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.
I think you've mixed up your Tritons; I'm talking about Triton Inference Server from NVIDIA, while you're talking about Triton, the CUDA replacement from OpenAI.
I believe these efforts are very important. If we want this stuff to be practical we are going to have to work on efficiency. Price efficiency is good. Power and compute efficiency would be better.
I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions, but it's interesting. I need to look at llamafile next.
Reality check for anyone considering this: I just got a used 3090 for $900 last month. It works great.
I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.
I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.
All that said, this advice will expire in a month or two when the 5090 comes out.
I've bought 5 used and they're all perfect. But that's what buyer protection on ebay is for. Had to send back an Epyc mobo with bent pins and ebay handled it fine.
I bought a used 3090 last year for ML, and while it works fine and has the correct DRAM and stuff, when I tried gaming on it I noticed that it is significantly slower than my 3080. I'm not sure if the seller pulled some shenanigans on me or the card actually degraded during whatever mining they did.
Just beware, the card might be "working fine" on a first glance, but actually be damaged.
Modular claims that it achieves 93% GPU utilization on AMD GPUs [1]; the official preview release is coming early next year, so we'll see. I must say I'm bullish because of the feedback I've seen people give about the performance on Nvidia GPUs.
Just an FYI, this is a writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.
That being said, I did some very recent inference testing on a W7900 (using the same testing methodology used by Embedded LLM's recent post to compare to vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21GB weights) performed about 35% faster than llama.cpp w/ Q4_K_M on their ROCm/HIP backend (4.30GB weights, a 2% size difference).
That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. Also llama.cpp's (recently added to llama-server) speculative decoding can also give some pretty sizable performance gains. Using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
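For intuition on why the draft model helps, the generic speculative-decoding math looks roughly like this (the timings below are made up for illustration, not measurements, and llama.cpp's actual scheduler differs in detail):

    def spec_decode_speedup(accept_rate, k, t_target, t_draft):
        # expected tokens emitted per verification round, assuming i.i.d. acceptance:
        # the accepted draft prefix plus one token from the target's verify pass
        tokens_per_round = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
        time_per_round = k * t_draft + t_target   # k draft steps + one verify pass
        baseline = t_target                       # plain decoding: one target pass per token
        return tokens_per_round * baseline / time_per_round

    # Illustrative numbers: 70B target at ~25 ms/token, 1B draft at ~3 ms/token, 5 drafted tokens
    for a in (0.5, 0.7, 0.8):
        print(f"acceptance {a:.0%}: ~{spec_decode_speedup(a, k=5, t_target=25, t_draft=3):.2f}x")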
Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.
At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency gap can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s, even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.
There is an actively (solo-dev) maintained fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use llama.cpp's FA (still performs worse than with FA disabled) and in certain long-context inference generations (on llama-bench and w/ this ShareGPT serving test you won't see much difference): https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc.) is disappointing ... but sadly on brand for AMD GPUs.
Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs
Intriguing. I thought AMD GPUs didn't have tensor cores (or matrix multiplication units) like Nvidia; I believe they only have dot product / fused multiply-accumulate instructions.
Are these LLMs just absurdly memory bound so it doesn't matter?
They absolutely do have cores similar to tensor cores; they're called matrix cores. And they have particular instructions to utilize them (MFMA).
Note I'm talking about DC compute chips, like MI300.
LLMs aren't memory bound in production loads; they are pretty much compute bound, at least in the prefill phase, but in practice in general too.
They don’t, but GPUs were designed for doing matrix multiplications even without the special hardware instructions for doing matrix multiplication tiles. Also, the forward pass for transformers is memory bound, and that is what does token generation.
> If RAM is the main bottleneck then CPUs should be on the table
That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for many simultaneous reads (spread across several different buses/channels) at the cost of generality (only portions of memory may be available on each bus) and latency (the extra complexity means individual reads are slower). This makes them fast at doing simple operations on a large amount of data.
CPU memory has far fewer channels, so far fewer reads can be in flight at once (each a cache-line read), but each one completes relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).
Memory bandwidth is the bottleneck for both when running GEMV, which is the main operation used by token generation in inference. It has always been this way.
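Rough illustration (batch size 1, ignoring KV-cache traffic and overheads; the bandwidth numbers are just ballpark): each generated token has to stream essentially all of the weights through the chip once for the GEMV-heavy forward pass, so bandwidth sets an upper bound on tokens/s.

    model_bytes = 8e9 * 0.5   # e.g. an 8B-parameter model at ~4 bits/weight
    for name, bw in [("PCIe 5.0 x16", 64e9),
                     ("RTX 3090 / 7900 XTX", 0.95e12),
                     ("datacenter-class HBM", 3e12)]:
        print(f"{name:22s} ~{bw / model_bytes:5.0f} tokens/s upper bound")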
Gigabytes per second? What is this, bandwidth for ants?
My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
Yes, CXL will soon benefit from PCIe Gen 7 x16, with 64GB/s expected in 2025, and non-HBM bandwidth I/O alternatives are improving rapidly. For most near-real-time LLM inference it will be feasible. For the majority of SME companies and other DIY users (humans or ants) with their localized LLMs, it should not be an issue [1],[2]. In addition, new techniques for more efficient LLMs are being discovered to reduce memory consumption [3].
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
The smaller LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2 year old pleb GPU (1TB/s), which is 10x less than a state of the art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
That seems unlikely given that the full HBM supply for the next year has been earmarked for enterprise GPUs. That said, it would be definitely nice if HBM became available for consumer GPUs.
I got a "gaming" PC for LLM inference with an RTX 3060. I could have gotten more VRAM for my buck with AMD, but didn't because at the time a lot of inference needed CUDA.
As soon as AMD is as good as Nvidia for inference, I'll switch over.
But I've read on here that their hardware engineers aren't even given enough hardware to test with...
[1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/