Gigabytes per second? What is this, bandwidth for ants?
My years-old, pleb-tier, non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
Yes, CXL will soon benefit from PCIe Gen 7 x16, with 64GB/s expected in 2025, and non-HBM I/O bandwidth alternatives are increasing rapidly by the day. For most near-real-time LLM inference it will be feasible. The majority of SMEs and other DIY users (humans or ants) running their localized LLMs should have no issues [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
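As a rough sanity check (my own back-of-envelope, not from the articles: it assumes batch-1 decoding is limited by how fast the weights can be streamed over the link, and a hypothetical 7B model quantized to 4 bits, about 3.5 GB):

```python
# Back-of-envelope: batch-1 decode speed is roughly bandwidth / bytes touched per token,
# and a dense model touches essentially all of its weights for every generated token.
model_bytes = 7e9 * 0.5      # hypothetical 7B model at 4-bit quantization, ~3.5 GB
link_bandwidth = 64e9        # the 64 GB/s figure cited above

tokens_per_second = link_bandwidth / model_bytes
print(f"~{tokens_per_second:.0f} tokens/s upper bound")  # roughly 18 tokens/s
```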
The smaller-LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2-year-old pleb GPU (1TB/s), which is 10x less than a state-of-the-art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
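In tokens per second, using the same rule of thumb as above (batch-1 decode ceiling ≈ bandwidth / model size; the 70B-at-4-bit figure of ~35 GB is my illustrative assumption, not a measurement):

```python
# Rough batch-1 decode ceilings at each bandwidth tier, assuming every generated
# token streams the full set of weights (~35 GB for a hypothetical 4-bit 70B model).
model_bytes = 35e9
tiers = {
    "CXL link, 64 GB/s":       64e9,
    "Gen7 x16, 256 GB/s":      256e9,
    "Pleb GPU, 1 TB/s":        1e12,
    "Datacenter GPU, 10 TB/s": 10e12,
}
for name, bandwidth in tiers.items():
    print(f"{name}: ~{bandwidth / model_bytes:.1f} tokens/s")
# prints ~1.8, ~7.3, ~28.6, ~285.7 tokens/s respectively
```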
Nothing says it can't be useful. My most-used model is running on a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article), half of your workloads will stop dead with driver crashes, and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
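If it helps, here's the quick arithmetic I do when deciding how many cards to stack (VRAM only; KV cache and activations push the real requirement a bit higher, and the model sizes here are rough placeholder figures):

```python
import math

# How many 24 GB cards it takes just to hold the weights in VRAM.
# Sizes are approximate 4-bit-quantized figures (assumptions, not measurements).
CARD_VRAM_GB = 24
models_gb = {"7B @ 4-bit": 3.5, "13B @ 4-bit": 6.5, "70B @ 4-bit": 35}

for name, size_gb in models_gb.items():
    cards = math.ceil(size_gb / CARD_VRAM_GB)
    print(f"{name}: ~{size_gb} GB -> {cards} card(s)")
```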
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
That seems unlikely given that the full HBM supply for the next year has been earmarked for enterprise GPUs. That said, it would definitely be nice if HBM became available for consumer GPUs.