Gigabytes per second? What is this, bandwidth for ants?
My years-old, pleb-tier, non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.
Yes, CXL will soon benefit from PCIe Gen 7 x16, with 64GB/s expected in 2025, and non-HBM I/O bandwidth alternatives are increasing rapidly by the day. For most near-real-time LLM inference it will be feasible. The majority of SMEs and other DIY users (humans or ants) running their localized LLMs should have no issues [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].
[1] Forget ChatGPT: why researchers now run small AIs on their laptops:
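As a rough sanity check (my own back-of-envelope, not from the articles: it assumes batch-1 decoding is limited by how fast the weights can be streamed over the link, and a hypothetical 7B model quantized to 4 bits, about 3.5 GB):

```python
# Back-of-envelope: batch-1 decode speed is roughly bandwidth / bytes touched per token,
# and a dense model touches essentially all of its weights for every generated token.
model_bytes = 7e9 * 0.5      # hypothetical 7B model at 4-bit quantization, ~3.5 GB
link_bandwidth = 64e9        # the 64 GB/s figure cited above

tokens_per_second = link_bandwidth / model_bytes
print(f"~{tokens_per_second:.0f} tokens/s upper bound")  # roughly 18 tokens/s
```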
The smaller-LLM stuff in 1 and 2 is overrated. LLMs get plenty of things wrong, and while the capabilities of small LLMs are increasing, they are just never as good as the larger LLMs in my testing. To give an example, between a small LLM that gets things right 20% of the time and a large one that gets things right 40% of the time, you are never going to want to deal with the small one. Even when they improve, you will just find new things that they are not able to do well. At least, that is my experience.
Finally, the 75% savings figure in 3 is misleading. It applies to the context, not the LLMs themselves. It is very likely that nobody will use it, since it is a form of lossy compression that will ruin the ability of the LLM to repeat what is in its memory.
No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2-year-old pleb GPU (1TB/s), which is 10x less than a state-of-the-art professional GPU (10TB/s), which is what the cloud services will be using.
That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.
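In tokens per second, using the same rule of thumb as above (batch-1 decode ceiling ≈ bandwidth / model size; the 70B-at-4-bit figure of ~35 GB is my illustrative assumption, not a measurement):

```python
# Rough batch-1 decode ceilings at each bandwidth tier, assuming every generated
# token streams the full set of weights (~35 GB for a hypothetical 4-bit 70B model).
model_bytes = 35e9
tiers = {
    "CXL link, 64 GB/s":       64e9,
    "Gen7 x16, 256 GB/s":      256e9,
    "Pleb GPU, 1 TB/s":        1e12,
    "Datacenter GPU, 10 TB/s": 10e12,
}
for name, bandwidth in tiers.items():
    print(f"{name}: ~{bandwidth / model_bytes:.1f} tokens/s")
# prints ~1.8, ~7.3, ~28.6, ~285.7 tokens/s respectively
```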
Nothing says it can't be useful. My most-used model is running on a microcontroller. Just keep those expectations tempered.
(EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)
Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article), half of your workloads will stop dead with driver crashes, and you need to decide if that's worth $500 to you.
Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.
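If it helps, here's the quick arithmetic I do when deciding how many cards to stack (VRAM only; KV cache and activations push the real requirement a bit higher, and the model sizes here are rough placeholder figures):

```python
import math

# How many 24 GB cards it takes just to hold the weights in VRAM.
# Sizes are approximate 4-bit-quantized figures (assumptions, not measurements).
CARD_VRAM_GB = 24
models_gb = {"7B @ 4-bit": 3.5, "13B @ 4-bit": 6.5, "70B @ 4-bit": 35}

for name, size_gb in models_gb.items():
    cards = math.ceil(size_gb / CARD_VRAM_GB)
    print(f"{name}: ~{size_gb} GB -> {cards} card(s)")
```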
It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.
That seems unlikely given that the full HBM supply for the next year has been earmarked for enterprise GPUs. That said, it would definitely be nice if HBM became available for consumer GPUs.