Any chance these work on CPUs with any acceptable performance?
I have a 10-core, 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here: https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
I don't know about these large models, but earlier I saw a random HN comment in a different thread where someone showed a GPT-J model running on CPU only: https://github.com/ggerganov/ggml
I tested it on my Linux machine and my MacBook Air M1, and it generates tokens at a reasonable speed using the CPU only. I noticed it doesn't quite use all of my available CPU cores, so it may be leaving some performance on the table; not sure though.
GPT-J 6B is nowhere near as large as the OPT-175B in the post, but it gave me the sense that CPU-only inference may not be totally hopeless even for large models, if only we had some high-quality software to do it.
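I haven't dug further than that, but to give a feel for what CPU-only generation looks like in code, here's a minimal sketch using Hugging Face transformers rather than ggml (the model id, thread count, and generation settings are just assumptions on my part; GPT-J 6B in fp32 needs roughly 24 GB of RAM):

```python
# Minimal CPU-only text generation sketch (PyTorch + transformers, not ggml).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(20)  # try matching your logical core count

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")  # fp32 on CPU

inputs = tokenizer("CPU-only inference is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```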
There's also Fabrice Bellard's inference code: https://textsynth.com/technology.html. He claims up to 41 tokens per second on the GPT-NeoX 20B model.
Your CPU gets maybe 700-800 GFLOPS depending on your all-core frequency (fp32, since you don't have Sapphire Rapids). The T4 benchmarked would be crunching what it can at ~65 TFLOPS (fp16 tensor). Newer GPUs hit ~300 TFLOPS (4090) or even nearly 2 PFLOPS (H100).
That should give you an idea of the order of magnitude of the compute difference. Sapphire Rapids has AMX and fp16 AVX-512 to close the gap a little, but it's still massive.
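Back-of-envelope with the rough numbers above (all approximate peak figures, and comparing fp32 on the CPU against fp16 tensor-core throughput, so take the ratios loosely):

```python
# Rough peak-throughput ratios from the approximate figures above.
cpu_fp32_tflops = 0.75    # ~700-800 GFLOPS
t4_fp16_tflops = 65
h100_fp16_tflops = 2000   # "nearly 2 petaflops"

print(round(t4_fp16_tflops / cpu_fp32_tflops))    # ~87x
print(round(h100_fp16_tflops / cpu_fp32_tflops))  # ~2667x
```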
With what, 50GB/s memory bandwidth? That's no monster. The two consumer GPUs in my machine both do 1TB/s and are still bottlenecked on memory bandwidth.
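To put numbers on that: during decoding you stream essentially all the weights once per generated token, so memory bandwidth puts a hard ceiling on tokens per second. The figures below are illustrative assumptions (fp16 weights, no overlap or caching tricks), not measurements:

```python
# Upper bound on decode speed if every weight is read once per token.
def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_per_s):
    model_gb = params_billion * bytes_per_param  # model size in GB, roughly
    return bandwidth_gb_per_s / model_gb

print(max_tokens_per_sec(6, 2, 50))      # GPT-J 6B, fp16, ~50 GB/s   -> ~4 tok/s
print(max_tokens_per_sec(6, 2, 1000))    # same model at ~1 TB/s      -> ~83 tok/s
print(max_tokens_per_sec(175, 2, 1000))  # OPT-175B, fp16, ~1 TB/s    -> ~2.9 tok/s
```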
> only works with the large proprietary driver here
In practice, nothing works without the proprietary driver, so this isn't specific to temperature. Also, the setting you're looking for is almost certainly `nvidia-smi -pl $watts` to set the power limit, not whatever that wiki gives you. Roughly: GPU temperature = ambient temperature + (power limit) * (thermal resistance of the cooler).
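Plugging made-up but plausible numbers into that relation (the thermal resistance here is an assumed value, not a spec for any particular cooler):

```python
# Steady-state GPU temperature from the relation above (illustrative numbers only).
ambient_c = 25.0                   # room temperature, degrees C
power_limit_w = 250.0              # e.g. after `nvidia-smi -pl 250`
thermal_resistance_c_per_w = 0.15  # assumed cooler + case airflow

print(ambient_c + power_limit_w * thermal_resistance_c_per_w)  # 62.5 C
```

Lowering the power limit lowers the steady-state temperature roughly linearly, which is why `-pl` is usually all you need.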
The other answers give you a few of the current solutions.
In the long term I am hoping that JAX (/XLA) will get better support for the CPU backend of its compiler and, in particular, use SIMD and multiple cores better than it currently does.
It is very doable (just low priority), and it would mean that a lot of models could get close to optimal CPU performance out of the box, which would be a step forward for accessibility.
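As a concrete example of the kind of workload I mean, here's a tiny JAX snippet that JIT-compiles through XLA's CPU backend; how well the generated code uses wide SIMD and all the cores is exactly the part I'm hoping improves (the shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

print(jax.devices())  # on a GPU-less machine this lists only CPU devices

@jax.jit
def layer(x, w, b):
    # One dense layer; XLA lowers the matmul to its CPU backend.
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 4096))
w = jax.random.normal(key, (4096, 4096))
b = jnp.zeros(4096)
print(layer(x, w, b).shape)  # (64, 4096)
```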