Both. Cheaper CPU-based inference, since GPUs are not as competitive for sparse linear algebra.
This could lead to much larger models, since you only touch a small portion of the weight matrix during inference. However, training is still dense linear algebra on a GPU, so the compute cost still blows up as you increase model size.
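To make the "only touch a small portion of the matrix" point concrete, here is a minimal sparse matrix-vector product over a CSR (compressed sparse row) layout. This is a generic illustration of sparse inference cost, not code from any particular library: only the stored nonzeros are ever loaded, so memory traffic scales with the nonzero count rather than the full matrix size.

```rust
// CSR stores only the nonzeros, plus indexing metadata.
struct Csr {
    row_ptr: Vec<usize>, // row i's nonzeros live at row_ptr[i]..row_ptr[i+1]
    col_idx: Vec<usize>, // column of each nonzero
    vals: Vec<f32>,      // the nonzero values themselves
}

// y = M * x, touching only the stored nonzeros.
fn spmv(m: &Csr, x: &[f32]) -> Vec<f32> {
    (0..m.row_ptr.len() - 1)
        .map(|i| {
            (m.row_ptr[i]..m.row_ptr[i + 1])
                .map(|k| m.vals[k] * x[m.col_idx[k]])
                .sum()
        })
        .collect()
}

fn main() {
    // A 3x3 matrix with only 3 nonzeros:
    // [1 0 2]
    // [0 0 0]
    // [0 3 0]
    let m = Csr {
        row_ptr: vec![0, 2, 2, 3],
        col_idx: vec![0, 2, 1],
        vals: vec![1.0, 2.0, 3.0],
    };
    let y = spmv(&m, &[1.0, 1.0, 1.0]);
    println!("{:?}", y); // [3.0, 0.0, 3.0]
}
```

The irregular, data-dependent indexing in `col_idx` is exactly what makes this pattern awkward for GPUs and more forgiving on CPUs.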
GPU utilization should drop when using this technique. I'm hoping this could allow for more efficient batch inference on GPUs: if you can predict 10 tokens for the price of 1, it should let you do tree-of-thought much more efficiently.
A lot of CPU inference libraries (llama.cpp included) use as much SIMD as possible, sometimes via hand-written loops. The one I hack on, llama.rs, uses portable_simd but specializes to your CPU at compile time.
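As a sketch of the hand-vectorized style described above: a dot product accumulated across 8 independent lanes, so the compiler can map the inner loop onto SIMD registers (AVX on x86, NEON on ARM). This is not llama.rs's actual code, and since `std::simd` (portable_simd) is nightly-only, it uses a fixed-width accumulator array that the optimizer vectorizes the same way.

```rust
// Lane-parallel dot product; LANES is an assumed width, not tuned.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];
    let chunks = a.len() / LANES * LANES;
    for i in (0..chunks).step_by(LANES) {
        for l in 0..LANES {
            acc[l] += a[i + l] * b[i + l]; // one independent FMA per lane
        }
    }
    // Horizontal reduction of the lanes, then the scalar tail.
    let mut sum: f32 = acc.iter().sum();
    for i in chunks..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![2.0f32; 10];
    println!("{}", dot(&a, &b)); // 2 * (0 + 1 + ... + 9) = 90
}
```

The point of the separate lanes is breaking the serial dependency chain of a naive `sum += a[i] * b[i]` loop, which otherwise bottlenecks on floating-point add latency.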
My experience has been that most CPU inference is actually not compute-limited but memory-bandwidth-limited, since each weight is used for only a few operations per token (how quickly can you stream the entire 70 GB of weights through your registers?). It's not quite that bad, but I found most vectorization changes didn't meaningfully change performance.
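The bandwidth argument above is easy to check with back-of-envelope arithmetic: a dense decode step streams every weight once per token, so tokens/sec is capped at bandwidth divided by model size. The 70 GB figure comes from the comment; the ~50 GB/s number is an assumed typical dual-channel DDR5 desktop bandwidth, not a measurement.

```rust
// Roofline-style ceiling: if every weight byte must be read once per
// token, token rate can't exceed memory bandwidth / weight size.
fn max_tokens_per_sec(weight_bytes: f64, mem_bw_bytes_per_sec: f64) -> f64 {
    mem_bw_bytes_per_sec / weight_bytes
}

fn main() {
    let weights = 70e9; // ~70 GB of weights, per the comment above
    let ddr5 = 50e9;    // assumed dual-channel DDR5, ~50 GB/s
    println!("{:.2} tok/s", max_tokens_per_sec(weights, ddr5)); // 0.71 tok/s
}
```

Under that assumption the ceiling is well under 1 token/sec regardless of how well the arithmetic is vectorized, which matches the observation that SIMD tweaks barely move the needle.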