
Both. Cheaper CPU-based inference: GPUs are not as competitive for sparse linear algebra. This could also enable much larger models, since you only touch a small portion of the weight matrix during inference. However, the training here is still dense linear algebra on a GPU, so you still blow up the compute cost when increasing model size.
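A minimal sketch of why sparse inference touches so little of the model: with weights stored in CSR form, a matrix-vector product reads only the stored nonzeros, so per-token cost scales with the nonzero count rather than rows × cols. The function name and layout here are illustrative, not taken from any particular library:

```c
#include <stddef.h>

/* Hypothetical CSR sparse matrix-vector product: y = W x.
 * Only the stored nonzeros of W are touched, so the work done per token
 * scales with nnz, not rows * cols. */
void spmv_csr(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const float *vals, const float *x, float *y) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        /* row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r */
        for (size_t i = row_ptr[r]; i < row_ptr[r + 1]; i++)
            acc += vals[i] * x[col_idx[i]];
        y[r] = acc;
    }
}
```

A 70 GB dense model at 5% density would only move ~3.5 GB per token through this loop, which is where the CPU-inference appeal comes from.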


GPU utilization should go down when using this technique. I'm hoping this could allow for more efficient batch inference on GPUs: if you can predict 10 tokens for the price of 1, it should let you do tree of thought much more efficiently.

https://github.com/princeton-nlp/tree-of-thought-llm
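One way to see the batching win: a weight matrix streamed from memory once can serve a whole batch of inputs, so the dominant memory traffic per token drops roughly by the batch size. A sketch in plain C (function name and layouts are made up for illustration; W is row-major, X and Y hold B vectors each):

```c
#include <stddef.h>

/* Batched matvec: Y[b] = W * X[b] for b in 0..B-1.
 * W is streamed exactly once regardless of B; each row stays hot in
 * cache across the inner batch loop, so the per-token weight traffic
 * is amortized over the whole batch. */
void matvec_batched(size_t rows, size_t cols, size_t B,
                    const float *W, const float *X, float *Y) {
    for (size_t r = 0; r < rows; r++) {        /* one pass over W */
        const float *w = W + r * cols;
        for (size_t b = 0; b < B; b++) {       /* reuse row for every input */
            float acc = 0.0f;
            for (size_t c = 0; c < cols; c++)
                acc += w[c] * X[b * cols + c];
            Y[b * rows + r] = acc;
        }
    }
}
```

This is the same reason batched decoding (and search schemes like tree of thought that fan out many candidate continuations) can be far cheaper per token than serial decoding.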


Has anyone used SIMD instructions to try to speed up CPU inference?


A lot of CPU inference libraries (llama.cpp included) use as much SIMD as possible, sometimes by hand-writing loops. The one I hack on, llama.rs, uses portable_simd but specializes to your CPU at compile time.
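For illustration, this is the style of inner loop such libraries hand-vectorize: several independent accumulators, so the compiler (or hand-written intrinsics) can map them onto SIMD lanes and keep the FMA units busy instead of serializing on one running sum. A simplified sketch, assuming n is a multiple of 4:

```c
#include <stddef.h>

/* Dot product with 4 independent accumulators. The partial sums have no
 * dependency on each other, so they can live in separate SIMD lanes;
 * real libraries widen this to the register width of the target CPU.
 * Assumes n % 4 == 0 for brevity. */
float dot4(const float *a, const float *b, size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < 4; j++)
            acc[j] += a[i + j] * b[i + j];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Compile-time specialization (as in the portable_simd approach mentioned above) amounts to picking the lane count and instruction set for the build target instead of dispatching at runtime.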

My experience has been that most CPU inference is actually not compute-limited but memory-bandwidth-limited, since most weights are used for only a few operations per token (how quickly can you stream the entire 70 GB of weights through your registers?). It's not quite that bad in practice, but I found most vectorization changes didn't meaningfully change performance.
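To make that bandwidth ceiling concrete, here is a back-of-the-envelope roofline under assumed figures (70 GB of weights and ~50 GB/s of main-memory bandwidth are illustrative, not measurements):

```c
/* Roofline for memory-bound decoding: if every generated token must
 * stream all weights from RAM once, tokens/sec is capped by
 * bandwidth / weight bytes, no matter how fast the ALUs are. */
double max_tokens_per_sec(double weight_bytes, double mem_bw_bytes_per_sec) {
    return mem_bw_bytes_per_sec / weight_bytes;
}

/* e.g. 70e9 bytes of weights over 50e9 bytes/s of bandwidth gives an
 * upper bound of ~0.71 tokens/sec, which is why faster vector math
 * alone doesn't move the needle. */
```

Quantization helps precisely because it shrinks weight_bytes, raising this ceiling without touching compute.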


Would you say that is the state-of-the-art CPU inference library?


ggml with a BLAS backend could be one example of it. See for instance: https://github.com/ggerganov/ggml/blob/57c468b8655f3630d1749... for the parts that are not available in BLAS.


Most inference builds on top of BLAS libraries, which in their implementation take advantage of SIMD.


Note this doesn't speed up training.




