
Both. Cheaper CPU-based inference: GPUs are not as competitive for sparse linear algebra. This could also enable much larger models, since you only touch a small portion of the weight matrix during inference. However, the training here is still dense linear algebra on a GPU, so you still blow up the compute cost when increasing model size.
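A minimal sketch of why sparse inference touches so little of the model: with weights stored in CSR form, a matrix-vector product reads only the stored nonzeros, so per-token cost scales with the nonzero count rather than rows × cols. The function name and layout here are illustrative, not taken from any particular library:

```c
#include <stddef.h>

/* Hypothetical CSR sparse matrix-vector product: y = W x.
 * Only the stored nonzeros of W are touched, so the work done per token
 * scales with nnz, not rows * cols. */
void spmv_csr(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const float *vals, const float *x, float *y) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        /* row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r */
        for (size_t i = row_ptr[r]; i < row_ptr[r + 1]; i++)
            acc += vals[i] * x[col_idx[i]];
        y[r] = acc;
    }
}
```

A 70 GB dense model at 5% density would only move ~3.5 GB per token through this loop, which is where the CPU-inference appeal comes from.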


GPU utilization should go down when using this technique. I'm hoping this could allow for more efficient batch inference on GPUs: if you can predict 10 tokens for the price of 1, it should let you do tree of thought much more efficiently.

https://github.com/princeton-nlp/tree-of-thought-llm
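One way to see the batching win: a weight matrix streamed from memory once can serve a whole batch of inputs, so the dominant memory traffic per token drops roughly by the batch size. A sketch in plain C (function name and layouts are made up for illustration; W is row-major, X and Y hold B vectors each):

```c
#include <stddef.h>

/* Batched matvec: Y[b] = W * X[b] for b in 0..B-1.
 * W is streamed exactly once regardless of B; each row stays hot in
 * cache across the inner batch loop, so the per-token weight traffic
 * is amortized over the whole batch. */
void matvec_batched(size_t rows, size_t cols, size_t B,
                    const float *W, const float *X, float *Y) {
    for (size_t r = 0; r < rows; r++) {        /* one pass over W */
        const float *w = W + r * cols;
        for (size_t b = 0; b < B; b++) {       /* reuse row for every input */
            float acc = 0.0f;
            for (size_t c = 0; c < cols; c++)
                acc += w[c] * X[b * cols + c];
            Y[b * rows + r] = acc;
        }
    }
}
```

This is the same reason batched decoding (and search schemes like tree of thought that fan out many candidate continuations) can be far cheaper per token than serial decoding.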


Has anyone used SIMD instructions to try to speed up CPU inference?


A lot of CPU inference libraries (llama.cpp included) use as much SIMD as possible, sometimes by hand-writing loops. The one I hack on, llama.rs, uses portable_simd but specializes to your CPU at compile time.
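For illustration, this is the style of inner loop such libraries hand-vectorize: several independent accumulators, so the compiler (or hand-written intrinsics) can map them onto SIMD lanes and keep the FMA units busy instead of serializing on one running sum. A simplified sketch, assuming n is a multiple of 4:

```c
#include <stddef.h>

/* Dot product with 4 independent accumulators. The partial sums have no
 * dependency on each other, so they can live in separate SIMD lanes;
 * real libraries widen this to the register width of the target CPU.
 * Assumes n % 4 == 0 for brevity. */
float dot4(const float *a, const float *b, size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (size_t j = 0; j < 4; j++)
            acc[j] += a[i + j] * b[i + j];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Compile-time specialization (as in the portable_simd approach mentioned above) amounts to picking the lane count and instruction set for the build target instead of dispatching at runtime.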

My experience has been that most CPU inference is actually not compute-limited but memory-bandwidth-limited, since most weights are used for only a few operations per token (how quickly can you stream the entire 70 GB of weights through your registers?). It's not quite that bad in practice, but I found most vectorization changes didn't meaningfully change performance.
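To make that bandwidth ceiling concrete, here is a back-of-the-envelope roofline under assumed figures (70 GB of weights and ~50 GB/s of main-memory bandwidth are illustrative, not measurements):

```c
/* Roofline for memory-bound decoding: if every generated token must
 * stream all weights from RAM once, tokens/sec is capped by
 * bandwidth / weight bytes, no matter how fast the ALUs are. */
double max_tokens_per_sec(double weight_bytes, double mem_bw_bytes_per_sec) {
    return mem_bw_bytes_per_sec / weight_bytes;
}

/* e.g. 70e9 bytes of weights over 50e9 bytes/s of bandwidth gives an
 * upper bound of ~0.71 tokens/sec, which is why faster vector math
 * alone doesn't move the needle. */
```

Quantization helps precisely because it shrinks weight_bytes, raising this ceiling without touching compute.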


Would you say that is the state-of-the-art CPU inference library?


ggml with a BLAS backend could be one example of it. See for instance: https://github.com/ggerganov/ggml/blob/57c468b8655f3630d1749... for the parts that are not available in BLAS.


Most inference builds on top of BLAS libraries, which in their implementation take advantage of SIMD.


Note this doesn't speed up training.




