We wrote our inference engine in Rust, and it is faster than llama.cpp in all of the use cases. Your feedback is very welcome. It was written from scratch with the idea that you can add support for any kernel and platform.
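A minimal sketch of what "add support for any kernel and platform" could look like in Rust, assuming a trait-based backend design. The names `Backend`, `CpuBackend`, `Engine`, and `matvec` are hypothetical and not taken from the project. The idea is that the engine only talks to a trait, so a new platform (Metal, ANE, CUDA, ...) becomes a new `impl` rather than a change to the engine itself.

```rust
// Hypothetical sketch: these names are illustrative, not the project's actual API.

/// A compute backend (e.g. CPU, Metal, ANE). Each backend supplies its own kernels.
trait Backend {
    fn name(&self) -> &'static str;
    /// Matrix-vector product: returns m * v, where m is rows x cols in row-major order.
    fn matvec(&self, m: &[f32], rows: usize, cols: usize, v: &[f32]) -> Vec<f32>;
}

/// Reference CPU backend using a plain scalar kernel.
struct CpuBackend;

impl Backend for CpuBackend {
    fn name(&self) -> &'static str {
        "cpu"
    }

    fn matvec(&self, m: &[f32], rows: usize, cols: usize, v: &[f32]) -> Vec<f32> {
        assert_eq!(m.len(), rows * cols);
        assert_eq!(v.len(), cols);
        (0..rows)
            .map(|r| {
                m[r * cols..(r + 1) * cols]
                    .iter()
                    .zip(v)
                    .map(|(a, b)| a * b)
                    .sum::<f32>()
            })
            .collect()
    }
}

/// The engine holds a trait object, so it never depends on a concrete platform.
struct Engine {
    backend: Box<dyn Backend>,
}

impl Engine {
    fn new(backend: Box<dyn Backend>) -> Self {
        Engine { backend }
    }
}

fn main() {
    let engine = Engine::new(Box::new(CpuBackend));
    // A 2x3 matrix times a 3-vector.
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let v = [1.0, 0.0, -1.0];
    let out = engine.backend.matvec(&m, 2, 3, &v);
    println!("{} -> {:?}", engine.backend.name(), out); // cpu -> [-2.0, -2.0]
}
```

Dynamic dispatch costs one indirect call per kernel launch, which is negligible next to the kernel itself; a real engine might instead use generics or per-op dispatch tables.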
How was your experience using Rust on this project? I'm considering a project in an adjacent space and I'm trying to decide between Rust, C, and Zig. Rust seems a bit burdensome with its complexity compared to C and Zig. It reminds me of C++ in its complexity (although not as bad). I find it difficult to walk through and understand a complicated Rust repository. I don't have that problem with C and Zig for the most part.
But I'm wondering if I just need to invest more time in Rust. How was your learning curve with the language?
You are confusing familiarity with intrinsic complexity. I had 20 years of experience with C/C++ before switching to Rust a few years ago. After the initial hurdle, it is way easier and very simple to follow.
Are you generally able to quickly understand what is going on in somebody else's codebase written in Rust? I find it quite difficult to understand other people's Rust code. Is this just a familiarity thing? I have not written anything particularly huge or complex in Rust, but I have written a few CLI utilities. With an equivalent level of Go exposure, I find it much easier to understand code written in Go, compared to code written in Rust.
I'm quite proficient in C/C++ (started coding in C/C++ in 1997) but I still have a much harder time understanding a new C++ project compared to a C project.
Hoping the author can answer; I'm still learning about how this all works. My understanding is that inference is "using the model," so to speak. How is this faster than established inference engines, specifically on Mac? Are models generic enough that an inference engine focused on, e.g., AMD GPUs or even Intel GPUs would achieve reasonable performance? I always assumed that because Nvidia is king of AI you had to suck it up, or is it just that most inference engines in use are married to Nvidia?
I would love to understand how universal these models can become.
Basically, "faster" means better performance, e.g. more tokens/s, without losing quality (benchmark scores for the models). So when we say faster, we mean we generate more tokens per second than llama.cpp. That means we make effective use of the available hardware APIs (for example, we wrote our own kernels) to perform better.
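A minimal sketch of how a tokens-per-second figure is typically computed: time a decode loop and divide the number of generated tokens by the elapsed wall-clock time. The `decode_step` closure is a hypothetical stand-in for one forward pass that emits a single token; it is not an API of the engine discussed here. A fair comparison would also check that benchmark scores are unchanged.

```rust
use std::time::{Duration, Instant};

/// Decode throughput in tokens per second.
fn tokens_per_second(n_tokens: usize, elapsed: Duration) -> f64 {
    n_tokens as f64 / elapsed.as_secs_f64()
}

/// Times `n_tokens` calls of `decode_step`, a hypothetical one-token forward pass.
/// Returns the generated tokens and the measured throughput.
fn measure<F: FnMut() -> u32>(n_tokens: usize, mut decode_step: F) -> (Vec<u32>, f64) {
    let start = Instant::now();
    let tokens: Vec<u32> = (0..n_tokens).map(|_| decode_step()).collect();
    let tps = tokens_per_second(tokens.len(), start.elapsed());
    (tokens, tps)
}

fn main() {
    // 128 tokens in 2 seconds is 64 tokens/s.
    println!("{}", tokens_per_second(128, Duration::from_secs(2))); // 64

    // Stub "model" that just counts; a real engine would run a forward pass here.
    let mut next = 0u32;
    let (tokens, tps) = measure(32, || {
        next += 1;
        next
    });
    println!("generated {} tokens at {:.1} tok/s", tokens.len(), tps);
}
```

Real benchmarks usually report prompt processing (prefill) and generation (decode) throughput separately, since they stress the hardware differently.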
I just spun up an AWS EC2 g6.xlarge instance to do some LLM work. The GPU is an NVIDIA L4 with 24GB, and the instance costs $0.8048 per hour. I'm starting to think about switching to an Apple mac2-m2.metal instance for $0.878 per hour. The big question is that the Mac instance only has 24GB of unified memory.
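A rough way to judge whether a model fits in 24GB is to estimate the size of its weights: parameters times bits per weight, divided by 8. This is a back-of-the-envelope sketch only; it ignores the KV cache, activations, runtime overhead, and the share of unified memory the OS keeps for itself, so actual usage will be higher. The parameter counts and bit widths below are illustrative choices, and GB here means 10^9 bytes.

```rust
/// Approximate weight memory in GB (10^9 bytes): params (billions) * bits / 8.
/// Weights only: KV cache, activations and OS/runtime overhead are NOT included.
fn weight_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    let budget_gb = 24.0;
    // Illustrative (parameter count in billions, bits per weight) pairs.
    for (params, bits) in [(8.0, 16.0), (8.0, 4.0), (32.0, 4.0), (70.0, 4.0)] {
        let gb = weight_gb(params, bits);
        println!(
            "{params}B @ {bits}-bit ≈ {gb:.1} GB of weights, under {budget_gb} GB: {}",
            gb < budget_gb
        );
    }
}
```

Because the estimate is a lower bound, a model whose weights only just fit (such as 32B at 4-bit, about 16 GB) may still not leave enough room for a long context.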
You're right: modern edge devices are powerful enough to run small models, so the real bottleneck for a forward pass is usually memory bandwidth, which sets the theoretical upper limit on inference speed. Right now, we've figured out how to run computations in a granular way on specific processing units, but we expect the real benefits to come later, when we add support for VLMs and advanced speculative decoding, where you process more than one token at a time.
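The bandwidth argument can be made concrete with a small sketch. If every generated token requires reading all the weights from memory once, decode speed cannot exceed bandwidth divided by weight size. The 100 GB/s and 4 GB figures below are illustrative assumptions, not measurements of any specific device. Processing several tokens per pass beats this ceiling by amortizing one weight read over multiple tokens; `accept_greedy` sketches the verification step of the *greedy* variant of speculative decoding only (not the sampling-based acceptance rule), and is not the project's implementation.

```rust
/// Ceiling on decode speed when each token requires reading every weight once.
/// Ignores KV-cache reads, compute time and imperfect bandwidth utilisation,
/// so real throughput sits below this bound.
fn max_tokens_per_second(bandwidth_gb_s: f64, weights_gb: f64) -> f64 {
    bandwidth_gb_s / weights_gb
}

/// Greedy speculative decoding, verification step.
/// `draft` holds k tokens proposed by a small draft model.
/// `target_argmax[i]` is the large model's greedy token at position i, computed for
/// all k + 1 positions in a single forward pass (one read of the weights).
/// Returns the longest matching draft prefix plus one token from the target model.
fn accept_greedy(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    let n = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    let mut out = draft[..n].to_vec();
    out.push(target_argmax[n]); // correction at first mismatch, or bonus token
    out
}

fn main() {
    // Illustrative: 100 GB/s of bandwidth and a 4 GB model (e.g. 8B params at 4-bit).
    let ceiling = max_tokens_per_second(100.0, 4.0);
    println!("ceiling ≈ {ceiling:.0} tokens/s"); // ceiling ≈ 25 tokens/s

    // Draft proposed [5, 9, 3]; the target agrees on 5 and 9, then prefers 4.
    // One target pass yields 3 tokens instead of 1.
    println!("{:?}", accept_greedy(&[5, 9, 3], &[5, 9, 4, 8])); // [5, 9, 4]
}
```

The speedup depends on how often the draft model agrees with the target model; with a poor draft, each pass still yields at least one token, so output quality is unchanged but the gain shrinks.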
Ollama isn't an inference engine; it's a GUI slapped onto a perpetually out-of-date vendored copy of llama.cpp underneath.
So if you're trying to actually count llama.cpp downloads, you'd combine those two. Also, I imagine most users on OSX aren't using Homebrew; they're getting it directly from the GitHub releases, so you'd also have to count those.
It's using the Apple Neural Engine (ANE) and probably other optimization tools provided by Apple's frameworks. I'm not sure whether llama.cpp uses them, but if it doesn't, then the benchmark on GitHub says it all.
...or D? Or Go? Or Java? C#? Zig? Etc. They chose what they were most comfortable with. Rust is fine; it's clearly not for everyone, but those who use it produce high-quality software. I would argue the same for Go, and both come without all the unnecessary mental overhead of C or C++.