LLMs seem to be a bit more accessible than some other ML models, though, because on a good CPU even LLaMA2 70b is borderline usable (a bit under a token/second on an AMD Ryzen 7950X3D, using ~40 GiB of RAM). Combined with RAM being relatively cheap, this seems to me like the most accessible option for most folks. While an AMD Ryzen 7950X3D or an Intel Core i9 13900K is a relatively expensive part, they're not that bad (you could probably price out two entire rigs for less than the cost of a single RTX 4090), and as a bonus you get pretty excellent performance for code compilation, rendering, and whatever other CPU-bound tasks you might have. If you're like me and have already been buying expensive CPUs to speed up code compilation, the fact that you can just run llama.cpp to mess around is merely a bonus.
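For a rough sense of where the ~40 GiB figure comes from, here's a back-of-envelope calculation; the ~4.5 bits/weight is my assumption about the quant format, not a measurement:

```python
# Back-of-envelope: memory footprint of a quantized 70B model.
# The bits-per-weight figure is an assumption, not a measurement.
params = 70e9               # LLaMA2 70b parameter count
bits_per_weight = 4.5       # roughly what a 4-bit llama.cpp quant works out to (assumption)
total_bytes = params * bits_per_weight / 8
print(f"~{total_bytes / 2**30:.0f} GiB")   # ~37 GiB, plus KV cache and other overhead
```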
Everyone uses [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) to generate their tokens; the tokens are whatever emerges from that process. Handwaving a bit, they typically correspond to the most common substrings in the training corpus in a max-cover sense; it's an encoding that tries to compress the data the tokenizer was trained on as well as possible.
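For intuition, here's a toy sketch of the BPE training loop: repeatedly fuse the most frequent adjacent pair of symbols into a new symbol. The corpus and number of merges are made up, and real tokenizers work on bytes over enormous corpora, but the mechanism is the same:

```python
from collections import Counter

# Toy BPE training sketch: each word starts as a tuple of characters,
# and we repeatedly merge the most frequent adjacent pair of symbols.
corpus = "the cat sat on the mat the cat".split()
vocab = Counter(tuple(word) for word in corpus)

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                      # a handful of merges is enough for a toy corpus
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    print("merging", pair)
    vocab = merge(vocab, pair)
print(vocab)
```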
My amateur intuition, having played around with local LLMs a little bit and watched output come out a token at a time, is that they're conceptually like taking all n-grams of all lengths, sorting them by frequency in the training data, and truncating that list at some point. So the most common words, or even common word+punctuation combinations, will be one token; less common words with "normal" spelling will be a few tokens; and unusual words with atypical letter combinations will be many tokens. So, e.g., " the" will probably be one token, but "qzxv" will probably be four, depending on what the training set was (something mostly trained on Wikipedia will have different tokens than something mostly trained on code).
More common words can be just one token, but most words will be a few tokens. A token is neither a character nor a word; it's more like a word fragment.
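As a quick sanity check of that intuition, here's an example using tiktoken (an OpenAI BPE tokenizer, standing in for whatever tokenizer a given model actually uses; LLaMA's vocabulary would give different counts, but the pattern is the same):

```python
# Common strings encode to few tokens, rare letter combinations to many.
# Counts are specific to this tokenizer; other models' tokenizers will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in [" the", "hello", "tokenization", "qzxv"]:
    ids = enc.encode(s)
    print(f"{s!r}: {len(ids)} token(s) -> {ids}")
```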
I'd like to see some benchmarks. For one thing, I suspect you'd at least want an X3D model for AMD, due to the better cache. But for another, at least according to top, llama.cpp does seem to saturate all of the cores during inference. (Although I didn't try messing around much; I know X3D CPUs don't give all cores "3D V-Cache", so it's possible that limiting inference to just those cores would be beneficial.)
For me it's OK though: I want faster compile times anyway, so it's worth the money. To me, local LLMs are just a curiosity.
> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.
You'd really expect DDR5-6000 to be advantageous. I think AMD Ryzen 7xxx can take advantage of speeds up to at least 5600. Does it perhaps not wind up bottlenecking on memory? Maybe quantization plays a role...
The big cache is irrelevant for this use case. You're memory-bandwidth bound, with a substantial portion of the model read for each token, so a 128 MB cache doesn't help.
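A back-of-envelope version of that bound (the numbers are assumptions, not measurements):

```python
# If every generated token streams most of the quantized weights from RAM,
# memory bandwidth caps the token rate regardless of cache size.
model_bytes = 40e9      # ~40 GB quantized 70b model (assumption)
bandwidth   = 60e9      # ~60 GB/s effective dual-channel DDR5 (optimistic assumption)
print(f"upper bound: ~{bandwidth / model_bytes:.1f} tokens/s")   # ~1.5 tokens/s
```

That's roughly consistent with the "bit under a token/second" figure above, given that real-world effective bandwidth is lower than peak.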
>> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.
That's referring specifically to prompt processing, which uses a batch processing optimization not used in normal inference. The processed prompt can also be cached so you only need to process it again if you change it. Normal inference benefits from faster RAM.
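A toy numpy illustration of why the batched prompt pass behaves so differently from token-at-a-time decoding: multiplying many prompt rows against the weights amortizes each weight read, while a single decode step streams the whole matrix to produce one row (the sizes and the single-matmul "layer" are made up for the sketch):

```python
import time
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)          # one stand-in "layer" of weights
prompt = np.random.randn(512, d).astype(np.float32)   # 512 prompt tokens processed as a batch
token  = np.random.randn(1, d).astype(np.float32)     # one token during normal inference

t0 = time.perf_counter(); _ = prompt @ W; t1 = time.perf_counter()
t2 = time.perf_counter(); _ = token @ W;  t3 = time.perf_counter()
print(f"batched prompt: {(t1 - t0) / 512 * 1e6:8.1f} us/token")
print(f"single decode:  {(t3 - t2) * 1e6:8.1f} us/token")
```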