LLMs seem to be a bit more accessible than some other ML models, though, because on a good CPU even LLaMA2 70b is borderline usable (a bit under a token/second on an AMD Ryzen 7950X3D, using ~40 GiB of RAM). Combined with RAM being relatively cheap, this seems to me like the most accessible option for most folks. While an AMD Ryzen 7950X3D or an Intel Core i9 13900K is a relatively expensive part, they're not that bad (you could probably price out two entire rigs for less than the cost of a single RTX 4090), and as a bonus you get pretty excellent performance for code compilation, rendering, and whatever other CPU-bound tasks you might have. If you're like me and have already been buying expensive CPUs to speed up code compilation, the fact that you can just run llama.cpp to mess around is merely a bonus.
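For a rough sense of where the ~40 GiB figure comes from, here's a back-of-envelope calculation; the ~4.5 bits/weight is my assumption about the quant format, not a measurement:

```python
# Back-of-envelope: memory footprint of a quantized 70B model.
# The bits-per-weight figure is an assumption, not a measurement.
params = 70e9               # LLaMA2 70b parameter count
bits_per_weight = 4.5       # roughly what a 4-bit llama.cpp quant works out to (assumption)
total_bytes = params * bits_per_weight / 8
print(f"~{total_bytes / 2**30:.0f} GiB")   # ~37 GiB, plus KV cache and other overhead
```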
Everyone uses [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) to generate their tokens; the tokens are whatever emerges from that process. Handwaving a bit, they typically correspond to the most common substrings in the training corpus in a max-cover sense; it's an encoding that tries to compress the data the tokenizer was trained on as well as possible.
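For intuition, here's a toy sketch of the BPE training loop: repeatedly fuse the most frequent adjacent pair of symbols into a new symbol. The corpus and number of merges are made up, and real tokenizers work on bytes over enormous corpora, but the mechanism is the same:

```python
from collections import Counter

# Toy BPE training sketch: each word starts as a tuple of characters,
# and we repeatedly merge the most frequent adjacent pair of symbols.
corpus = "the cat sat on the mat the cat".split()
vocab = Counter(tuple(word) for word in corpus)

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                      # a handful of merges is enough for a toy corpus
    pair = most_frequent_pair(vocab)
    if pair is None:
        break
    print("merging", pair)
    vocab = merge(vocab, pair)
print(vocab)
```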
My amateur intuition, having played around with local LLMs a little bit and watched output come out a token at a time, is that they're conceptually like taking all n-grams of all lengths, sorting them by frequency in the training data, and truncating that list at some point. So the most common words, or even common word+punctuation combinations, will be one token; less common words with "normal" spelling will be a few tokens; and unusual words with atypical letter combinations will be many tokens. So, e.g., " the" will probably be one token, but "qzxv" will probably be four, depending on what the training set was (something mostly trained on Wikipedia will have different tokens than something mostly trained on code).
More common words can be just one token, but most words will be a few tokens. A token is neither a character nor a word; it's more like a word fragment.
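As a quick sanity check of that intuition, here's an example using tiktoken (an OpenAI BPE tokenizer, standing in for whatever tokenizer a given model actually uses; LLaMA's vocabulary would give different counts, but the pattern is the same):

```python
# Common strings encode to few tokens, rare letter combinations to many.
# Counts are specific to this tokenizer; other models' tokenizers will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in [" the", "hello", "tokenization", "qzxv"]:
    ids = enc.encode(s)
    print(f"{s!r}: {len(ids)} token(s) -> {ids}")
```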
I'd like to see some benchmarks. For one thing, I suspect you'd at least want an X3D model for AMD, due to the better cache. But for another, at least according to top, llama.cpp does seem to saturate all of the cores during inference. (Although I didn't try messing around much; I know X3D CPUs don't give all cores "3D V-Cache", so it's possible that limiting inference to just those cores would be beneficial.)
For me it's OK though: I want faster compile times anyway, so it's worth the money. To me, local LLMs are just a curiosity.
> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.
You'd really expect DDR5-6000 to be advantageous. I think AMD Ryzen 7xxx can take advantage of speeds up to at least 5600. Does it perhaps not wind up bottlenecking on memory? Maybe quantization plays a role...
The big cache is irrelevant for this use case. You're memory-bandwidth bound, with a substantial portion of the model read for each token, so a 128 MB cache doesn't help.
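A back-of-envelope version of that bound (the numbers are assumptions, not measurements):

```python
# If every generated token streams most of the quantized weights from RAM,
# memory bandwidth caps the token rate regardless of cache size.
model_bytes = 40e9      # ~40 GB quantized 70b model (assumption)
bandwidth   = 60e9      # ~60 GB/s effective dual-channel DDR5 (optimistic assumption)
print(f"upper bound: ~{bandwidth / model_bytes:.1f} tokens/s")   # ~1.5 tokens/s
```

That's roughly consistent with the "bit under a token/second" figure above, given that real-world effective bandwidth is lower than peak.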
>> RAM speed does not matter. The processing time is identical with DDR-6000 and DDR-4000 RAM.
That's referring specifically to prompt processing, which uses a batch processing optimization not used in normal inference. The processed prompt can also be cached so you only need to process it again if you change it. Normal inference benefits from faster RAM.
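A toy numpy illustration of why the batched prompt pass behaves so differently from token-at-a-time decoding: multiplying many prompt rows against the weights amortizes each weight read, while a single decode step streams the whole matrix to produce one row (the sizes and the single-matmul "layer" are made up for the sketch):

```python
import time
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)          # one stand-in "layer" of weights
prompt = np.random.randn(512, d).astype(np.float32)   # 512 prompt tokens processed as a batch
token  = np.random.randn(1, d).astype(np.float32)     # one token during normal inference

t0 = time.perf_counter(); _ = prompt @ W; t1 = time.perf_counter()
t2 = time.perf_counter(); _ = token @ W;  t3 = time.perf_counter()
print(f"batched prompt: {(t1 - t0) / 512 * 1e6:8.1f} us/token")
print(f"single decode:  {(t3 - t2) * 1e6:8.1f} us/token")
```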