
> At least in most public tokenizers like o200k, addition in certain Unicode ranges commutes with addition in token space

This seems flawed. I mean, the author's statement here is literally true, but it's eliding a very important detail: LLMs do _not_ see token indexes. They have no idea what order the token embeddings are in. In fact, you can shuffle the embeddings and the LLM wouldn't care at all. And I highly suspect that if you shuffled the entire tokenizer, so that the above property no longer holds, and trained Claude from scratch on that tokenizer, it would still be able to perform this task.

> so all but one of these symbols is mapped to three tokens each, where the first two are the same and can be easily ignored by an attention head, and the third token increments exactly with the Unicode.

This is the crux, I believe.

In the general case, the common Unicode ranges (for Korean, Japanese, Chinese, etc) get tokenized just like English (for modern tokenizers at least).

It's only in the obscure Unicode ranges that you hit a special case of the tokenizer. This is the tokenizer's "backup plan". If it encounters text that doesn't directly map to a token in its dictionary, it falls back to encoding the text as UTF-8 bytes. Those UTF-8 bytes have a dedicated set of 256 tokens in its dictionary. So in those extreme cases, rather than getting bits of text like "Hell, o, Mr, ., B, ond", the LLM gets the raw UTF-8 bytes.
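To make the byte fallback concrete, here is a minimal sketch of what consecutive code points look like as raw UTF-8 bytes. The range starting at U+1400 is picked arbitrarily for illustration; whether any particular tokenizer actually falls back to bytes there is an assumption, and `utf8_bytes` is just a helper name made up for this example.

```python
def utf8_bytes(cp: int) -> list[int]:
    """Return the UTF-8 encoding of a single code point as a list of byte values."""
    return list(chr(cp).encode("utf-8"))

# Four consecutive code points from an arbitrary 3-byte range (illustrative only).
for cp in range(0x1400, 0x1404):
    print(f"U+{cp:04X}", [f"{b:02X}" for b in utf8_bytes(cp)])

# The last byte is a continuation byte (0x80..0xBF), so it wraps after 64 values
# and the wrap carries into the byte before it.
print([f"{b:02X}" for b in utf8_bytes(0x143F)])
print([f"{b:02X}" for b in utf8_bytes(0x1440)])
```

The first two bytes stay fixed while the third increments with the code point, which matches the "three tokens, first two the same" pattern in the quote.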

Now, again, the LLM can't directly see those bytes, their index in the tokenizer's dictionary, their integer values, etc, etc. It only sees their embedding vectors, which are unordered. So it has no _implicit_ knowledge about those bytes being ordered. Therefore the assertion that addition commutes between Unicode and token indices is irrelevant.
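A toy sketch of why token indices are invisible to the model, under the assumption that the only thing the model receives is the looked-up embedding vector. Everything here (the 8-token vocabulary, the 4-dimensional vectors, the `embed` helper) is made up for illustration. In a real model you would have to permute the output layer the same way.

```python
import random

random.seed(0)
vocab_size, dim = 8, 4

# Toy embedding table: row i holds the vector for token id i.
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up the embedding vector for each token id."""
    return [table[t] for t in token_ids]

# Renumber the vocabulary at random: perm[old_id] is the new id.
perm = list(range(vocab_size))
random.shuffle(perm)

# Move each row to its new position so the table agrees with the new ids.
shuffled = [None] * vocab_size
for old_id, new_id in enumerate(perm):
    shuffled[new_id] = embeddings[old_id]

tokens = [3, 4, 5]                      # consecutive ids under the old numbering
renumbered = [perm[t] for t in tokens]  # same tokens under the new numbering

# The vectors the model would see are identical either way.
same_vectors = embed(tokens, embeddings) == embed(renumbered, shuffled)
print(renumbered, same_vectors)
```

The ids 3, 4, 5 are no longer consecutive after renumbering, yet the model's input is unchanged, so any ordering has to be learned from data.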

My theory would be that the pretraining data contains lists of Unicode characters. Specifically, lists of Unicode characters in order. For the obscure ranges of Unicode, this results in the LLM seeing counting in UTF-8 bytes. It doesn't initially know what the "value" of each byte is, but it would naturally learn that so that it can correctly predict the next byte.

The same occurs for English letters. It doesn't start with any knowledge about what order they are in. It only learns the ordered alphabet through seeing examples.

(The inverse applies, of course, since the output is also unordered.)

Maybe this is a nitpick? But it seems important to me, because it's the difference between a rather simple mechanism:

output[i] = input[i] + 1

and a more complex mechanism:

c = to_utf8_byte_index(input[i])
c = c + 1
output[i] = from_utf8_byte_index(c)
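Here is a rough sketch of the difference between the two mechanisms, assuming a hypothetical vocabulary where the 256 byte tokens have been given shuffled ids. The tables `byte_to_id` and `id_to_byte` stand in for whatever the model would have to learn. The function names follow the pseudocode above and don't correspond to any real tokenizer API.

```python
import random

random.seed(1)

# Hypothetical byte-fallback vocabulary with shuffled ids: byte_to_id[b] is the
# token id assigned to byte value b.
byte_to_id = list(range(256))
random.shuffle(byte_to_id)
id_to_byte = {token_id: b for b, token_id in enumerate(byte_to_id)}

def to_utf8_byte_index(token_id: int) -> int:
    """Map a token id back to the byte value it represents."""
    return id_to_byte[token_id]

def from_utf8_byte_index(byte: int) -> int:
    """Map a byte value to its token id."""
    return byte_to_id[byte]

def successor_simple(token_id: int) -> int:
    # The "simple mechanism": add one directly in token-id space.
    return token_id + 1

def successor_via_bytes(token_id: int) -> int:
    # The "more complex mechanism": decode, increment, re-encode.
    # Carry into the previous byte is ignored, so this only handles bytes < 255.
    c = to_utf8_byte_index(token_id)
    c = c + 1
    return from_utf8_byte_index(c)

# Count how often each mechanism produces the token for the next byte.
simple_hits = sum(successor_simple(byte_to_id[b]) == byte_to_id[b + 1] for b in range(255))
lookup_hits = sum(successor_via_bytes(byte_to_id[b]) == byte_to_id[b + 1] for b in range(255))
print(simple_hits, lookup_hits)
```

With shuffled ids the direct "+1" is right only by coincidence, while the lookup version is right for every byte it handles. That's the sense in which the ordering has to live in learned mappings rather than in the ids.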

Also it's important because I'd suspect the LLM will see a _lot_ of UTF-8 counting. There are about a million Unicode "characters", the vast majority of which won't have direct token mappings. So as a rough estimate, a single complete listing of Unicode gives it a list of pure byte counting that is 1 million lines long. That's about 3900 complete cycles of the least significant byte if you count 256 values per cycle, and more in practice, since the last byte of a multi-byte UTF-8 sequence only ranges over 64 values (0x80 to 0xBF). Just from one listing.

In contrast, it's not going to encounter a lot of listings of, say, the Korean unicode range in unicode order (about 11k points). Each time it does, it gets to see exactly 1 complete cycle.

So a single listing of Unicode gives it 3900 examples of how to cycle one byte VS a single listing of an "alphabet" giving it only 1 example.



You're completely right, my argument is fundamentally wrong because it relies on the commutativity, but the embedding matrix obviously does not treat some columns differently than others. Back to the drawing board I suppose. Thanks!



