Interesting, it's actually worse than GPT-4's 100k tokenizer by quite a bit despite being over twice the size, and only marginally better than Llama's 32k. At least for the few random English, Latin-script articles I tried, anyway, but Llama and Gemma are English-only models, so there's no point in testing anything else.
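If you want to sanity-check this yourself, here's roughly the kind of comparison I mean, just counting how many tokens each tokenizer needs for the same text (a minimal sketch assuming you have tiktoken and transformers installed plus access to the gated Gemma/Llama checkpoints; the model IDs are the usual HF ones, swap in whatever you actually have):

```python
import tiktoken
from transformers import AutoTokenizer

# Any English article will do; fewer tokens for the same text = better compression.
text = open("sample_article.txt").read()

gpt4 = tiktoken.get_encoding("cl100k_base")                          # GPT-4's ~100k vocab
gemma = AutoTokenizer.from_pretrained("google/gemma-7b")             # ~256k vocab
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")    # 32k vocab

for name, n_tokens in [
    ("GPT-4 (cl100k)", len(gpt4.encode(text))),
    ("Gemma (256k)", len(gemma.encode(text, add_special_tokens=False))),
    ("Llama (32k)", len(llama.encode(text, add_special_tokens=False))),
]:
    print(f"{name}: {n_tokens} tokens")
```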
Doesn't seem like a well-made tokenizer at first glance, or it's heavily biased towards languages the model can't even generate coherently, lol. If they really wanted it to be SOTA at something, they could've at least made it the first truly multilingual open-source model, but that's apparently more effort than the lame skin-colour-oriented virtue signalling Google wants to do.