Only tangentially related to the topic, but regarding LLMs: they seem to occasionally mix up CJK vocabularies and also generate invalid UTF-8 sequences, apparently because CJK texts share overlapping code points (Han unification) and because the input is processed by a tokenizer that can split a multi-byte character. Are there any developments in that direction? Aren't CJK ideograms essentially tokens already?
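To illustrate the invalid UTF-8 part, here is a minimal sketch. The three-way byte split is a made-up stand-in for byte-level tokens, not any real tokenizer's vocabulary, and `is_valid_utf8` is just a helper name I picked for the example:

```python
# Each of these CJK ideographs takes three bytes in UTF-8.
text = "漢字"
data = text.encode("utf-8")
print(len(text), len(data))  # number of code points, number of bytes


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical byte-level "tokens" whose boundaries fall inside characters.
tokens = [data[:2], data[2:4], data[4:]]

# No single piece is valid UTF-8 on its own.
valid_each = [is_valid_utf8(t) for t in tokens]
print(valid_each)

# Emitting all pieces in order restores the text...
print(is_valid_utf8(b"".join(tokens)))

# ...but skipping one piece leaves a stray continuation byte behind.
print(is_valid_utf8(tokens[0] + tokens[2]))
```

So if a model works on byte-level pieces like these and emits them in the wrong combination, the output can stop being decodable even though every individual piece came from valid text.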