
Only tangentially related to the topic, and specific to LLMs: it seems LLMs occasionally mix up CJK vocabularies and also generate invalid UTF-8 sequences, apparently because CJK texts share overlapping code points (Han unification, presumably) and because the input is split up by the tokenizer. Are there any developments in that direction? Aren't CJK ideograms essentially tokens already?
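
To make the invalid-UTF-8 part concrete, here is a minimal sketch. It assumes a byte-level tokenizer whose tokens can end in the middle of a character; that is my assumption, not something I have verified for any particular model, and the helper names are made up for illustration. A single ideograph takes three bytes in UTF-8, so output that stops after two of them cannot be decoded:

```python
# Sketch only: simulates a token boundary falling inside a multi-byte character.
# No real tokenizer is used; the truncation below is a hypothetical stand-in.

def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a string."""
    return ch.encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+76F4 is a unified Han code point: Chinese and Japanese text share it,
# even though the glyph is conventionally drawn differently in each.
ideograph = "\u76f4"
encoded = utf8_bytes(ideograph)   # b'\xe7\x9b\xb4', three bytes
partial = encoded[:2]             # hypothetical: generation stops mid-character

print(len(encoded), is_valid_utf8(encoded), is_valid_utf8(partial))
# prints: 3 True False
```

The same code point also hints at the vocabulary mix-up: nothing in the bytes says which language the character belongs to, so that has to come from context.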

