I'd like to know:

How language model-based tokenizers fare on domain-specific documents, given that language models have no context for unknown tokens.

Are language model-based tokenizers any better at identifying abbreviations than rule-based ones?
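To illustrate why abbreviations trip up rule-based approaches: a minimal sketch of a naive rule-based sentence splitter (a hypothetical example, not any particular library's implementation; real rule-based tokenizers layer abbreviation lists and heuristics on top of rules like this):

```python
import re

def naive_rule_based_sentences(text):
    # Naive rule: split on a period followed by whitespace and a capital letter.
    # This wrongly treats "Dr." as a sentence boundary.
    return re.split(r'(?<=\.)\s+(?=[A-Z])', text)

text = "Dr. Smith measured 3.5 mg. The assay used E. coli cultures."
print(naive_rule_based_sentences(text))
# Splits after "Dr." and after "mg.", but not inside "E. coli"
```

Here the abbreviation "Dr." is incorrectly split off as its own sentence, which is exactly the kind of case where one might hope a model-based tokenizer generalizes better.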


