Sorry, I actually misread part of your comment in relation to the paper and confused δ with another parameter, K.
To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.
So, if you are asking why they would limit _K_, then my answer still applies (with K in place of δ). But if you really do mean "why do they pick some arbitrary δ as the limit on the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".
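For concreteness, here's a minimal sketch of what I mean by the decision-problem framing: fix a corpus, a vocabulary budget K, and a token budget δ, and ask whether *any* vocabulary of size at most K tokenizes the corpus into at most δ tokens. This is just my own illustration (greedy longest-match tokenization plus brute-force search over candidate vocabularies), not the construction from either paper:

```python
from itertools import combinations

def tokenize_greedy(text, vocab):
    """Greedy longest-match tokenization of `text` with vocabulary `vocab`."""
    vocab_sorted = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((v for v in vocab_sorted if text.startswith(v, i)), None)
        if match is None:
            return None  # this vocabulary cannot cover the text at all
        tokens.append(match)
        i += len(match)
    return tokens

def is_yes_instance(corpus, candidates, K, delta):
    """Decision version: does some vocabulary of size <= K (drawn from
    `candidates`) tokenize `corpus` into <= delta tokens?"""
    for size in range(1, K + 1):
        for vocab in combinations(candidates, size):
            tokens = tokenize_greedy(corpus, vocab)
            if tokens is not None and len(tokens) <= delta:
                return True  # a witness vocabulary exists
    return False

# Example: with K = 1 and delta = 3, the vocabulary {"ab"} tokenizes
# "ababab" into ["ab", "ab", "ab"], so this is a yes-instance.
print(is_yes_instance("ababab", ["a", "b", "ab", "ba"], K=1, delta=3))
```

Phrasing it this way (with δ as an explicit budget) is what lets you talk about yes/no instances at all, which is the standard setup for hardness results; the optimization version ("minimize the number of tokens given K") is recovered by searching over δ.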
Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.
The original paper that proposes the metric is, imo, much more interesting theoretically [2].