Sorry, I actually misread part of your comment in relation to the paper and confused δ with another parameter, K.
To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.
So, if you are asking why they would limit _K_, then my answer still applies (with K in place of δ). But if you really do mean "why do they pick some arbitrary δ as the limit on the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".
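For concreteness, here's a minimal sketch of what I mean by the decision-problem framing: fix a corpus, a vocabulary budget K, and a token budget δ, and ask whether *any* vocabulary of size at most K tokenizes the corpus into at most δ tokens. This is just my own illustration (greedy longest-match tokenization plus brute-force search over candidate vocabularies), not the construction from either paper:

```python
from itertools import combinations

def tokenize_greedy(text, vocab):
    """Greedy longest-match tokenization of `text` with vocabulary `vocab`."""
    vocab_sorted = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((v for v in vocab_sorted if text.startswith(v, i)), None)
        if match is None:
            return None  # this vocabulary cannot cover the text at all
        tokens.append(match)
        i += len(match)
    return tokens

def is_yes_instance(corpus, candidates, K, delta):
    """Decision version: does some vocabulary of size <= K (drawn from
    `candidates`) tokenize `corpus` into <= delta tokens?"""
    for size in range(1, K + 1):
        for vocab in combinations(candidates, size):
            tokens = tokenize_greedy(corpus, vocab)
            if tokens is not None and len(tokens) <= delta:
                return True  # a witness vocabulary exists
    return False

# Example: with K = 1 and delta = 3, the vocabulary {"ab"} tokenizes
# "ababab" into ["ab", "ab", "ab"], so this is a yes-instance.
print(is_yes_instance("ababab", ["a", "b", "ab", "ba"], K=1, delta=3))
```

Phrasing it this way (with δ as an explicit budget) is what lets you talk about yes/no instances at all, which is the standard setup for hardness results; the optimization version ("minimize the number of tokens given K") is recovered by searching over δ.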
Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.
The original paper that proposes the metric is, imo, much more interesting theoretically [2].