Overly specific LLM research into KV cache eviction.
The vast majority of tokens in a sequence will be irrelevant to an attention mechanism outside of a very small window.
Right now, however, we tend either to keep all cache values forever or to dump them all once they hit a certain age.
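As a point of reference, the age-based policy amounts to a fixed sliding window over the cache. A minimal sketch (all names here are illustrative, not from any particular library):

```python
from collections import deque

# Age-based eviction: each layer's KV cache keeps only the most recent
# `window` tokens and silently drops anything older, regardless of whether
# that token would still matter to attention.
class SlidingWindowKVCache:
    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # one key vector per token
        self.values = deque(maxlen=window)  # one value vector per token

    def append(self, key, value):
        # deque(maxlen=...) evicts the oldest entry automatically once full
        self.keys.append(key)
        self.values.append(value)

    def snapshot(self):
        # The keys/values attention would actually see at this step
        return list(self.keys), list(self.values)
```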
My theory is that you can train a model to look at the key vectors and, from that information alone, work out how long to keep the token in the cache. Results so far look promising, and it's easy to add after the fact without retraining the core model itself.
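A hedged sketch of how such a bolt-on predictor might look: a small head, trained separately from the frozen base model, reads each key vector and predicts how many future steps that token should stay in the cache. The names, sizes, and expiry rule below are assumptions for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

# Assumed retention head: maps a key vector to a predicted keep-horizon
# (number of future decoding steps the token should remain cached).
class RetentionHead(nn.Module):
    def __init__(self, key_dim: int, max_keep: int = 4096):
        super().__init__()
        self.max_keep = max_keep
        self.mlp = nn.Sequential(
            nn.Linear(key_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (num_tokens, key_dim) -> keep-horizon per token, in [0, max_keep]
        return torch.sigmoid(self.mlp(keys)).squeeze(-1) * self.max_keep


class LearnedEvictionKVCache:
    def __init__(self, head: RetentionHead):
        self.head = head
        self.entries = []  # (key, value, expiry_step)

    @torch.no_grad()
    def append(self, key: torch.Tensor, value: torch.Tensor, step: int):
        # Predict this token's useful lifetime from its key vector alone
        horizon = self.head(key.unsqueeze(0)).item()
        self.entries.append((key, value, step + horizon))

    def evict(self, step: int):
        # Drop any entry whose predicted lifetime has elapsed
        self.entries = [e for e in self.entries if e[2] > step]
```

Because the head only reads key vectors that the base model already produces, it can sit alongside an existing cache without touching the model's weights.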