Overly specific LLM research into KV cache eviction.
The vast majority of tokens in a sequence will be irrelevant to an attention mechanism outside of a very small window.
Right now, however, we tend either to keep all cache values forever or to dump them all once they hit a certain age.
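As a point of reference, the age-based policy amounts to a fixed sliding window over the cache. A minimal sketch (all names here are illustrative, not from any particular library):

```python
from collections import deque

# Age-based eviction: each layer's KV cache keeps only the most recent
# `window` tokens and silently drops anything older, regardless of whether
# that token would still matter to attention.
class SlidingWindowKVCache:
    def __init__(self, window: int):
        self.window = window
        self.keys = deque(maxlen=window)    # one key vector per token
        self.values = deque(maxlen=window)  # one value vector per token

    def append(self, key, value):
        # deque(maxlen=...) evicts the oldest entry automatically once full
        self.keys.append(key)
        self.values.append(value)

    def snapshot(self):
        # The keys/values attention would actually see at this step
        return list(self.keys), list(self.values)
```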
My theory is that you can train a model to look at the key vectors and, from that information alone, work out how long to keep the token in the cache. Results so far look promising, and it's easy to add after the fact without retraining the core model itself.
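A hedged sketch of how such a bolt-on predictor might look: a small head, trained separately from the frozen base model, reads each key vector and predicts how many future steps that token should stay in the cache. The names, sizes, and expiry rule below are assumptions for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn

# Assumed retention head: maps a key vector to a predicted keep-horizon
# (number of future decoding steps the token should remain cached).
class RetentionHead(nn.Module):
    def __init__(self, key_dim: int, max_keep: int = 4096):
        super().__init__()
        self.max_keep = max_keep
        self.mlp = nn.Sequential(
            nn.Linear(key_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (num_tokens, key_dim) -> keep-horizon per token, in [0, max_keep]
        return torch.sigmoid(self.mlp(keys)).squeeze(-1) * self.max_keep


class LearnedEvictionKVCache:
    def __init__(self, head: RetentionHead):
        self.head = head
        self.entries = []  # (key, value, expiry_step)

    @torch.no_grad()
    def append(self, key: torch.Tensor, value: torch.Tensor, step: int):
        # Predict this token's useful lifetime from its key vector alone
        horizon = self.head(key.unsqueeze(0)).item()
        self.entries.append((key, value, step + horizon))

    def evict(self, step: int):
        # Drop any entry whose predicted lifetime has elapsed
        self.entries = [e for e in self.entries if e[2] > step]
```

Because the head only reads key vectors that the base model already produces, it can sit alongside an existing cache without touching the model's weights.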