
Overly specific LLM research into KV cache eviction.

The vast majority of tokens in a sequence are irrelevant to the attention mechanism outside of a very small window. Right now, however, we tend to either keep every KV cache entry forever or dump entries wholesale once they hit a certain age.
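For concreteness, the two baselines look roughly like this (a toy sketch; the function names and the window size are mine, not from any particular framework):

    from collections import deque

    def keep_all(cache: list, key, value):
        # Never evict: memory grows linearly with sequence length.
        cache.append((key, value))

    def sliding_window(cache: deque, key, value, window: int = 512):
        # Evict purely by age: anything older than `window` tokens is dropped,
        # no matter how relevant it might still be to future attention.
        cache.append((key, value))
        while len(cache) > window:
            cache.popleft()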

My theory is that you can train a model to look at the key vectors and, from that information alone, work out how long to keep each token in the cache. Results so far look promising, and it's easy to add after the fact without retraining the core model itself.
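Roughly the shape of the idea, as a minimal PyTorch sketch (the predictor architecture, names, and eviction logic here are illustrative placeholders, not the actual implementation):

    import torch
    import torch.nn as nn

    class KeyTTLPredictor(nn.Module):
        """Hypothetical head mapping each key vector to a time-to-live in tokens."""
        def __init__(self, head_dim: int, max_ttl: int = 4096):
            super().__init__()
            self.max_ttl = max_ttl
            self.mlp = nn.Sequential(
                nn.Linear(head_dim, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, keys: torch.Tensor) -> torch.Tensor:
            # keys: (num_tokens, head_dim) -> predicted TTL per token
            return self.max_ttl * torch.sigmoid(self.mlp(keys)).squeeze(-1)

    def evict(keys, values, birth_step, ttl, current_step):
        # Keep only cache entries whose predicted lifetime has not yet elapsed.
        alive = (current_step - birth_step) <= ttl
        return keys[alive], values[alive], birth_step[alive], ttl[alive]

The predictor only sees the keys already being cached, so it can sit alongside a frozen base model and be trained separately.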




