
Is it wherever the tokens are, or is it the first N tokens they've seen before? I.e., if my prompt is 99% the same except for the first token, will it be cached?




The prefix has to be stable. If you are 99% the same but the first token is different it won't cache at all. You end up having to design your prompts to accommodate this.
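
To make the prefix rule concrete, here is a minimal sketch of how a prefix match behaves. It is a simplified model, not any provider's actual implementation: the function name `cached_prefix_len` and the word-level "tokens" are made up for illustration, and real caches may also have minimum lengths or block granularity.

```python
def cached_prefix_len(previous, current):
    """Count the leading tokens that are identical in both sequences.

    Simplified model: only this shared leading run could be served
    from a prefix cache; everything after the first mismatch is recomputed.
    """
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


# Hypothetical word-level "tokens", for illustration only.
previous = ["You", "are", "a", "helpful", "assistant", ".", "Hi"]
last_changed = ["You", "are", "a", "helpful", "assistant", ".", "Bye"]
first_changed = ["We", "are", "a", "helpful", "assistant", ".", "Hi"]

print(cached_prefix_len(previous, last_changed))   # 6
print(cached_prefix_len(previous, first_changed))  # 0
```

Both variants differ from the original by a single token, but only the one that keeps the start intact shares any prefix. That is why stable content (system prompt, tool definitions) usually goes first and the parts that change go last.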

which is important to bear in mind if people are introducing a "drop earliest messages" sliding window for context management in a "chat-like" experience. once you're at that context limit and start dropping the earliest messages, you're guaranteeing every message afterwards will be a cache miss.

a simple alternative approach is to introduce hysteresis by having both a high and low context limit. if you hit the higher limit, trim to the lower. this batches together the cache misses.
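
A rough sketch of the high/low limit idea, assuming messages are plain strings and that the caller supplies a `count_tokens` function. The name `trim_with_hysteresis` and both limit parameters are hypothetical, not from any library.

```python
def trim_with_hysteresis(messages, count_tokens, high_limit, low_limit):
    """Drop the oldest messages only when the high limit is exceeded.

    Below the high limit the list is returned untouched, so the prompt
    prefix stays stable. Above it, the oldest messages are dropped until
    the total is at or below the low limit, which leaves headroom for
    several more turns before the next trim.
    """
    sizes = [count_tokens(m) for m in messages]
    total = sum(sizes)
    if total <= high_limit:
        return messages  # prefix unchanged

    start = 0
    while start < len(messages) and total > low_limit:
        total -= sizes[start]
        start += 1
    return messages[start:]


# Toy example: every message "costs" 10 tokens (len of the string).
history = [c * 10 for c in "abcdef"]
print(trim_with_hysteresis(history[:5], len, high_limit=50, low_limit=30))
print(trim_with_hysteresis(history, len, high_limit=50, low_limit=30))
```

With five messages (50 tokens) nothing is dropped. The sixth pushes the total to 60, so the list is cut to the last three (30 tokens). The next couple of turns then extend the same prefix instead of shifting it on every message.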

if users are able to edit, remove or re-generate earlier messages, you can improve on that further by keeping track of cache prefixes and their TTLs: rather than blindly trimming to the lower limit, you trim to the longest active cache prefix, and only fall back to the lower limit if there is none.
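
One possible reading of that, sketched below. Everything here is an assumption made for illustration: the cache registry is modelled as a list of `(prefix_messages, expires_at)` pairs that the application maintains itself, `trim_to_cached_prefix` is a made-up name, and "longest" is measured in tokens of the matching leading messages.

```python
import time


def shared_prefix_len(a, b):
    """Number of leading items that are equal in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def trim_to_cached_prefix(messages, count_tokens, high_limit, low_limit,
                          cache, now=None):
    """Trim so the result starts with the longest unexpired cached prefix.

    cache: list of (prefix_messages, expires_at) pairs (hypothetical
    bookkeeping kept by the application, not a provider API).
    Falls back to the plain low-limit trim when no live prefix matches.
    """
    now = time.time() if now is None else now
    sizes = [count_tokens(m) for m in messages]
    total = sum(sizes)
    if total <= high_limit:
        return messages  # nothing to trim

    live = [list(p) for p, expires_at in cache if expires_at > now]

    best_start, best_hit = None, 0
    remaining = total
    for start in range(len(messages)):
        if remaining <= high_limit:  # this window fits in context
            window = messages[start:]
            for prefix in live:
                k = shared_prefix_len(window, prefix)
                hit = sum(sizes[start:start + k])  # cached tokens reused
                if hit > best_hit:
                    best_start, best_hit = start, hit
        remaining -= sizes[start]

    if best_start is not None:
        return messages[best_start:]

    # No live cache prefix: plain trim down to the low limit.
    start = 0
    while start < len(messages) and total > low_limit:
        total -= sizes[start]
        start += 1
    return messages[start:]


history = [c * 10 for c in "abcdef"]       # 10 "tokens" per message
cache = [(history[2:4], 100.0)]            # prefix "c", "d" expires at t=100
print(trim_to_cached_prefix(history, len, 50, 30, cache, now=50.0))
print(trim_to_cached_prefix(history, len, 50, 30, cache, now=200.0))
```

While the cached prefix is still live (t=50), the trim keeps four messages starting at the cached "c", "d" pair. Once it has expired (t=200), the function falls back to the low limit and keeps only the last three.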


That's what I thought, thanks Simon.


