
Consider adding a bit of overlap to the text chunks. Say, 300 tokens:

  from langchain.text_splitter import CharacterTextSplitter

  text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
Otherwise, you'll likely end up with too many edge cases in which only part of a relevant context is retrieved :-)
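To make the effect concrete, here's a minimal character-level sketch of what overlap does, without LangChain (the function name and sizes are my own, for illustration):

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=300):
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one, so context
    that straddles a boundary appears whole in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# e.g. chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# yields ["abcd", "cdef", "efgh", "ghij"]
```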



This is actually pretty insightful - I've done something similar, splitting my Obsidian data into chunks using paragraphs and headers as demarcation, but this solves a subtler problem. I like it.
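Splitting on headers can be done with a short regex pass; this is a sketch of the idea for markdown-style notes (the function name is hypothetical):

```python
import re

def split_markdown_sections(text):
    """Split a markdown note at headers, keeping each header
    together with the body text that follows it."""
    # Zero-width split just before any line starting with 1-6 '#'s
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]
```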


If you're interested in improved chunking, I cover a few strategies in my talk here (timestamp linked, <1 min): https://youtu.be/elNrRU12xRc?t=536 - strategies I used when building https://findsight.ai


If you're already splitting documents by paragraph, consider using (as much as possible of) the previous and next paragraphs as overlap.
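That strategy can be sketched as follows - one chunk per paragraph, padded with the tail of the previous paragraph and the head of the next, capped at some budget (names and the 300-character cap are my own, for illustration):

```python
def paragraph_chunks_with_context(paragraphs, max_overlap=300):
    """For each paragraph, prepend up to `max_overlap` trailing characters
    of the previous paragraph and append up to `max_overlap` leading
    characters of the next one."""
    chunks = []
    for i, para in enumerate(paragraphs):
        prev_tail = paragraphs[i - 1][-max_overlap:] if i > 0 else ""
        next_head = paragraphs[i + 1][:max_overlap] if i < len(paragraphs) - 1 else ""
        chunks.append("\n\n".join(p for p in (prev_tail, para, next_head) if p))
    return chunks
```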


We did chunks with a sliding window of previous page + current page + next page, with overlaps. That produced the best results.
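A page-level sliding window like that is a one-liner once the document is split into pages (the function name is my own; this is the shape of the idea, not the poster's exact code):

```python
def page_window_chunks(pages):
    """One chunk per page: previous page + current page + next page,
    so each chunk overlaps its neighbours by a full page on each side."""
    return ["\n".join(pages[max(i - 1, 0):i + 2]) for i in range(len(pages))]
```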





