
Consider adding a bit of overlap to the text chunks. Say, 300 tokens:

  from langchain.text_splitter import CharacterTextSplitter

  text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
Otherwise, you'll likely end up with too many edge cases in which only part of a relevant context is retrieved :-)
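To make the effect concrete, here's a minimal character-level sketch of what overlap does, without LangChain (the function name and sizes are my own, for illustration):

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=300):
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one, so context
    that straddles a boundary appears whole in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# e.g. chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# yields ["abcd", "cdef", "efgh", "ghij"]
```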



This is actually pretty insightful - I've done something similar, splitting my Obsidian data into chunks using paragraphs and headers as demarcation, but this solves a subtler problem. I like it.
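Splitting on headers can be done with a short regex pass; this is a sketch of the idea for markdown-style notes (the function name is hypothetical):

```python
import re

def split_markdown_sections(text):
    """Split a markdown note at headers, keeping each header
    together with the body text that follows it."""
    # Zero-width split just before any line starting with 1-6 '#'s
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]
```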


If you're interested in improved chunking, I cover a few strategies in my talk here (timestamp linked, <1 min): https://youtu.be/elNrRU12xRc?t=536 - strategies I used when building https://findsight.ai


If you're already splitting documents by paragraph, consider using (as much as possible of) the previous and next paragraphs as overlap.
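That strategy can be sketched as follows - one chunk per paragraph, padded with the tail of the previous paragraph and the head of the next, capped at some budget (names and the 300-character cap are my own, for illustration):

```python
def paragraph_chunks_with_context(paragraphs, max_overlap=300):
    """For each paragraph, prepend up to `max_overlap` trailing characters
    of the previous paragraph and append up to `max_overlap` leading
    characters of the next one."""
    chunks = []
    for i, para in enumerate(paragraphs):
        prev_tail = paragraphs[i - 1][-max_overlap:] if i > 0 else ""
        next_head = paragraphs[i + 1][:max_overlap] if i < len(paragraphs) - 1 else ""
        chunks.append("\n\n".join(p for p in (prev_tail, para, next_head) if p))
    return chunks
```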


We did chunks with a sliding window of previous page + current page + next page, with overlaps. That produced the best results.
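A page-level sliding window like that is a one-liner once the document is split into pages (the function name is my own; this is the shape of the idea, not the poster's exact code):

```python
def page_window_chunks(pages):
    """One chunk per page: previous page + current page + next page,
    so each chunk overlaps its neighbours by a full page on each side."""
    return ["\n".join(pages[max(i - 1, 0):i + 2]) for i in range(len(pages))]
```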





