Hacker News
cs702 on April 19, 2023 | on: ChatPDF – Chat with Any PDF
Consider adding a bit of overlap to the text chunks. Say, 300 characters (CharacterTextSplitter measures length in characters by default, not tokens):
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
Otherwise, you'll likely end up with too many edge cases in which only part of a relevant context is retrieved :-)
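To make the effect of the overlap concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is an illustration only, not LangChain's implementation (which splits on a separator and merges the pieces); the function name `split_with_overlap` and its behavior are assumptions made for this example.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Split text into character chunks; consecutive chunks share chunk_overlap characters.

    Hypothetical helper for illustration; lengths are measured in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With chunk_size=1000 and chunk_overlap=300, each chunk starts 700 characters after the previous one, so a passage that straddles a chunk boundary is more likely to appear whole in at least one chunk.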
jcutrell on April 19, 2023
This is actually pretty insightful. I have done something similar, splitting my Obsidian data into chunks using paragraphs and headers as demarcation, but this solves a more interesting problem of nuance! I like it.
summarity on April 19, 2023
If you're interested in improved chunking, I mentioned a few strategies in my talk here (timestamp linked, <1min): https://youtu.be/elNrRU12xRc?t=536 that I used when building https://findsight.ai
cs702 on April 19, 2023
If you're already splitting documents by paragraph, consider using (as much as possible of) the previous and next paragraphs as overlap.
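One way this suggestion could look in code, as a rough sketch: each paragraph becomes a chunk, padded with the tail of the previous paragraph and the head of the next one, up to a size budget. The function name, the `max_chars` budget, the even split of the budget between the two neighbors, and the blank-line paragraph delimiter are all assumptions made for this example.

```python
def paragraphs_with_neighbor_overlap(text: str, max_chars: int = 1000) -> list[str]:
    """Return one chunk per paragraph, padded with neighboring paragraphs as overlap.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines,
    and the budget ignores the separators added when joining the parts.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, current in enumerate(paragraphs):
        # Characters left over for overlap once the paragraph itself is included.
        budget = max(max_chars - len(current), 0)
        prev_budget = budget // 2
        next_budget = budget - prev_budget

        prev_par = paragraphs[i - 1] if i > 0 else ""
        next_par = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""

        # Keep the END of the previous paragraph and the START of the next one,
        # since those are the parts adjacent to the current paragraph.
        prev_part = prev_par[-prev_budget:] if prev_budget > 0 else ""
        next_part = next_par[:next_budget]

        parts = [part for part in (prev_part, current, next_part) if part]
        chunks.append("\n\n".join(parts))
    return chunks
```

A paragraph longer than `max_chars` is kept whole and simply gets no overlap; a real pipeline would probably split such paragraphs further.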
sergiotapia on April 19, 2023
We did chunks with a sliding window of previous page + current page + next page, with overlaps. That produced the best results.
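A sliding window like the one described might be sketched as follows. The comment does not give implementation details, so the function name `page_windows`, the newline join, and the handling of the first and last page are assumptions for illustration.

```python
def page_windows(pages: list[str]) -> list[str]:
    """Build one chunk per page: previous page + current page + next page.

    Hypothetical helper; the first and last pages get a shorter window
    because they have only one neighbor.
    """
    windows = []
    for i in range(len(pages)):
        lo = max(i - 1, 0)            # previous page, if there is one
        hi = min(i + 2, len(pages))   # slice end is exclusive, so i + 2 includes the next page
        windows.append("\n".join(pages[lo:hi]))
    return windows
```

Consecutive windows share two pages, so text near a page break always appears together with its surrounding context in at least one chunk, at the cost of indexing roughly three times as much text.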