CityOfThrowaway on Oct 25, 2023 | on: Embeddings: What they are and why they matter

I'm also curious about how to think about chunking for Doc-to-Doc similarity. Say I have an input document that I want to use as my search query and a database of potential result documents. It's not clear to me that I should simply embed both the input document and each of the result documents. If the documents contain a variety of ideas, I'd be nervous they wind up getting embedded into some generic space. But I don't have any good ideas as to what type of chunk-match-and-aggregate strategy might work well here. Would love ideas from folks that have done stuff like this!

Swizec on Oct 25, 2023

> If the documents contain a variety of ideas

I've found chunking by sub-headings works really well.

But here's a dirty secret I've learned as a writer: Your document can only ever contain 1 idea. That's the most that human readers can manage.
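
For anyone who wants to try the sub-heading approach, here is a rough sketch of one possible chunk-match-and-aggregate setup. Nothing here comes from the thread itself: the Markdown-style `#` headings, the function names (`chunk_by_subheadings`, `toy_embed`, `doc_similarity`), and the "mean of best matches" aggregation are all assumptions made for illustration. `toy_embed` is just a bag-of-words stand-in so the example runs without a model; you would swap in a real embedding function.

```python
import math
import re
from collections import Counter

# Assumption: documents use Markdown-style headings ("#", "##", ...).
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_subheadings(text):
    """Split text into (heading, body) chunks at each heading line.

    Text before the first heading gets an empty heading; headings with
    no body text under them are dropped.
    """
    chunks, heading, body = [], "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def doc_similarity(query_doc, candidate_doc, embed=toy_embed):
    """Mean, over query chunks, of the best cosine match in the candidate."""
    q = [embed(body) for _, body in chunk_by_subheadings(query_doc)]
    c = [embed(body) for _, body in chunk_by_subheadings(candidate_doc)]
    if not q or not c:
        return 0.0
    return sum(max(cosine(qv, cv) for cv in c) for qv in q) / len(q)
```

To rank a database, you would call `doc_similarity(query_doc, candidate)` for each candidate and sort by score. Taking the max per query chunk means each idea in the query only needs one good match in the candidate, which is one way to avoid the "generic space" problem of embedding the whole document at once. Whether mean-of-max is the right aggregation for your data is something you'd have to test.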