
I'm also curious about how to think about chunking for Doc-to-Doc similarity.

Say I have an input document that I want to use as my search query and a database of potential results documents.

It's not clear to me that I should simply embed both the input document and each of the results documents. If the documents contain a variety of ideas, I'd be nervous they wind up getting embedded into some generic region of the embedding space.

But I don't have any good ideas as to what type of chunk-match-and-aggregate strategy might work well here.

Would love ideas from folks that have done stuff like this!



> If the documents contain a variety of ideas

I've found chunking by sub-headings works really well.
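For illustration, here is a minimal sketch of what splitting on sub-headings could look like. It assumes the documents use markdown-style `#` headings, which the thread doesn't specify, and the function name `chunk_by_subheadings` is made up for this example. Each resulting chunk would then be embedded on its own rather than embedding the whole document at once.

```python
import re

# Assumption: headings look like "# Title" or "## Subtitle" (markdown style).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_subheadings(text):
    """Split text into (heading, body) chunks, one per heading.

    Text before the first heading is returned with an empty heading.
    """
    chunks = []
    heading, body = "", []

    def flush():
        # Keep a chunk only if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks

doc = "Intro text\n\n# Setup\nInstall it.\n\n## Usage\nRun it.\n"
for heading, body in chunk_by_subheadings(doc):
    print(repr(heading), "->", repr(body))
```

Keeping the heading alongside the body means it can be prepended to the chunk text before embedding, which may help short sections carry some context.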

But here's a dirty secret I've learned as a writer: Your document can only ever contain 1 idea. That's the most that human readers can manage.



