I'm also curious about how to think about chunking for Doc-to-Doc similarity.
Say I have an input document that I want to use as my search query and a database of potential results documents.
It's not clear to me that I should simply embed the input document and each of the result documents as single vectors. If a document contains a variety of ideas, I'd be nervous that its embedding winds up in some generic region of the space that doesn't represent any one of those ideas well.
But I don't have any good ideas as to what type of chunk-match-and-aggregate strategy might work well here.
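To make the question concrete, here is a minimal sketch of the shape I have in mind. It assumes each document has already been split into chunks and each chunk embedded (the vectors below are toy stand-ins, not real embeddings). The function names and the "best match per query chunk, then average" aggregation are just my own placeholder choices, not something I know to work well.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def doc_similarity(query_chunks, candidate_chunks):
    """Match each query chunk to its most similar candidate chunk, then average.

    Assumption: max-then-mean is only one possible aggregation.
    """
    if not query_chunks or not candidate_chunks:
        return 0.0
    best = [max(cosine(q, c) for c in candidate_chunks) for q in query_chunks]
    return sum(best) / len(best)

def rank(query_chunks, database):
    """Score every document in the database and sort from best to worst."""
    scores = {doc_id: doc_similarity(query_chunks, chunks)
              for doc_id, chunks in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy 2-d "embeddings": the query document covers two distinct ideas.
query = [[1.0, 0.0], [0.0, 1.0]]
database = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both ideas
    "doc_b": [[1.0, 0.0]],              # covers only the first idea
}
ranking = rank(query, database)
print(ranking)
```

Note that this score is asymmetric: it measures how well the candidate covers the query's chunks, not the other way around, and I'm not sure whether that is the right call.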
Would love ideas from folks that have done stuff like this!