As a grad student (and an ADHDer), I had trouble doing literature reviews systematically. To combat this, I made a website that finds similar papers based on the meaning of what I am looking for.
I used MixedBread's embedding model [1] to generate vectors from the abstracts. I store and search the vectors with Milvus [2] and serve the frontend with Gradio [3]. I update the vector database weekly by pulling the arXiv metadata dataset from Kaggle [4].
To speed up search on my free Oracle instance, I binarise the embeddings and use Hamming distance as the metric.
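For anyone curious what binarising plus Hamming distance looks like, here is a rough NumPy sketch. It is not the site's code: the zero threshold, the 1024-dimensional vectors, the random stand-in data, and the helper names `binarise` and `hamming_distances` are all assumptions for illustration. The site itself uses Milvus for the search.

```python
import numpy as np


def binarise(embeddings: np.ndarray) -> np.ndarray:
    """Turn float embeddings into packed bits (assumed rule: 1 where a component is > 0)."""
    bits = (embeddings > 0).astype(np.uint8)
    # Pack 8 bits per byte, so a 1024-dim vector becomes 128 bytes.
    return np.packbits(bits, axis=-1)


def hamming_distances(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Count the differing bits between one packed query and every packed corpus row."""
    xor = np.bitwise_xor(corpus, query)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)


# Random stand-in data; real vectors would come from the embedding model.
rng = np.random.default_rng(0)
corpus_float = rng.standard_normal((1000, 1024)).astype(np.float32)
# A query that is a slightly noisy copy of row 42.
query_float = corpus_float[42] + 0.1 * rng.standard_normal(1024).astype(np.float32)

corpus_bits = binarise(corpus_float)
query_bits = binarise(query_float)

distances = hamming_distances(query_bits, corpus_bits)
top5 = np.argsort(distances)[:5]  # indices of the 5 closest rows
```

With float32 inputs, each 1024-dim vector shrinks from 4096 bytes to 128 bytes, and comparing two vectors is just an XOR and a bit count.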
I would love your feedback on the site :)
Happy Holidays!
[1]: https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-...
[2]: https://milvus.io/
[3]: https://www.gradio.app/
[4]: https://www.kaggle.com/datasets/Cornell-University/arxiv
2. how much efficiency gain did you see binarising embeddings/using hamming distance?
3. why milvus over other vector stores?
4. did you automate the weekly metadata pull? just a simple cron job? anything else you need orchestrated?
user thoughts: searching for "transformers on byte level not token level" was good but didn't turn up https://arxiv.org/abs/2412.09871, which is more recent and which more people might want
also you might want more result density - so perhaps a UI option to collapse the abstracts and display more at first glance.