At Flickr, we worked closely with the Vespa team from 2011 through 2016 on a wide range of advancements:
* partial document refeeding (i.e. quickly indexing a new field across 20+ billion documents without refeeding everything, while staying online and handling 100M+ free-text queries a day)
* visual similarity search - check out the tensor ranking features [1] [2]
* online elasticity - add/remove replicas / shards online. A must when it could take weeks+ to re-feed from scratch. This is non-trivial to make work smoothly at scale.
* latency / tail-latency on complex queries: p90 reduced from 3,000 ms to 30 ms.
This is a major gift to the open-source community: a battle-tested search engine that works reliably, without babysitting, on very large datasets under simultaneous high query and high feed volumes. Huge debt of gratitude to the team in Trondheim and the Verizon/Oath/Yahoo legal & management teams for making this happen. :+1:
Not precisely sure where we were in 2011, but off the top of my head I think these are the biggest ones that came after (i.e. I'm sure to be missing something):
- Merging content and index clusters into one, making index clusters elastic and auto-recovering on data loss.
- Fully realtime writes.
- Support more advanced machine-learned ranking through tensors.
- Streaming (personal) search supporting a large write rate.
- Document references.
- WAND and RANK operators.
- Rank features over multivalue text fields.
- Predicate fields.
- Lots and lots of performance work.
Cluster file distribution with BitTorrent: https://github.com/vespa-engine/vespa/tree/master/filedistri...