Billion-Scale Approximate Nearest Neighbor Search [pdf]

ianbutler · on May 6, 2023

Nice this will get you acquainted with many of the methods used currently, as my small addition to the list of things you can learn. https://ai.googleblog.com/2020/07/announcing-scann-efficient... https://github.com/google-research/google-research/tree/mast...

Google released SCaNN a few years ago and back then it wasn't implemented in many tools like Milvus etc. Maybe it is now, it's been a few years since I needed a scalable vector db. I believe it's still state of the art, and they use a novel approach to quantization, which was the key intuition.

I didn't see it included in the PDF from a quick scan (hah!)

bravura · on May 6, 2023

I remember SCaNN. Are there easy to use implementations available now?

You also seem unsure.

Personally, I love a cheat-sheet like in Slide 7 of the presentation. It's a game-tree for what ANN implementation to use, given your problem setting and the assumption you don't want to do groundbreaking ANN research; rather, you have an ANN problem you need to solve on the path to your actual problem. Maybe that's why SCaNN wasn't cited?

A quick google reveals that SCaNN is available on pypi from Google: https://pypi.org/project/scann/ Do any users have any feedback or experiences to share?

[edit: GPT4, comparing hnwslib and SCaNN: "Note that SCaNN recently ceased to be supported by Google Research, and there won't be any further improvements, bug fixes, or documentation updates from them. However, you can still use and get support from the community. This recent change in support could shift preference towards hnswlib for future applications."]

jhj · on May 6, 2023

SCaNN (the paper) is roughly two different things:

1. a SIMD-optimized form of product quantization (PQ), where code to distance lookup can be performed in SIMD registers

2. anisotropic quantization to bias the database towards returning better maximum inner product search (MIPS) candidates, versus usual quantization (such as PQ) that aims to minimize compressed vector reconstruction error. In MIPS it is much more likely that the query data set may be of a completely different distribution than the database vectors.

If your application needs L2 lookup rather than MIPS which does not admit a metric, then only the PQ part is relevant. For cosine similarity, you can get that by normalizing all vectors to the surface of a hypersphere, in which case it has the same order as L2 (see "L2 normalized Euclidean distance" on https://en.wikipedia.org/wiki/Cosine_similarity ).

SCaNN is implemented in Faiss CPU, on the GPU the fast PQ part is less relevant due to the greater register set size and throughput (rather than latency) optimized nature of the hardware, but the GPU version is more geared towards batch lookup in any case.

https://github.com/facebookresearch/faiss/wiki/Fast-accumula...

https://github.com/facebookresearch/faiss/wiki/Indexing-1M-v...

We have not found the anisotropic quantization part to be that useful for large datasets, but results may vary. Graph-based ANN techniques tend to be better in many cases for these small datasets than IVF based strategies.

(I'm the author of GPU Faiss)

chuckcode · on May 6, 2023

Thanks for details! Few follow up questions:

- I've seen neural nets using int8 for matrix multiplication to reduce memory size [1]. Do you think something similar could be useful in the ANN space?

- Do you know of any studies using Faiss looking at speed/cost tradeoffs of RAM vs flash vs Disk for storage?

- Are there recommended ways to update Faiss index with streaming data, e.g. updating the vectors continuously?

Seems like more and more use cases for Faiss as neural nets become more and more core to workflows. Would like to try and figure out the configurations that are optimized to minimize carbon usage in addition to latency and recall metrics.

[1] https://arxiv.org/abs/2208.07339

(edit for formatting)

jhj · on May 6, 2023

Regarding reduced precision, depending upon what you are trying to do, I think it doesn't work quite as well in similarity search as it does for, say, neural networks.

If you are concerned about recall of the true nearest neighbor (k=1), in many datasets I've seen (especially large ones) in float32 the distance from a query vector to candidate nearest neighbor vectors may only differ by some thousands of ULPs when performing brute force search, which if done in float16 would result in the true nearest neighbor being the same as (or perhaps behind even, due to rounding error) other proximate vectors. If you are performing approximate lookup and you have the luxury of performing reranking (you store the compressed / approximate index for lookup, but return a larger candidate set like k=100 or k=1000 and refine the results based on true distances computed from the uncompressed vectors via brute-force, so you have to keep all the original vectors around) then this problem can go away.

If however you are looking at recall@100 (is the true nearest neighbor reported within the top k=100) or set intersection (of the k=100 approximate nearest neighbors, how much overlap is there with the true set of the 100 nearest neighbors), then this doesn't matter as much.

Certainly a lot of the options in the Faiss library are geared towards compression and quantization anyways (e.g., storing a billion high-dimensional vector dataset in 8/16 GB of memory) so this is a tradeoff as with everything else.

In Faiss there are the scalar quantizer indexes which do store vectors in int4/int8 etc for which int8 GEMM (accumulate in int32) would be great, but using this would require that the query vector itself be quantized to int4/int8 as well. This is the difference between "asymmetric distance computation" (ADC) where you compute the distance between int4 encoded vectors (or product quantized encoded vectors, etc) versus a float32 query vector, where we reconstruct the database vectors in floating point and compare in floating point, versus symmetric distance computation (you have to convert the query vector to int4 say, and compare in the quantized regime). ADC tends to work a lot better than symmetric computation, so this is why we don't use pure int8 GEMM, but maybe in many applications (NN inference, say, instead of image database search) the non-ADC comparison would be ok.

chuckcode · on May 8, 2023

Thanks for the very helpful and detailed reply!

sgt101 · on May 6, 2023

> I remember SCaNN. Are there easy to use implementations available now?

Yes, on gcp in vertex AI.

GPT4 is talking utter horseshit.

https://cloud.google.com/vertex-ai/docs/matching-engine/ann-...

ianbutler · on May 6, 2023

GPT4 is only trained on data up to 2021, and the library has commits as recent as two months ago. So I think that may be a hallucination. A quick google also reveals nothing like that.

It's been a while but I also remember HSNW having undesirable characteristics for large scale vector search (compared to other methods), so I'm not sure there.

I did do a quick search and saw there are methods that make it more suitable, but again it's been 3 years now since I've used anything in this area.

Edit: I got curious, https://ann-benchmarks.com/

So I was mis remembering a bit, HSNW is pretty good overall depending on your implementation!

fzliu · on May 6, 2023

SCaNN is coming to Milvus very soon (we've seen some demand from the community). Stay tuned.

eternalban · on May 6, 2023

both are 2020.

victor106 · on May 6, 2023

Being in software development for the past 15 years I cant understand any of this. What are some good resource to get up to speed on this stuff?

WinLychee · on May 6, 2023

This series of articles from Pinecone seems accessible https://www.pinecone.io/learn/vector-indexes/ . Stepping back a moment, the problem "I want to find the top-k most similar items to a given item from a collection" comes up in a variety of domains, and a common approach is nearest-neighbor search. Depending on the number of dimensions and the number of items, you may need tricks to speed up the search. In recent years we've seen an increase in scale and invention of some new tricks to speed this up.

In a bit more detail the idea is to convert your data into vectors embedded in some d-dimensional space and compute the distance between a query vector and all other vectors in the space. This is an O(dN) computation. ANN covers a number of techniques giving a speedup by letting you reduce number of required comparisons.

visarga · on May 7, 2023

Trust me it is not that complicated, if you spend 1-2 hours on this you will walk away with a good grasp. But if you try the same for neural nets, it's impossible.

Once you understand the dot product operation, which is high-school level math, can be written as a for loop with a multiply and add, it all falls from it. How do you speed up millions of dot products? Try a tree, clusters, hashing, inverted indices, GPUs...

For a few million vectors it's a solved problem since 5 years ago. I was using annoy in 2017, the library is 10 years old, first commit on Feb 20, 2013. It's as old as the deep learning age.

That's why I was curious why all this recent discussion about it, it's certainly a discovery phase for many people when they realise such a tool exists, made possible by the availability of efficient embedding models and needed by LLMs for retrieval.

tysam_and · on May 6, 2023

High dimensional nearest neighbor search scales very poorly with dimensions and/or number of examples IIRC, so approximate methods are best for time/value tradeoffs.

Oftentimes the applications don't require a perfect match anyways, just something 'close enough'.

Because things live in high dimensions, and they're often related to each other, this results in these vast 'sparse' gaps where no data really generally lives.

Hence allowing for good speed of approximate methods, and a good tradeoff of information-per-FLOP when compared to the full scale search itself.

esafak · on May 6, 2023

It crops up in machine learning.

https://en.wikipedia.org/wiki/Nearest_neighbor_search

shoo · on May 6, 2023

see also - the section titled "Distance function" from https://en.wikipedia.org/wiki/Curse_of_dimensionality

> One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius r and dimension d, to that of a hypercube with edges of length 2 r . [...] As the dimension d of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube

E.g. in two dimensions the ratio area of the circle of radius r=1 to the area of a square of side length 2r is pi / 4 = approx 0.785. As the number of dimensions becomes large, this ratio drops toward zero.

There's also some discussion of how using distance functions to measure distances from reference points become less useful as the number of dimensions of the space increases -- all distances becoming similar -- although

> research has shown this to only hold in the artificial scenario when the one-dimensional distributions are independent and identically distributed.

eachro · on May 6, 2023

I see HN pieces on SIMD optimizations for numerical computing every so often. Are these the sort of optimizations that a hobbyist with say a normal laptop (either a macbook air or consumer grade thinkpad) can actually end up tinkering with as well? Or am I going to have to rent some AWS machine and have it run on some specific architecture to be able to mess around with?

boulos · on May 6, 2023

Absolutely. Most development work for these things is just fine on a laptop.

Each processor family has its own supported SIMD instructions though. For a long time, SSE / AVX / AVX2 were the only game in town on x86. AVX-512 introduced in Skylake was primarily available only on server parts, and was a mixed bag, and didn't have support on AMD machines until just recently (with Genoa / Zen4).

A modern Mac laptop with an M-series chip will have ARM NEON vector instructions, just like on an iPhone or similar. There is a new-ish ARM vector instruction set called Scalable Vector Extensions (SVE) that Apple doesn't support, but Graviton 3 does. But generally improvements on your laptop often translate onto the server side, and you can definitely test correctness.

There are a few least-common denominator SIMD libraries out there, but realistically for many problems, you'll get the most mileage from writing some intrinsics yourself for your target platform.

w10-1 · on May 7, 2023

The easiest way for a beginner to do SIMD numerical computing now is with Java.

Java's Panama project is working on their 6th preview (JEP-450) for the September 21 (LTS) release, but the released Java 20 previews JEP-426 features, including

- basic math

- transcendental functions (sin, cos...)

- load/store to foreign memory e.g., to interop with C

- compress/expand lanes, for parallel selection or filtering

- bit-wise ops (with count, reverse, compress, expand)

- all translated on Intel to SSE/AVX or on ARM to NEON/SVE

See e.g.,

- <https://jbaker.io/2022/06/09/vectors-in-java/>

- <https://openjdk.org/jeps/426>

- <https://openjdk.org/jeps/448>

- <https://openjdk.org/projects/panama/>

perfmode · on May 6, 2023

I use optimized SIMD instructions on iOS to do on-device linear algebra for the cooking app i work on as a hobby.

https://theflavor.app/

manx · on May 6, 2023

I only see a landing page with missing images (on mobile). How do I use it?

perfmode · on May 7, 2023

here’s a TestFlight beta testing link

https://testflight.apple.com/join/WYT0giJd

zamnos · on May 7, 2023

If you want to dig around with the lower level implementation of using SIMD instructions, you might need to drop to assembly to really get into the guts of things just to see it in action. But if you just want your code to take advantage of it, you can just use Numpy in Python, which will probe what the CPU supports and optimize for that. SIMD is not really new, the first modern opcode to do SIMD was MMX, introduced in Intel in 1997, but the idea has been around since the 70's. So your laptop should at least support that command, and a whole host of others.

https://numpy.org/doc/stable/reference/simd/index.html

kzrdude · on May 6, 2023

This is a Rust crate doc, but it's a nice place to browse for different cpu features and what they unlock: https://docs.rs/target-features/latest/target_features/docs/...

If you go to the x86 cpu list it will tell you what features are enabled if you would optimize for a particular one https://docs.rs/target-features/latest/target_features/docs/...

On linux you can run lscpu to check what cpu features are detected.

nwoli · on May 6, 2023

Basically all CPUs have simd nowadays. You can even make use of it in wasm/the web. (Though currently not supported on iOS.) Just compile with ‘emcc -msimd128 main.c’

lmeyerov · on May 6, 2023

Yes, regular computers have had SIMD for decades. Ex: Intel was SSE and now AVX. Fancier servers will have the same, except maybe wider and some other tricks.

If you go down the vectorization path, unless you are a crufty CPU DB vendor, I'd skip ahead to GPU for the same. Ecosystem is built to do the same but faster and easier. CPU makers are slowly changing to look more like GPUs, so just skip ahead...

sebasv_ · on May 6, 2023

If you're asking how widely available SIMD is, it has been common in consumer hardware for 2 decades. To perform SIMD instructions manually you will need a compiled language that supports it, like Rust or C. But the compiler can actually implement it for you as an optimization.

nwoli · on May 6, 2023

Spotify uses random hyperplanes to produce a hash where each bit is the sign of the dot product of the query and the hyperplanes (ie “is above or below” the hyperplane). Works really well

AndrewKemendo · on May 6, 2023

Is there a blog post or something on this?

fancy_pantser · on May 7, 2023

There's at least a repo: https://github.com/spotify/annoy

AndrewKemendo · on May 7, 2023

Thank you!

gregw134 · on May 7, 2023

I think that's just a variation on locality sensitive hashing. You can find lots of articles online.

ChrisPzilla · on May 9, 2023

Here is an article on Indexes https://zilliz.com/blog/vector-index

nwoli · on May 10, 2023

It’s wrong though, annoy is not tree based

convexstrictly · on May 6, 2023

Video: https://www.youtube.com/watch?v=iI8e3kU11eU

visarga · on May 7, 2023

Very good post, I was just searching yesterday to pick an ANN algorithm and was disappointed with the super-busy graphs from ann-benchmarks. It is a hard decision to take - indexing speed, lookup speed, memory consumption, GPU access, multiprocessing - they all matter differently from case to case.

swaraj · on May 6, 2023

Is this different from pinecone?

beoberha · on May 6, 2023

From what I understand, Pinecone is a wrapper around a lot of this. It utilizes these libraries under the hood (or their own implementations) to store and perform search on your vectors for you. All in a nice managed PaaS offering.