Hacker News

Well you could store numbers all fine, but indexing vectors for similarity queries seems fairly recent and not all that widespread in the transactional world.

As traditional databases move forward in this space, the need for dedicated vector databases will likely shrink, except for some very specific implementations that offer unique enough features (e.g. Deep Lake does vector search over object storage, which is very convenient for certain specific scenarios).



How is indexing a vector different from indexing a varchar or an integer? If you convert a vector into a byte array, it should be no different from the byte array of a varchar, except for the contents.

Now if you want to do similarity search, you have to measure the distance between two or more vectors, and that's independent of the indexing, no?

So any database with sufficient memory should be able to accomplish this, as evidenced by the vector similarity search feature of Redis. (I don't know how the Redis folks have implemented vector similarity, but they do support KNN search.)
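For intuition, this is roughly what brute-force KNN boils down to: exact cosine-similarity search over an in-memory matrix of embeddings. This is a minimal sketch, not Redis's actual implementation; the names and sizes are illustrative.

```python
# Brute-force KNN sketch: exact cosine similarity over all stored vectors.
import numpy as np

def knn_brute_force(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k vectors most similar to `query` by cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = (vectors @ query) / norms
    return np.argsort(-sims)[:k]  # highest similarity first

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))          # 1000 embeddings, 64 dims (toy scale)
q = db[42] + 0.01 * rng.normal(size=64)   # a slightly noisy copy of row 42
print(knn_brute_force(db, q, 3)[0])       # → 42
```

This is O(n) per query, which is exactly why real engines add an index structure on top once n gets large.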


Mostly the number of dimensions. Assuming your vectors are float16, so 2 bytes per element, you'd run into Postgres's B-tree index tuple size limit (about 2704 bytes) very quickly. You could index a 512-dimension vector fine, but I believe most models are well beyond that.

There are alternative index types, of course, or you could index the hash of the vector. These both come with tradeoffs.
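To make the size limit concrete, here is the back-of-envelope arithmetic (the ~2704-byte figure is the default limit for 8 kB pages, and tuple header overhead is ignored; the 1536-dim float32 case is an illustrative common embedding size, not taken from the thread):

```python
# Index-tuple sizes vs. Postgres's default btree limit of ~2704 bytes.
BTREE_LIMIT = 2704

for dims, bytes_per_elem in [(512, 2), (512, 4), (1536, 4)]:
    size = dims * bytes_per_elem
    fits = "fits" if size <= BTREE_LIMIT else "too big"
    print(f"{dims} dims x {bytes_per_elem} bytes = {size} bytes: {fits}")
```

So a 512-dim float16 vector (1024 bytes) squeezes in, but a 1536-dim float32 vector (6144 bytes) is far over the limit.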


Btree isn't a very useful index type for a vector, though. GIN, GiST, and the handful of new extensions optimized for vector search are what you'd want (and they don't have this limitation).

Aside: you can increase the size of tuples you can index in a PostgreSQL btree by increasing the PostgreSQL page size (which requires a recompile and creating a new database instance).


Agreed on the first, but you have to know those exist first (and what they're good for). Which leads into my second point: IME, the Venn diagram of "people making AI stuff" and "people capable of compiling and running their own DB in a reliable manner" has no overlap.


Indexing in a vector engine is what gives you similarity search faster than brute force. The type of engine is what gives you various different distance measures (often approximate). Redis specifically has two choices: brute force (which is precise and slow), or HNSW (which is approximate and fast, but space-consuming for interesting dimensionality).
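Roughly, the two choices look like this in RediSearch's syntax (a sketch from memory — the field names, dimensions, and metric are illustrative, and you should check the docs for your Redis version):

```
FT.CREATE idx_flat SCHEMA vec VECTOR FLAT 6 TYPE FLOAT32 DIM 128 DISTANCE_METRIC COSINE
FT.CREATE idx_hnsw SCHEMA vec VECTOR HNSW 6 TYPE FLOAT32 DIM 128 DISTANCE_METRIC COSINE

FT.SEARCH idx_hnsw "*=>[KNN 5 @vec $blob]" PARAMS 2 blob "<128 float32 bytes>" DIALECT 2
```

FLAT gives you the exact brute-force scan; HNSW builds the graph structure that makes approximate queries fast at the cost of extra memory.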


The distance computation could be separate from the indexing, but it will be inefficient relative to having an index organized to support the task.


SQLite has R-trees, for instance [0]. Could it be good enough for most use cases? If it's to query a knowledge base, for instance, a couple of dimensions should be sufficient. With the added benefit of being able to query your data in other ways.

[0] https://www.sqlite.org/rtree.html


r*-trees don't work well when the number of dimensions stored in the index is much higher than the logarithm of the number of indexed entries, and this is a prevailing property of divide-and-conquer spatial index types that divide the keyspace on a single dimension at a time. As vectors regularly have 100+ dimensions, normal spatial indexing methods applied to vectors wouldn't be very efficient for anything with much fewer than 2^100 index entries, which is quite suboptimal for most datasets that you would want to have indexed.


Also the distance metric for r*-trees is just plain wacky for anything other than low-dimensional Euclidean space.

Even if you could make it perform well, it would not do what you want.


Are you saying this because r-trees expect a proper metric space, and people have the need to index datasets over non-metric spaces?


The curse of dimensionality creates a seemingly paradoxical situation where you have a vast vast search space, but everything is incredibly close to each other. Space subdivision algorithms become ineffective.
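The concentration effect is easy to demonstrate empirically. This toy experiment (uniform random points; sample sizes are arbitrary) measures the relative spread between the nearest and farthest point from a query — the quantity that space-subdivision indexes implicitly rely on:

```python
# Distance concentration: as dimensionality grows, the gap between the
# nearest and farthest neighbor shrinks relative to the nearest distance,
# which is what defeats space-subdivision indexes like r-trees.
import numpy as np

def distance_contrast(dims: int, n: int = 2000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n, dims))
    query = rng.uniform(size=dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()  # relative spread of distances

print(distance_contrast(2))    # low dims: huge spread, partitioning helps
print(distance_contrast(512))  # high dims: everything nearly equidistant
```

In 2 dimensions the contrast is large (the nearest point is orders of magnitude closer than the farthest); in 512 dimensions it collapses toward zero, so pruning subtrees by distance bounds stops discarding anything.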


Here is a SQLite extension that uses Faiss under the hood.

https://github.com/asg017/sqlite-vss

Not associated with the project, just love SQLite and find it very useful.


What is "vector search over object storage?" Does deeplake performs some computations on objects and search on their embeddings?


It stores everything on cheap storage with no compute attached (i.e. S3), and uses the client to compute the query embedding, retrieve the embedding index, and run an indexed search to identify the data to be retrieved; likewise, the client does the work of updating the index structure on writes.

The benefit is that you don't have to pay for the compute part of a database, and the storage layer is as cheap as it can be in the cloud.
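The pattern is simple enough to sketch. Here a local file stands in for the S3 object, and a plain matrix of embeddings stands in for Deep Lake's actual index format (which this is not — it's just the shape of the idea):

```python
# "No database compute" pattern: the index is just a file on dumb storage;
# the client fetches it and does the search itself.
import numpy as np

# --- write path: build the index and put it on cheap storage ---
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(500, 32)).astype(np.float32)
np.save("index.npy", embeddings)   # in practice: upload to object storage

# --- read path: client fetches the index and searches locally ---
index = np.load("index.npy")       # in practice: download from object storage
query = embeddings[7]              # pretend this came from an embedding model
dists = np.linalg.norm(index - query, axis=1)
best = int(np.argmin(dists))
print(best)  # → 7
```

All the CPU cycles happen on the client; the "database" is just bytes at rest.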


At the expense of latency? When in fact latency is the most important aspect of any search. Any idea how fast the searches from the client are?

> retrieve the embedding index and to run an indexed search to identify the data to be retrieved

Please bear with the layman-like questioning:

So if the data is {"obj": "obj1", "data": {"name": "atlas", "embedding": "1123124234"}}, what is an embedding index? Is it something like {"1123124234": "obj1"}?

From what I understand, the query will be "geography", whose embedding will be "12311111", and now you have to run a KNN search for a match, which will return {"name": "atlas", "embedding": "1123124234"}.

Not sure where the embedding index comes into play here.
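An embedding index isn't a reverse lookup table like {"1123124234": "obj1"}; it's a structure that lets a query skip most of the vectors instead of comparing against all of them. A toy inverted-file (IVF-style) sketch, with crudely chosen centroids standing in for real trained clusters:

```python
# Toy "embedding index": partition vectors into cells around centroids,
# then at query time scan only the cell nearest the query.
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(1000, 16)).astype(np.float32)

# Build: assign every vector to its nearest of k centroids (random picks
# here; real IVF indexes train the centroids with k-means).
k = 10
centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
cell_of = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
cells = {c: np.where(cell_of == c)[0] for c in range(k)}

# Query: pick the nearest centroid, then scan only that cell (~1/k of the data).
query = vectors[123]
cell = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
candidates = cells[cell]
best = int(candidates[np.argmin(np.linalg.norm(vectors[candidates] - query, axis=1))])
print(best)  # → 123
```

The KNN you describe still happens, but only over the candidates in one cell, which is what turns a full scan into an indexed search (at the cost of it being approximate near cell boundaries).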


Eh, sure, latency is suboptimal. But if you have an LLM in the mix, its latency will dominate the overall response time. At that point you might not care how performant your index is, and since performance/cost is nonlinear, it can translate into very significant savings.



