I was at a place that used sharding, so the data was scattered across 128 databa...

Jupe · on Dec 22, 2022

IMO...

If the query is for OLAP the data may need to be extracted to another data store.

If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

paulmd · on Dec 22, 2022

> If the query is for OLTP, then the design is wrong. I don't know your problem space, but pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

well, that's the basic idea of microservices lol. forget living on a different shard, lots of times your data is going to round-trip to JSON and back a couple times and then be manually joined in some backend/service layer, or in graphql!

one bad abstraction I see a lot from microservice teams (that don't really understand it past the high-level concept) is "every table is a service", or "every minimal set of tables and its codeset is a service" and that's exactly how that ends up. Microservices really ought to be chunky enough to do their business without ending up calling 27 different services under the hood just to do simple operations. Obviously there is a point where it's too chunky, but too micro is also bad too.

maratc · on Jan 1, 2023

> pulling data from 128 shards to resolve queries while a user is waiting is just a really bad idea.

To the contrary, pulling data from 128 shards can be done in parallel, and about 127 of them don't have any data to return.

paulmd · on Dec 22, 2022

Yes. This is a super common problem with no-sql engines - we ran into something similar with SOLR when objects are not flattened (eg @JsonUnwrapped annotation). Child objects are stored as separate documents with a join... but if the child object is not stored in the same [file-]block then predicate-scans for the parent may not encounter the child object that causes predicate satisfaction. This breaks deep pagination and some other abstractions.

To me this really is the fundamental distinction for no-sql vs RDBMS. If your data model involves lots of joins... it's RDBMS even if you're using mongo or some other document store under the hood. ideally you will be storing some large analytical document that contains a lot of details about the thing, rather than just treating it as "rows as a document".

the thing about JOINs breaking across blocks/shards is one thing, and it's ultimately something you can work around for a lot of data (again, flatten with @JsonUnwrapped for example) but if you find yourself reaching for joins, your data is relational, or at least your representation is relational.

higeorge13 · on Dec 22, 2022

If you had to do joins on different shards, then you have implemented sharding wrong.

jandrewrogers · on Dec 22, 2022

This is to some extent an implementation limitation, not a theoretical one.

Typical SQL databases support neither the data organization nor parallel orchestration features required to support these types of JOINs well. The practical issue is that you can't add these features to an existing database kernel architecture if it was not designed to make this feasible from day one, and people are rightly reluctant to design a new SQL database kernel architecture from scratch so that these features are available. SQL databases are trapped in a local minima.