I think this is really cool! It's really nice to use a standards-compliant persistent file format; I think a lot of companies have their own persistence implementations that make the data visible only at the SQL or REST layer.
I'm wondering:
- Would it be possible to add certain guarantees around performance characteristics for different file formats? Parquet and column-oriented stores operate a good deal differently from CSV and row-oriented stores. Would you have to scan the binary?
- Can you combine different persistence types? How do the performance characteristics change?
- What do you do about unclean data and disjoint data sets? Does somebody else have to clean them? What happens if somebody "corrupts" data (say, replaces a CSV delimiter type in-place while Rockset is running)?
- Is there an extensions API available (e.g. SQL through Google Spreadsheets and CSV on AWS S3, both through Zapier)? That could deliver a big value-add, since if your data can be colocated, more efficient access methods and alternatives can be applied.
Hi yingw787, I work on the product team at Rockset. Thanks for your thoughts!
I'll try and answer your questions below.
- The different file formats get indexed and converted into a Rockset-specific format, which ensures that irrespective of the file type you get excellent performance for your SQL queries.
This also means you can JOIN data from different sources (containing files in different formats) using SQL, irrespective of the source formats (see the sketch after this list).
- Depending on the complexity of the SQL queries, the latency can range from low tens of milliseconds to a few seconds. Since we index ALL the fields in several ways,
if we're able to use our indices to accelerate the query (which is almost always the case), it will likely be in the 10-200 millisecond range for a wide range of analytical queries.
Look out for some numbers in the future.
- Data cleaning is something we facilitate through our delete/update records API, which lets you mutate the index and remove/update the records that you consider to contain bad data. Since Rockset supports schemaless ingest (https://rockset.com/blog/from-schemaless-ingest-to-smart-sch...), error documents don't really break anything, and you can work around them by writing a query that ignores them (a sketch follows below). We are interested in providing visibility into the data so that you can quickly detect issues with the data and fix them.
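To make the cross-format JOIN point concrete, here is a minimal sketch of what such a query could look like. The collection names (orders_csv, events_parquet) and fields are hypothetical stand-ins, not actual Rockset collections; the point is that once both sources are indexed, the query doesn't depend on the original file formats.

    -- Hypothetical collections: orders_csv ingested from CSV files,
    -- events_parquet ingested from Parquet files. Both are queried
    -- through the same SQL layer, so the source format doesn't change the query.
    SELECT
        o.order_id,
        o.customer_id,
        COUNT(e.event_id) AS page_views
    FROM
        orders_csv o
        JOIN events_parquet e ON o.customer_id = e.customer_id
    WHERE
        o.order_date >= '2019-01-01'
    GROUP BY
        o.order_id,
        o.customer_id;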
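And a sketch of the "query that ignores bad records" workaround mentioned in the last point, assuming a hypothetical collection named events where some documents arrived with a missing or malformed amount field:

    -- Hypothetical collection 'events'. Skip documents whose 'amount'
    -- field is absent so error documents don't skew the aggregate.
    SELECT
        customer_id,
        SUM(amount) AS total_amount
    FROM
        events
    WHERE
        amount IS NOT NULL
    GROUP BY
        customer_id;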
My impression of most databases is that locating the data physically close together (i.e. an internal network connection ties together database nodes) provides assumptions for performance optimization (e.g. based on internal testing, tail latency at a given percentile is X milliseconds between database nodes, or the network will only fail X% of requests, so we can optimize for that in source). If you have disparate data located elsewhere, it may be more difficult to bake in such assumptions (e.g. requests across the public Internet may fail more often) and harder to achieve the same performance, so the value-add from a product like Rockset would be to tie together disparate data sources. But I just read your comment that the data is transformed into a Rockset-specific format, so it might matter less in that case because you do have a persistent filesystem.
In Rockset's case, I thought it would make sense that if the data came from multiple locations, extension requests might take that as a top-level assumption; hence the idea of a Rockset extension for something like Zapier, where multiple Internet services are tied together into automation pipelines (or, in Rockset's case, read/write query pipelines).
I just thought of this now, but the client interface for a database like PostgreSQL is useful enough that other databases like CockroachDB implement it too: https://www.cockroachlabs.com/blog/why-postgres/
This is neat!