I think this is really cool! It's really nice to use a standards-compliant persistent file format; I think a lot of companies have their own persistence implementations that make the data visible only at the SQL or REST layer.
I'm wondering:
- Would it be possible to add certain guarantees around performance characteristics for different file formats? Parquet and column-oriented stores operate a good deal differently from CSV and row-oriented stores. Would you have to scan the binary?
- Can you combine different persistence types? How do the performance characteristics change?
- What do you do about unclean data and disjoint data sets? Does somebody else have to clean them? What happens if somebody "corrupts" data (say, replaces a CSV delimiter type in-place while Rockset is running)?
- Is there an extensions API available (e.g. SQL through Google Spreadsheets and CSV on AWS S3, both through Zapier)? That could deliver a big value-add, since if your data can be colocated, more efficient access methods and alternatives can be applied.
Hi yingw787, I work on the product team at Rockset. Thanks for your thoughts!
I'll try and answer your questions below.
- The different file formats get indexed and converted into a Rockset-specific format, which ensures that irrespective of the file type you get excellent performance for your SQL queries.
This also means you can JOIN data from different sources (containing files in different formats) using SQL, irrespective of the source formats (see the sketch after this list).
- Depending on the complexity of the SQL queries, the latency can range from low tens of milliseconds to a few seconds. Since we index ALL the fields in several ways,
if we're able to use our indices to accelerate the query (which is almost always the case), it will likely be in the 10-200 millisecond range for a wide range of analytical queries.
Look out for some numbers in the future.
- Data cleaning is something we facilitate through our delete/update records API, which lets you mutate the index and remove/update the records that you consider to contain bad data. Since Rockset supports schemaless ingest (https://rockset.com/blog/from-schemaless-ingest-to-smart-sch...), error documents don't really break anything, and you can work around them by writing a query that ignores them (a sketch follows below). We are interested in providing visibility into the data so that you can quickly detect issues with the data and fix them.
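To make the cross-format JOIN point concrete, here is a minimal sketch of what such a query could look like. The collection names (orders_csv, events_parquet) and fields are hypothetical stand-ins, not actual Rockset collections; the point is that once both sources are indexed, the query doesn't depend on the original file formats.

    -- Hypothetical collections: orders_csv ingested from CSV files,
    -- events_parquet ingested from Parquet files. Both are queried
    -- through the same SQL layer, so the source format doesn't change the query.
    SELECT
        o.order_id,
        o.customer_id,
        COUNT(e.event_id) AS page_views
    FROM
        orders_csv o
        JOIN events_parquet e ON o.customer_id = e.customer_id
    WHERE
        o.order_date >= '2019-01-01'
    GROUP BY
        o.order_id,
        o.customer_id;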
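And a sketch of the "query that ignores bad records" workaround mentioned in the last point, assuming a hypothetical collection named events where some documents arrived with a missing or malformed amount field:

    -- Hypothetical collection 'events'. Skip documents whose 'amount'
    -- field is absent so error documents don't skew the aggregate.
    SELECT
        customer_id,
        SUM(amount) AS total_amount
    FROM
        events
    WHERE
        amount IS NOT NULL
    GROUP BY
        customer_id;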
My impression of most databases is that locating the data physically close together (i.e. an internal network connection ties together database nodes) provides assumptions for performance optimization (e.g. based on internal testing, tail latency at a given percentile is X milliseconds between database nodes, or the network will only fail X% of requests, so we can optimize for that in source). If you have disparate data located elsewhere, it may be more difficult to bake in such assumptions (e.g. requests across the public Internet may fail more often) and harder to achieve the same performance, so the value-add from a product like Rockset would be to tie together disparate data sources. But I just read your comment that the data is transformed into a Rockset-specific format, so it might matter less in that case because you do have a persistent filesystem.
In Rockset's case, I thought it would make sense that if the data came from multiple locations, extension requests might take that as a top-level assumption; hence the idea of a Rockset extension for something like Zapier, where multiple Internet services are tied together into automation pipelines (or, in Rockset's case, read/write query pipelines).
I just thought of this now, but the client interface for a database like PostgreSQL is useful enough that other databases like CockroachDB implement it too: https://www.cockroachlabs.com/blog/why-postgres/
This is neat!