Rockset – Serverless search and analytics engine (rockset.com)
111 points by headalgorithm on April 11, 2019 | 30 comments



I have a medium-data database storing events (timestamps + some metadata, ~200gb). It currently lives on an Aurora postgres cluster. If this were rock-solid, my application would be the perfect fit, I think. I have two major concerns:

1. I'd originally started on a managed solution (https://getconnect.io/). They promised scalability and reliability, but we were forced to move away when the queries would take upwards of thirty seconds, inserts would start failing, and we'd receive 503s. I estimate that my database (which, at the time, was almost two orders of magnitude smaller) was among their largest. Why should I trust Rockset with my data?

2. I can't find anything about performance, benchmarks, or any other information about how the service will behave in production. InfluxDB is an example of something that does a great job of this: their docs outline what Influx is good at, what it's not good at, what will make your queries slow, etc. Instead, Rockset has a how-to guide for building a FB messenger chat bot with CSVs. From their docs:

> Rockset has a cloud-scale architecture. It can scale both compute and storage independent of one another. One one hand, you can have a small data set served by zillions of compute in parallel to make queries faster. On the other hand, you can have petabyte size data sets served by a small number of compute nodes. And, of course, you can have the entire spectrum in-between these two scenarios.

What does that even _mean_?

Sorry, Rockset, I'm going to need more than zillions of compute to convince me to move my business to you.


Hi bastawhiz,

Performance numbers are coming; please look out for them in a future blog post. Concrete details about the architecture are included in https://rockset.com/Rockset_Concepts_Design_Architecture.pdf. If you think we could add value beyond what you get with Aurora PG, I'd welcome the chance to do an evaluation on a 100% free trial (no credit card required), where you could test the performance for yourself with your own data. Totally respect the skepticism. Please reach me directly at anirudh at rockset.com if you'd like to chat further.


I see that the pricing model has changed from flat-rate to per-GB. Much more interesting to work with now, although still on the high end (but nice that there's no further charge for queries). The previous comparison to a Postgres instance using JSONB still stands.

How do query times scale with data size? Also, is there any full-text search (other than regex)? That would make it more compelling.


Full-text search related functions can be found here: https://docs.rockset.com/text-search-functions/
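
For a flavor, here is a rough sketch of issuing a query with a regex predicate over the REST API; the dedicated text-search functions live at the page above. The endpoint path, auth header format, and the collection/field names below are all illustrative assumptions, not verified details:

    import requests

    API_KEY = "..."  # your Rockset API key
    API_SERVER = "https://api.rs2.usw2.rockset.com"  # assumed region endpoint

    sql = """
    SELECT _id, title
    FROM commons.articles                 -- hypothetical workspace.collection
    WHERE REGEXP_LIKE(title, 'rocks?et')  -- regex matching, per the parent question
    LIMIT 10
    """

    resp = requests.post(
        f"{API_SERVER}/v1/orgs/self/queries",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json={"sql": {"query": sql}},
    )
    resp.raise_for_status()
    for row in resp.json().get("results", []):
        print(row)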

As the data grows, we manage shards under the hood to ensure that the data is spread across more nodes, which in turn lets us use more parallelism. Query performance and data size can be managed independently of each other in our architecture. Look out for numbers in a future blog post.


and 100KB/s is the max ingest rate?


That's just the default for streaming input. We do work with users and can increase it if the use case demands it. For bulk ingest from sources like S3, that limit does not apply, and ingest typically runs at many MB/s.
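
For scale, a quick back-of-the-envelope (the 5 MB/s bulk figure below is purely illustrative):

    seconds_per_day = 86_400
    streaming_bps = 100 * 1024      # default 100 KB/s streaming cap
    bulk_bps = 5 * 1024 * 1024      # hypothetical bulk-ingest rate of 5 MB/s

    print(f"streaming: {streaming_bps * seconds_per_day / 1024**3:.1f} GiB/day")  # ~8.2 GiB/day
    print(f"bulk:      {bulk_bps * seconds_per_day / 1024**3:.1f} GiB/day")       # ~421.9 GiB/day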

(I work on the product team at Rockset.)


I think this is really cool! It's really nice to use a standards-compliant persistent file format; I think a lot of companies have their own persist implementations that make the data visible only at the SQL or REST layer.

I'm wondering:

- Would it be possible to add certain guarantees about performance characteristics for different file formats? Parquet and column-oriented stores operate a good deal differently from CSV and row-oriented stores. Would you have to scan the binary?

- Can you combine different persist types? How do the performance characteristics change?

- What do you do about unclean data and disjoint data sets? Does somebody else have to clean them? What happens if somebody "corrupts" data (say, replaces a CSV delimiter type in-place while Rockset is running)?

- Is there an extensions API available (e.g. SQL over Google Spreadsheets and CSV on AWS S3, both through Zapier)? That could deliver a big value-add, since if your data can be colocated, more efficient approaches and alternatives can be applied.

This is neat!


Hi yingw787, I work on the product team at Rockset. Thanks for your thoughts! I'll try and answer your questions below.

- The different file formats get indexed and turned into a Rockset-specific format, which ensures that irrespective of the file type you get excellent performance for your SQL queries. This also means you can JOIN data from different sources (containing files in different formats) using SQL, irrespective of the source formats.

- Depending on the complexity of the SQL queries, latency can range from low tens of milliseconds to a few seconds. Since we index ALL the fields in several ways, if we're able to use our indexes to accelerate the query (which is almost always the case), it will likely be in the 10-200 millisecond range for a wide range of analytical queries. Look out for some numbers in the future.

- Data cleaning is something we facilitate through our delete/update records API, which lets you mutate the index and remove or update the records you consider to contain bad data (see the sketch at the end of this comment). Since Rockset supports schemaless ingest (https://rockset.com/blog/from-schemaless-ingest-to-smart-sch...), error documents don't really break anything, and you can work around them by writing a query that ignores them. We are interested in providing visibility into the data so that you can quickly detect issues and fix them.

- Rockset has a REST API, clients in different programming languages (https://docs.rockset.com/rest-api/), and integrations with visualization tools like Tableau (https://docs.rockset.com/tableau/). Can you elaborate on what you mean by colocating data and the extensions API?
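
As a rough sketch of the delete flow mentioned above, where the endpoint shape and all names are assumptions based on the linked REST docs rather than exact details:

    import requests

    API_KEY = "..."
    API_SERVER = "https://api.rs2.usw2.rockset.com"  # assumed region endpoint
    WORKSPACE, COLLECTION = "commons", "events"      # hypothetical names

    # Remove two records judged to be bad data, addressed by their _id fields.
    resp = requests.delete(
        f"{API_SERVER}/v1/orgs/self/ws/{WORKSPACE}/collections/{COLLECTION}/docs",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json={"data": [{"_id": "doc-123"}, {"_id": "doc-456"}]},
    )
    resp.raise_for_status()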


My impression of most databases is that locating the data physically close together (i.e. an internal network connection ties the database nodes together) lets you bake assumptions into performance optimizations (e.g. based on internal testing, tail latency between nodes at a given percentile is X milliseconds, or the network will only fail X% of requests, so that factor can be optimized for in source). If your data is disparate and located elsewhere, it may be more difficult to bake in such assumptions (e.g. requests across the public Internet may fail more often) and more difficult to achieve performance, so the value-add from a product like Rockset would be tying together disparate data sources. But I just read your comment that the data is transformed into a Rockset-specific format, so it might matter less in that case, because you do have a persist filesystem of your own.

For the extensions API, I was imagining something like postgresql-contrib: https://www.postgresql.org/docs/current/contrib.html

In Rockset's case, I thought that if the data came from multiple locations, extension requests might take that as a top-level assumption; hence the idea of a Rockset extension for something like Zapier, where multiple Internet services are tied together into automation pipelines (or, in Rockset's case, read/write query pipelines).

I just thought of this now, but the client interface for a database like PostgreSQL is useful enough that other databases like CockroachDB implement it too: https://www.cockroachlabs.com/blog/why-postgres/

Hope this helps :)


Calling a hosted database "serverless" is the most brazen branding I have seen in a long time. For extra hilarity, their pricing page says "pricing is inclusive of cloud hardware".


hi, this is Venkat from Rockset.

Good feedback. We thought about the different ways to frame the value prop, and "serverless" is what resonated the most with us because: 1/ you can load data, process queries, and build apps/dashboards without ever thinking about servers -- so, no provisioning or capacity planning required; 2/ you only pay for the amount of data actually loaded and indexed -- so, no idle servers costing you $$$s.

If you have a better suggestion that feels more accurate, please share it and we will definitely consider it.

Touché on the "cloud hardware" bit. We will fix that soon.


Hey Venkat! Thanks for replying in good humor.

Now that you explain your reasoning a bit, and upon re-reading https://en.wikipedia.org/wiki/Serverless_computing, I think using "serverless" in this context makes sense. I see "serverless" used so much more often to describe compute runtimes like AWS Lambda than databases that, I confess, I thought you might be trying to ride that wave's popularity, and/or using "serverless" _just_ because the servers were managed by you rather than by users, whereas you actually allocate capacity at a more granular level than the server.

I do still recommend you take out the "cloud hardware" bit ;D

Thanks for the explanation, and best of luck! Cool model.


thanks.

'cloud hardware' was definitely LOL worthy. ... brb after i go fix it :)


Based on my initial read of your website, it looks like you are in the same space as Elasticsearch and LucidWorks, although you don't seem to have non-SQL text search capabilities. It would be interesting to see a performance comparison between the three using SQL. I could see some customers wanting a SQL-focused solution if there are performance gains to be had.


[this is Venkat from Rockset]

Your assessment here is spot on @itronitron

Our schemaless data ingest + automatic indexing definitely draws a lot of inspiration from search-based systems such as ES and Solr. And yes, the biggest difference here is that Rockset offers

1/ full-featured SQL (with fast JOINs, aggregations, sorts, etc.) on such semi-structured data sets, and that

2/ it is built from the ground up to exploit cloud economics and scale (which is why we are able to offer this as a serverless data management system).
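
To make 1/ concrete, here is the kind of query meant -- a JOIN plus aggregation over semi-structured collections. Collection and field names are made up, and the dot-notation access to nested fields is an assumption:

    sql = """
    SELECT u.country, COUNT(*) AS purchases, SUM(e.payload.amount) AS revenue
    FROM commons.events e
    JOIN commons.users u ON e.user_id = u._id   -- join across two ingested sources
    WHERE e.type = 'purchase'
    GROUP BY u.country
    ORDER BY revenue DESC
    """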


Honestly, this pricing scheme confuses the hell out of me.

Questions that initially pop up:

- What do I do if I need more QPS?

- What do I do if I need to ingest more data?

Why are these two values coupled to the cost of the data stored somehow?


[this is Venkat from Rockset]

Our goal is to make the default experience simple, so that you get enough compute to build most real-world apps and dashboards. We still give you flexibility in case you want to purchase additional compute for ingest or queries. We will make this clearer on our pricing page -- thanks for the feedback.

> What do I do if I need more QPS?

Barring extreme workloads (say, 1 million QPS on 1 GB of data), for which we are not a good fit anyway, we auto-scale enough compute to handle the QPS needs of most real-world applications. As I mentioned earlier, if you want to break out of the standard compute allocation, we do offer the ability to purchase additional compute, but in our experience this is seldom required.

> What do I do if I need to ingest more data?

Yes, you can purchase additional ingest bandwidth if you need a higher steady-state ingest capacity. Please note that the bandwidth limit only applies to real-time streaming ingest — for bulk ingest (for example, the first time a collection is created in Rockset sourced from Amazon S3), we try to build the indexes at much higher speeds, and we will keep working on making that really, really fast without any additional fees.
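
As a sketch of that bulk path -- creating a collection sourced from S3 -- where the endpoint, body shape, and all names are assumptions based on the REST docs (integration and credential setup omitted):

    import requests

    API_KEY = "..."
    API_SERVER = "https://api.rs2.usw2.rockset.com"  # assumed region endpoint

    resp = requests.post(
        f"{API_SERVER}/v1/orgs/self/ws/commons/collections",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json={
            "name": "events_from_s3",                     # hypothetical collection
            "sources": [{
                "integration_name": "my-s3-integration",  # hypothetical, created beforehand
                "s3": {"bucket": "my-bucket", "prefix": "events/"},
            }],
        },
    )
    resp.raise_for_status()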


Thank you for the explanation.

It would probably be good if you made all that clear on the pricing page, though (along with the extra cost I would actually incur).


IIRC this is by one of the creators of RocksDB.


Cool! Pricing seems a tad high, but I really like these types of products. When you take into account running a database, finding the data-transform or ingestion tools, and the reporting layer (I like Metabase), plus the maintenance, it starts to level out.


hi, this is Venkat from Rockset.

Yeah, we are fans of Metabase too, and we will soon add support for connecting Metabase to Rockset. We do have Redash [1] and Superset [2], which are also pretty good and open source.

[1] https://docs.rockset.com/redash/ [2] https://docs.rockset.com/apache-superset/


I'm curious how scalable this is. If it can't go beyond what you can load into memory, it doesn't seem that useful.

If you already know SQL, you could just as well use Pandas (the Python library) to load data from various sources and query it.
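
For example, a minimal Pandas version of that load-and-query loop (file and column names made up):

    import pandas as pd

    # Hypothetical events file with 'timestamp' and 'type' columns.
    events = pd.read_csv("events.csv", parse_dates=["timestamp"])

    # Equivalent of a filtered GROUP BY: purchases per day over the last week.
    recent = events[events["timestamp"] > events["timestamp"].max() - pd.Timedelta(days=7)]
    purchases = recent[recent["type"] == "purchase"]
    print(purchases.groupby(purchases["timestamp"].dt.date).size())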

Also: AWS Athena


Hi scribu, I'm Anirudh from the product team at Rockset.

The data is indexed onto SSDs in the cloud. The sweet spot is 10s of terabytes of data that you want to build a live application on top of.

From an architectural standpoint, we can scale even further. https://rockset.com/Rockset_Concepts_Design_Architecture.pdf


In terms of use cases, we see things differently from a product like Athena. We are focused on ETL-free, real-time analytics and applications that make use of the fact that we construct multiple indexes automatically behind the scenes to enable low-latency, scalable query serving.


1TB costs $6000/month at your pricing. 10s of TB is rather expensive. It's great if you can get those rates but I'm having trouble seeing how these numbers can work on large datasets.


You are right that it uses SSDs and indexes under the hood to power the SQL queries. There are volume discounts available at higher storage volumes in the 10s-of-TB range, so the cost wouldn't scale linearly.
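
Back-of-the-envelope with the parent's figure (the actual discount tiers aren't public, so this is only the undiscounted ceiling):

    rate_per_gb = 6.0                   # implied by the $6,000/TB-month figure above
    linear_10tb = rate_per_gb * 10_000  # 10 TB at the headline rate, no discount
    print(f"${linear_10tb:,.0f}/month before volume discounts")  # $60,000/month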

Rockset builds multiple indexes to enable low-latency SQL queries that can be served directly into applications. This might not make sense for, say, storing a lot of log data and querying it rarely; it makes more sense for data that needs to be actively queried with low latency.

(I work on product at Rockset)


If you are live querying 1TB of data in memory, $6000/month does not seem that crazy at all.


It's in memory now? The prior comment says SSDs.


I was curious about using PDFs as a source, but the docs have no info on it: https://docs.rockset.com/




