> Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds
If someone has a code example to this effect, I'd be grateful.
I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".
I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after running into obstacles generating the data.
It would be nice to have an example like this to show developers (and engineers) who have become accustomed to today's extreme levels of CPU abuse that modern laptops really are supercomputers.
It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't has, in my view, a lot to do with the state of OS/Browser/App/etc. design & performance.
Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.
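Something along these lines is roughly what I had in mind: a rough sketch using DuckDB from Python, assuming the rows are generated on the fly by the engine rather than loaded from disk (which sidesteps the data-generation problem I hit with SQLite); the modulo/bucket scheme is just made up for illustration:

```python
import time
import duckdb

con = duckdb.connect()  # in-memory database

# Filter and aggregate two billion synthetic rows in a single query.
# range() produces the rows lazily, so nothing has to be materialised up front.
start = time.time()
rows = con.sql("""
    SELECT (range % 100) AS bucket, count(*) AS n, sum(range) AS total
    FROM range(2000000000)
    WHERE range % 7 = 0          -- the filter
    GROUP BY bucket              -- the aggregation
    ORDER BY bucket
""").fetchall()
print(f"{len(rows)} groups in {time.time() - start:.2f}s")
```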
The speed comes partly from the raw speed of Arrow, but also from a 'trick': if you apply a filter, it is pushed down to the raw parquet files, so thanks to the hive-style partitioning some files don't need to be read at all.
Another trick is that parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
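In Python, for instance, something like the following with pyarrow.dataset shows both tricks (the directory layout and column names are made up for illustration):

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Hypothetical hive-partitioned layout: data/year=2023/month=01/part-0.parquet, ...
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

# The filter is pushed down: files in non-matching partitions are never opened,
# and row groups in the remaining files can be skipped using the min/max
# statistics stored in the parquet metadata.
table = dataset.to_table(
    filter=(pc.field("year") == 2023) & (pc.field("amount") > 100),
    columns=["amount"],
)
print(pc.max(table["amount"]))
```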
I'm a Python user myself, but the code would be comparable on the Python side.
You can try the examples, or DataFusion with Flight. With that setup in Rust I have been able to process, in milliseconds, data that usually takes tens of seconds with a distributed query engine. I think Rust combined with Arrow, Flight and Parquet can be a game changer for analytics after a decade of Java with Hadoop & co.
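As a rough Python-side sketch of a plain local DataFusion query (the Flight transport is left out, and the file and column names are placeholders):

```python
from datafusion import SessionContext

ctx = SessionContext()

# Register a (hypothetical) parquet file as a table.
ctx.register_parquet("events", "events.parquet")

# Filters and projections are pushed down into the parquet scan.
df = ctx.sql("""
    SELECT user_id, count(*) AS n
    FROM events
    WHERE event_date >= DATE '2023-01-01'
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
print(df.to_pandas())
```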
Completely agree with this. Rust and Arrow will be part of the next set of toolsets for data engineering. Spark is great and I use it every day, but it's big and cumbersome to use. There are use-cases today that are being addressed by DataFusion, DuckDB and (to a certain extent) pandas, and those tools will continue to evolve; hopefully Ballista can mature to the point where it's a real Spark alternative for distributed computation. Spark isn't standing still of course, and we're already seeing a lot of different drop-in C++ SQL engines, but moving entirely away from the JVM would be a watershed, IMO.
DuckDB is outrageously useful. Great on its own, but it slots in perfectly, reading and handing back Arrow data frames, meaning you can seamlessly swap between tools when SQL suits some parts and other tools suit others. Also very fast. I was able to throw away designs for multi-machine setups, as DuckDB on its own was fast enough that I didn't need to worry about anything else.
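For example, a small sketch of that Arrow round-trip in Python (the table and column names are made up): hand DuckDB a pyarrow table, do the SQL part there, and get an Arrow table back for whatever tool comes next.

```python
import duckdb
import pyarrow as pa

# Some Arrow data produced by another tool (made-up example data).
orders = pa.table({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 25.0, 5.0, 12.5],
})

# DuckDB scans the pyarrow table in scope by name (no copy into the database),
# and .arrow() hands the result back as a pyarrow table for the next tool.
totals = duckdb.sql("""
    SELECT customer, sum(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").arrow()

print(totals)
```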