> Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds

If someone has a code example to this effect, I'd be grateful.

I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".

I tried to create an SQLite example with a billion rows to show that this isn't impressive, but gave up after hitting obstacles generating the data.
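
For reference, here is roughly what I was attempting, sketched with DuckDB instead of SQLite; the column names, the file-backed database, and the row count are my own assumptions, not a benchmark I've actually run:

    import duckdb  # pip install duckdb

    # File-backed database so the table can be larger than RAM.
    con = duckdb.connect("billion.duckdb")

    # Generate a billion synthetic rows inside the engine itself;
    # range(n) is a built-in table function, so there is no CSV to wrangle.
    con.execute("""
        CREATE OR REPLACE TABLE events AS
        SELECT range AS id,
               range % 1000 AS group_id,
               random() AS value
        FROM range(1000000000)
    """)

    # Filter + aggregate across the full billion rows.
    print(con.execute("""
        SELECT group_id, count(*) AS n, avg(value) AS avg_value
        FROM events
        WHERE value > 0.5
        GROUP BY group_id
        ORDER BY n DESC
        LIMIT 5
    """).fetchall())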

It would be nice to have an example like this to show developers (and engineers), who have become accustomed to today's extreme levels of CPU abuse, that modern laptops really are supercomputers.

It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't obvious, in my view, has a lot to do with the state of OS/browser/app design & performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.




An example using R code is here: https://arrow.apache.org/docs/r/articles/dataset.html

The speed comes from the raw speed of Arrow, but also from a 'trick': if you apply a filter, it is pushed down to the raw Parquet files, so some of them don't need to be read at all thanks to the hive-style organisation.
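
In Python, that pushdown looks roughly like this (the hive-partitioned directory layout and the column names are invented here, just for illustration):

    import pyarrow.dataset as ds

    # Hive-partitioned directory, e.g. data/year=2021/month=01/part-0.parquet
    dataset = ds.dataset("data/", format="parquet", partitioning="hive")

    # The filter is applied while scanning: files and row groups whose
    # partition values or statistics cannot match are skipped entirely.
    table = dataset.to_table(
        columns=["user_id", "value"],
        filter=(ds.field("year") == 2021) & (ds.field("value") > 0),
    )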

Another trick is that parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
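
Those statistics are easy to poke at directly; for example (file name and column index are placeholders):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("part-0.parquet").metadata

    # Each row group stores per-column min/max, null counts, etc.
    for i in range(md.num_row_groups):
        stats = md.row_group(i).column(0).statistics
        if stats is not None and stats.has_min_max:
            print(i, stats.min, stats.max)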

I'm a Python user myself, but the code would be comparable on the Python side


You can see some of the benchmarks for DataFusion (part of the Arrow project, built with Arrow as the underlying in-memory format) here: https://github.com/apache/arrow-datafusion/blob/master/bench...

Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.


You can try the examples, or DataFusion with Flight. With that setup in Rust I have been able to process, in milliseconds, data that usually takes tens of seconds with a distributed query engine. I think Rust combined with Arrow, Flight, and Parquet can be a game changer for analytics after a decade of Java with Hadoop & co.


Completely agree with this. Rust and Arrow will be part of the next set of toolsets for data engineering. Spark is great and I use it every day, but it's big and cumbersome to use. There are use cases today that are being addressed by DataFusion, DuckDB, and (to a certain extent) pandas, and that will continue to evolve. Hopefully Ballista can mature to the point where it's a real Spark alternative for distributed computation. Spark isn't standing still of course, and we're already seeing a lot of different drop-in C++ SQL engines, but moving entirely away from the JVM would be a watershed, IMO.


ClickHouse and DuckDB are the databases I would look at; they support this use case pretty much "out of the box".

E.g. https://benchmark.clickhouse.com has some query times for a 100 million row dataset.


DuckDB is so simple to work with. It's only worth looking elsewhere for really big data, or where you really need a client-server setup.

I hope it receives more love.


DuckDB is outrageously useful. Great on its own, but it also slots in perfectly, reading and handing back Arrow data frames, meaning you can seamlessly swap between tools: SQL for some parts, other tools where they're better. Also very fast. I was able to throw away designs for multi-machine setups because DuckDB on its own was fast enough not to have to worry about anything else.
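
A small sketch of that back-and-forth in Python (the table contents and the aggregation are made up; assumes reasonably recent duckdb and pyarrow):

    import duckdb
    import pyarrow as pa

    # Any pyarrow Table in scope can be queried by name (replacement scan).
    events = pa.table({"user_id": [1, 1, 2], "value": [0.5, 1.5, 2.0]})

    # SQL where SQL is convenient...
    result = duckdb.sql(
        "SELECT user_id, sum(value) AS total FROM events GROUP BY user_id"
    ).arrow()

    # ...and back out as an Arrow table for whatever tool comes next.
    print(result)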


Having used all three, I'd go with ClickHouse/DuckDB over Arrow every time.


Oh interesting - why?


The tl;dr is that they're easier to use and faster.


100% agree.


Probably for SQL (top n, ...), but not for wrangling & analytics & ML & AI & viz.


Here are some cookbook examples: https://arrow.apache.org/cookbook/py/data.html#group-a-table, https://arrow.apache.org/cookbook/. Datasets would probably be a good approach at the billions scale, see: https://blog.djnavarro.net/posts/2022-11-30_unpacking-arrow-...
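
For instance, an in-memory group-by with pyarrow (7.0+) looks like this (toy data, invented column names):

    import pyarrow as pa

    table = pa.table({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

    # The grouped aggregation runs in Arrow's C++ compute kernels.
    result = table.group_by("group").aggregate(
        [("value", "sum"), ("value", "mean")]
    )
    print(result)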


Generally, operating on raw numbers in a columnar layout is very very fast, even if you just write it as a straightforward loop.
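
As a toy illustration in Python/NumPy (the array size is arbitrary and the timing will vary by machine):

    import time
    import numpy as np

    # 100 million contiguous float64s (~800 MB), i.e. one "column".
    col = np.random.default_rng(0).random(100_000_000)

    t0 = time.perf_counter()
    total = col.sum()  # effectively a tight loop over contiguous memory
    print(total, time.perf_counter() - t0, "seconds")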



