> Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds

If someone has a code example to this effect, I'd be grateful.

I was once on the receiving end of a salesy pitch from a cloud advocate claiming that BigQuery (et al.) can "process a billion rows a second".

I tried to create an SQLite example with a billion rows to show that this isn't impressive, but gave up after hitting obstacles generating the data.
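
For reference, here is roughly what I was attempting, sketched with DuckDB instead of SQLite; the column names, the file-backed database, and the row count are my own assumptions, not a benchmark I've actually run:

    import duckdb  # pip install duckdb

    # File-backed database so the table can be larger than RAM.
    con = duckdb.connect("billion.duckdb")

    # Generate a billion synthetic rows inside the engine itself;
    # range(n) is a built-in table function, so there is no CSV to wrangle.
    con.execute("""
        CREATE OR REPLACE TABLE events AS
        SELECT range AS id,
               range % 1000 AS group_id,
               random() AS value
        FROM range(1000000000)
    """)

    # Filter + aggregate across the full billion rows.
    print(con.execute("""
        SELECT group_id, count(*) AS n, avg(value) AS avg_value
        FROM events
        WHERE value > 0.5
        GROUP BY group_id
        ORDER BY n DESC
        LIMIT 5
    """).fetchall())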

It would be nice to have an example like this to show developers (and engineers), who have become accustomed to today's extreme levels of CPU abuse, that modern laptops really are supercomputers.

It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't obvious, in my view, has a lot to do with the state of OS/browser/app design & performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.




An example using R code is here: https://arrow.apache.org/docs/r/articles/dataset.html

The speed comes from the raw speed of Arrow, but also from a 'trick': if you apply a filter, it is pushed down to the raw Parquet files, so some of them don't need to be read at all thanks to the hive-style organisation.
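
In Python, that pushdown looks roughly like this (the hive-partitioned directory layout and the column names are invented here, just for illustration):

    import pyarrow.dataset as ds

    # Hive-partitioned directory, e.g. data/year=2021/month=01/part-0.parquet
    dataset = ds.dataset("data/", format="parquet", partitioning="hive")

    # The filter is applied while scanning: files and row groups whose
    # partition values or statistics cannot match are skipped entirely.
    table = dataset.to_table(
        columns=["user_id", "value"],
        filter=(ds.field("year") == 2021) & (ds.field("value") > 0),
    )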

Another trick is that parquet files store some summary statistics in their metadata. This means, for example, that if you want to find the max of a column, only the metadata needs to be read, rather than the data itself.
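
Those statistics are easy to poke at directly; for example (file name and column index are placeholders):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("part-0.parquet").metadata

    # Each row group stores per-column min/max, null counts, etc.
    for i in range(md.num_row_groups):
        stats = md.row_group(i).column(0).statistics
        if stats is not None and stats.has_min_max:
            print(i, stats.min, stats.max)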

I'm a Python user myself, but the code would be comparable on the Python side


You can see some of the benchmarks for DataFusion (part of the Arrow project, built with Arrow as the underlying in-memory format) here: https://github.com/apache/arrow-datafusion/blob/master/bench...

Disclaimer: I'm a committer on the Arrow project and contributor to DataFusion.


You can try the examples, or DataFusion with Flight. With that setup in Rust I have been able to process, in milliseconds, data that usually takes tens of seconds with a distributed query engine. I think Rust combined with Arrow, Flight, and Parquet can be a game changer for analytics after a decade of Java with Hadoop & co.


Completely agree with this. Rust and Arrow will be part of the next set of toolsets for data engineering. Spark is great and I use it every day, but it's big and cumbersome to use. There are use cases today that are being addressed by DataFusion, DuckDB, and (to a certain extent) pandas, and that will continue to evolve. Hopefully Ballista can mature to the point where it's a real Spark alternative for distributed computation. Spark isn't standing still of course, and we're already seeing a lot of different drop-in C++ SQL engines, but moving entirely away from the JVM would be a watershed, IMO.


ClickHouse and DuckDB are the databases I would look at; they support this use case pretty much "out of the box".

E.g. https://benchmark.clickhouse.com has some query times for a 100 million row dataset.


DuckDB is so simple to work with. It's only worth looking elsewhere for really big data, or where you really need a client-server setup.

I hope it receives more love.


DuckDB is outrageously useful. Great on its own, but it also slots in perfectly, reading and handing back Arrow data frames, meaning you can seamlessly swap between tools: SQL for some parts, other tools where they're better. Also very fast. I was able to throw away designs for multi-machine setups because DuckDB on its own was fast enough not to have to worry about anything else.
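
A small sketch of that back-and-forth in Python (the table contents and the aggregation are made up; assumes reasonably recent duckdb and pyarrow):

    import duckdb
    import pyarrow as pa

    # Any pyarrow Table in scope can be queried by name (replacement scan).
    events = pa.table({"user_id": [1, 1, 2], "value": [0.5, 1.5, 2.0]})

    # SQL where SQL is convenient...
    result = duckdb.sql(
        "SELECT user_id, sum(value) AS total FROM events GROUP BY user_id"
    ).arrow()

    # ...and back out as an Arrow table for whatever tool comes next.
    print(result)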


Having used all three, I'd go with ClickHouse/DuckDB over Arrow every time.


Oh interesting - why?


The tl;dr is that they're easier to use and faster.


100% agree.


Probably for SQL (top n, ...), but not for wrangling & analytics & ML & AI & viz.


Here are some cookbook examples: https://arrow.apache.org/cookbook/py/data.html#group-a-table, https://arrow.apache.org/cookbook/. Datasets would probably be a good approach at the billions scale, see: https://blog.djnavarro.net/posts/2022-11-30_unpacking-arrow-...
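
For instance, an in-memory group-by with pyarrow (7.0+) looks like this (toy data, invented column names):

    import pyarrow as pa

    table = pa.table({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

    # The grouped aggregation runs in Arrow's C++ compute kernels.
    result = table.group_by("group").aggregate(
        [("value", "sum"), ("value", "mean")]
    )
    print(result)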


Generally, operating on raw numbers in a columnar layout is very very fast, even if you just write it as a straightforward loop.
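
As a toy illustration in Python/NumPy (the array size is arbitrary and the timing will vary by machine):

    import time
    import numpy as np

    # 100 million contiguous float64s (~800 MB), i.e. one "column".
    col = np.random.default_rng(0).random(100_000_000)

    t0 = time.perf_counter()
    total = col.sum()  # effectively a tight loop over contiguous memory
    print(total, time.perf_counter() - t0, "seconds")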



