What if it were 650TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DuckDB would make a difference to the overall architecture. You need a data catalog, and you need clusters of jobs at scale, regardless of which data format library (or libraries) you use.
1. Assume a date is 8 bytes
2. Assume 64-bit counters
So for each date in the dataset we need 16 bytes to accumulate the result.
That's ~180 years' worth of daily post counts per megabyte of RAM - but the dataset in the post was just 1 year.
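For anyone who wants to sanity-check that, here's a minimal sketch of the arithmetic and the accumulation itself (plain Python; 16 bytes per day is the theoretical floor, a real hash map carries more overhead, and the input iterable is hypothetical):

```python
# Back-of-envelope: memory to accumulate daily post counts, assuming an
# 8-byte date key plus an 8-byte (64-bit) counter per distinct day.
BYTES_PER_ENTRY = 8 + 8

entries_per_mb = (1024 * 1024) // BYTES_PER_ENTRY  # 65,536 days
years_per_mb = entries_per_mb / 365                # ~180 years of daily counts
print(f"{entries_per_mb} days (~{years_per_mb:.0f} years) of counts per MB of RAM")

# The accumulation itself is just a hash map of date -> count.
from collections import defaultdict

def count_posts_by_day(dates):
    """dates: any iterable of datetime.date values (hypothetical input)."""
    counts = defaultdict(int)
    for d in dates:
        counts[d] += 1
    return counts
```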
This problem should be mostly network-limited in the OP's context; decompressing snappy-compressed parquet should run at circa 1 GB/sec. The "work" of parsing a string to a date and accumulating isn't expensive compared to snappy decompression.
I don't have a good explanation for the 33% runtime difference between duckdb and polars here.
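For reference, the aggregation being benchmarked is roughly the following in both engines (a sketch only - the parquet glob and the timestamp column name are my guesses, not taken from the article):

```python
import duckdb
import polars as pl

PARQUET_GLOB = "data/*.parquet"   # hypothetical path; the post uses its own layout
TS_COLUMN = "created_at"          # hypothetical timestamp column name

# DuckDB: scan the parquet files and count rows per day.
daily_duckdb = duckdb.sql(f"""
    SELECT date_trunc('day', {TS_COLUMN}) AS day, count(*) AS posts
    FROM read_parquet('{PARQUET_GLOB}')
    GROUP BY 1
    ORDER BY 1
""").df()

# Polars: same aggregation via the lazy engine.
# (Older polars versions spell these .groupby() and pl.count().)
daily_polars = (
    pl.scan_parquet(PARQUET_GLOB)
      .group_by(pl.col(TS_COLUMN).dt.truncate("1d").alias("day"))
      .agg(pl.len().alias("posts"))
      .sort("day")
      .collect()
)
```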
I think the entire point of the article (reading forward a bit through the linked redshift files posts) is that almost nobody in the world uses datasets bigger than 100TB, that when they do, they use a small subset anyway, and that 650GB is a pretty reasonable approximation of the entire dataset most companies are even working with. Certainly in my experience as a data engineer, they're not often in the many terabytes. It's good to know that OOTB duckdb can replace Snowflake et al. in these situations, especially with how expensive they are.
> It's good to know that OOTB duckdb can replace Snowflake et al. in these situations, especially with how expensive they are.
Does this article demonstrate that, though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require them. But this article consists of exactly one very simple aggregation query, and even then it takes 16 minutes to run (in the best case). As others have mentioned, the long execution time is almost certainly dominated by IO because of limited network bandwidth, but network bandwidth is one of the resources you get more of in a distributed computing environment.
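Quick back-of-envelope on that IO point, taking the 650GB size and the 16-minute best-case runtime from this thread at face value:

```python
# What sustained read throughput does scanning ~650 GB in ~16 minutes imply?
# (Both figures come from this thread, not from my own measurements.)
dataset_bytes = 650e9
runtime_s = 16 * 60

throughput_mb_s = dataset_bytes / runtime_s / 1e6
print(f"~{throughput_mb_s:.0f} MB/s sustained")  # ~677 MB/s, roughly 5.4 Gbit/s
```

That's well below what local NVMe can deliver, which is consistent with the scan waiting on the network rather than on CPU, as described upthread.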
But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, the single-node execution time is going to explode.
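To make "more complicated" concrete, here's the flavor of query I mean - every table and column here is hypothetical, it just shows where joins and sorts enter the plan:

```python
import duckdb

# Hypothetical query: a join plus a sort on an aggregate. None of these
# tables or columns come from the article; they only illustrate where
# single-node plans start to hurt (joins, sorts, spilling to disk).
top_authors = duckdb.sql("""
    SELECT a.author_name,
           count(*)            AS posts,
           avg(length(p.body)) AS avg_body_len
    FROM read_parquet('posts/*.parquet')   AS p
    JOIN read_parquet('authors/*.parquet') AS a
      ON p.author_id = a.author_id
    GROUP BY a.author_name
    ORDER BY posts DESC
    LIMIT 100
""").df()
```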
I completely agree, real-world queries are complicated joins, aggregations, staged intermediary datasets, and further manipulations. Even if you start with a single coherent 650GB dataset, if you have a downstream product based on it, you will have multiple copies and iterations, which also have to be reproducible, tracked in source control, and visualized in other tools in real time. Honestly, yes, parquet and duckdb make all this easier than awk. But they still need to be integrated into a larger system.