Hacker News
Big data is dead (motherduck.com)
879 points by davidgomes on Feb 7, 2023 | 433 comments



"For more than a decade now, the fact that people have a hard time gaining actionable insights from their data has been blamed on its size."

The real issue is that business people usually ignore what the data says. Wading through data takes a huge amount of thought, which is in short supply. Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven". Most corporate decision making is highly political; what's best for the business is just one parameter in a complex equation.


I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

I did several experiments, and noticed that whenever I produced analysis that was in line with what management expected - my analysis was praised and widely disseminated. Nobody would even question data completeness, quality, whatever. They would pick some flashy metric like a percentage and run around with it.

Whenever my analysis contradicted - there was so much scrutiny in numbers, data quality, etc, and even after answering all questions and concerns - analysis would be tossed away as non-actionable/useless/etc.

If you want to succeed as a Data Scientist and be praised by management, you've got to provide data analysis that supports management's ideas (however wrong or ineffective they might be).

Data Scientist's job is to launder management's intuition using quantitative methods :)


> If you want to succeed as a Data Scientist and be praised by management, you've got to provide data analysis that supports management's ideas (however wrong or ineffective they might be).

> Data Scientist's job is to launder management's intuition using quantitative methods :)

It’s no different than the days when grey bearded wisemen would read the stars and weave a tale about the great glory that awaits the king if he proceeds with whatever he already wants to do.

The beards might be a bit shorter or nonexistent, but the story hasn’t changed.


If you [Croesus] go to war, a great empire will fall.


And the alternative is to use the data as bones, throw it up in the air and let it tell you what to do?


Absolutely. If you don't like what K-Means is telling you, change a variable and re-run. (That's one great thing about business data: there's no shortage of variables! True, there's usually a shortage of independent variables, but fixing that is difficult and underfunded.)
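
To make the "change a variable and re-run" point concrete, here is a toy sketch. Everything in it is an assumption for illustration: the four customers and their numbers are made up, and `kmeans_1d` is a stripped-down two-cluster k-means on a single feature, not any library's implementation. The same customers fall into different segments depending on which variable you cluster on.

```python
def kmeans_1d(values, iters=20):
    """Tiny 2-cluster k-means on one feature; centroids start at min and max."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins the nearer centroid.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        # Update step: move each centroid to the mean of its group.
        lo = sum(group0) / len(group0)
        hi = sum(group1) / len(group1)
    return labels

# Hypothetical customers: (name, monthly spend, store visits per month).
customers = [("A", 10, 20), ("B", 12, 2), ("C", 95, 18), ("D", 100, 3)]

by_spend = kmeans_1d([spend for _, spend, _ in customers])
by_visits = kmeans_1d([visits for _, _, visits in customers])

for (name, _, _), s, v in zip(customers, by_spend, by_visits):
    print(f"{name}: spend cluster {s}, visits cluster {v}")
```

Clustering on spend groups A with B and C with D; clustering on visits groups B with D and A with C. Neither result is wrong, which is exactly why picking the one that matches the story you wanted is so easy.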


And you'd better hope the bones actually say something useful.

I was the infra lead on a data lake project and got to take part all the way to breaking down the data and turning it into PowerBI reports. The result was "sell more", and to clients whom marketing had already identified, years ago, as whales.

There were some interesting other insights, esp. w/r/t niche products that sold around weird dates (Easter, Memorial Day, 4th July -- but not obvious gift days like Valentines or X-Mas), but it led to a lot of "you're doing it wrong!" recriminations and follow up projects.


> Data Scientist's job is to launder management's intuition using quantitative methods

Ouch. This is savage, but sadly correct in many cases.

HOWEVER, to play devil's advocate here, I've also seen corporate data scientists overstate the conclusions / generalizability of their analysis. I've also seen data scientists fall prey to believing that their analysis proves what should be done, rather than what is likely to happen.

The role of an executive or decision maker is to apply a normative lens to problems. The role of the data scientist / economist / whatever is to reduce the uncertainty that an action will have the desired effect.


Yep. At this point, I essentially don't trust any ML result that shows > 95% accuracy.

So often, those models proved to be over-fitted and not generalizable.

But too many decision makers simply can't properly judge such results.
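
A minimal sketch of why a very high accuracy number deserves suspicion, assuming a deliberately silly setup: the labels are coin flips, so there is nothing to learn, and the "model" is a 1-nearest-neighbour lookup that simply memorizes the training set. The data, sizes, and function names are all made up for illustration.

```python
import random

def make_data(n, rng):
    """Random features with coin-flip labels: there is no real signal."""
    return [([rng.random() for _ in range(5)], rng.randint(0, 1)) for _ in range(n)]

def predict_1nn(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return nearest[1]

def accuracy(train, data):
    """Fraction of rows in `data` whose label the memorizing model gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(train, test)    # labels are noise, so expect roughly 0.5
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Scored on the data it was fit on, the model looks perfect; on held-out data it is no better than guessing. A reported accuracy means little unless you know which of those two numbers you are looking at.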


Good point. Data is one aspect of making a decision. The other aspect is understanding the industry and environment. Often data scientists give just one variable needed to make a decision. In health care, for example, you need to factor in a whole host of legislation. You also need to factor in aspects of the industry not reflected in the data. As an example, doctors not wanting to use iPads is something you can't measure and can't force as a company, even though data analysis might suggest this is the way to go.


> The role of an executive or decision maker is to apply a normative lens to problems. The role of the data scientist / economist / whatever is to reduce the uncertainty that an action will have the desired effect.

Where do business analysts fit into this dichotomy? Their whole job is to poke around in Tableau in order to surface high-ROI strategies for the business to pursue. (Where, in choosing which proposals to surface to management, they're effectively making 90% of the strategic decisions.)

Or how about corporate buyers in trading and retail companies?

Or quantitative investment managers?


People who poke around in Tableau might not get a lot of respect in the hierarchy of DataFolk, but descriptive statistics and thoughtfully chosen visualizations can be immensely useful. Exploratory data analysis sometimes reveals patterns that are so obvious that to apply statistical inference is just vanity.

If understanding the data generating processes is the goal, I'd rather see some useful plots than wade through a technical description of some model whose assumptions were flagrantly violated.


What does a “normative” lens mean?


As opposed to "positive". It's the old is-ought dichotomy https://en.m.wikipedia.org/wiki/Is%E2%80%93ought_problem.

Positive claims are about what is true. Normative claims are about what should be true, or rather what decisions we should make. Put another way, positive claims deal only with facts while normative claims deal also with values.

GP is saying that it's the data-scientist's job to give the executive the facts and it's the executive's job to decide what to do about the facts.


> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

So, a synonym for 'consultant?' :)


My experience with consultants normally ended up with them asking why they are there and what report should they present to upper management.

I've always used them as "independent 3rd parties" who were listened to.


Here's my take on this not listening to the "expert":

A few years ago there was a problem with storm-water infiltration into my (elderly at the time) mother's property from her neighbor. I, being a dutiful son and a civil engineer, investigated it and came up with the probable cause, the likely effects of non-action and the most cost-effective solution. I presented it to my mother in the most layman-like terms that I could. She said she'd think about it – meaning she'd refer, i.e. defer, to her daughters. In the meantime I had a chat with my mother's carer and explained the situation in layman's terms. The carer listened and said that what I said made total sense to her. Later on, one of my sisters accosted me and stated that it was completely obvious what the problem and the solution were – "even the carer could see it". Human foreheads don't have the real estate for where my eyebrows wanted to ascend.

My advice is to consider whether the message should be separated from the messenger somehow.


"A prophet is not welcome in his hometown"

This has been going on for millennia, unfortunately.


This sounds a lot like how my kids will listen to a teacher/coach, but not their parents...


Which is similar to how a lot of parents won't listen to their kids but will listen to the coach, teacher, or priest.


My Mother-in-Law was called by a tech support scammer. Her bank was unwilling to accept their charges, and the scammer wanted her to call the bank to tell them to accept them anyway. My Brother-in-Law was telling her "no, this is a scam, do not do this", but she was unwilling to listen. Eventually, he told her to call me, thinking if she wouldn't listen to her son, maybe she'd listen to her son-in-law. Which she did.


Parents can be listening to their kids 99% of the time, and it will be transparent and uneventful. When there’s confrontation/divergence in opinion, by definition it didn’t work out through the usual channel, and of course a third party weighing in will have visible effects.


As a consultant with roots in backend dev, I fully understand the scrutiny that we receive because unfortunately, it is often very warranted... It feels a bit refreshing to read your comment and see someone articulate what I am trying to convey to my clients. I am a tool, and yes, this pun is intended.


Hello

I want to make the move from development into consultancy and would appreciate hearing how you did it.

I cannot DM you, but if you have the time to type a small paragraph, I'm all ears.

If you don't want to make that public, you can directly email me (<HN-username>@gmail.com)


Sure, it is actually not very complicated in my case. I did backend development for a short while during and after university and then moved into IT consulting fairly quickly.

It was a LinkedIn recruiter message which I usually ignore. However, my SO did not (she is in IT as well) and convinced me to join a hiring event. I ended up liking it a lot and went through the hiring process. Soon, I started out on the most junior level and joined my first project with 3 very senior colleagues after a few weeks.

The learning curve was very steep both on the technical level and also regarding the consulting aspect - at first there was nothing I could 'consult' on due to lack of experience. This changed with growing experience, with the guidance of senior colleagues and my private efforts to gain skills and expertise.

Let me know if this was helpful!


This almost reads like my trajectory so far, but I'm at the point where I can't really consult due to the lack of experience, but I did make a good impression so far. Can I ask you, into what efforts should I put my private time? More technical knowledge? Into very fine details, or brief insights into different areas? Any good resources?

Thanks a lot!


In the news business, if your story or opinion backs up the preconceived notions of the investigative reporter then you are a 'source' otherwise you are a 'conspiracy theorist'.


All symptoms of the same problem..... you can hire McKinsey to confirm your priors, massage the data to confirm your priors, or anything in between.


A consultant with the data to back up their claims!


If Data Scientists are essentially in-house management consultants, I wonder which is cheaper?


This could be a reason why Data Scientist as a job title exploded in recent years: every middle manager could afford one, two, or a few data scientist headcounts to produce analysis that advances that middle manager's corporate agenda (more growth, empire building, expansion into certain de novo areas, etc).

Recent tech layoffs are the other side of that growth: when cheap money is gone, a company is forced to stick to core competencies and shut down growth plans.


This would match what psychologists say about humans in general: we feel first, then we use our brain to justify that feeling. We’re not rational beings.


I think the answer is simpler: people care about their careers and their family first. Think, "If the data says something that gets in the way of my career well I don't care about the data."

Had the same problem when I was an economics researcher -- publication bias for what stakeholders want to hear (often the government) is rampant because that's where funding for the economics department mostly comes from.


It's only rational. The company certainly doesn't care about that individual first, as evidenced by e.g. its decision to lay them off when it doesn't think the individual is serving them, so why should the individual put the company first?

This is also known as The Iron Law of Institutions.


We totally are, it's just that rationality is a tool, not a guide. If you want to work out the truth, rationality will help you do that, but if instead you want to justify a decision you already made, well, it'll help you do that too.


Hypotheses don't come from rationality either; they result from well informed intuition. All of the formality of science is about tricking ourselves into discovering our intuition is wrong using a rational series of steps, even when everything in our nature is to use that ability to reason to do the opposite.


"Man is not a rational animal; he is a rationalizing animal."

-- Robert A. Heinlein, Tunnel in the Sky, 1955


A whole industry of emotional branding is thriving, systematically overloading our brains so it hurts to even think differently in the moment.

We are accepting a whole lot of assumptions every day.


Or is it that psychologists feel that we aren't rational and use reason to justify this?

It isn't clear to me how the grounds for realizing the theory are reconcilable with its conclusion.


That's because psychologists don't understand, or choose to ignore, how chemistry influences our personalities and emotions. An extremely simple example from the same medical/health profession is the use of SSRIs to make people feel happy. The legal system recognises how chemicals influence our feelings, as shown by the laws that exist on illegal drugs and drink driving.

The definition of rational is being informed enough to know what said chemicals will do in the short term and long term in order to make an informed decision, but then I'm reminded we don't get taught any of the above unless we specialise at a Uni, so most people can't make any sort of informed decision.


I'd like to know more about "[Your] life under [your] state employed 'parents' and at the hands of other employees of the state"


Yes. I worked in the data org of a moderately sized financial firm's tech org. The tech org claimed to be hugely data driven. It was in the org mottos and all of that.

Nonetheless, the CTO went on a multi-year, 10s of millions of dollars, huge data tech stack & staffing reorg shake up... with really zero data points explaining the driver, or what we would measure to determine it was successful.

So it became a self referential decision that we are successful by doing what he decided, and we are doing it because he decided it.


Huh. I've not thought of it as laundering, but I think you've basically summarised consulting in healthcare. Pay to legitimize and push through a pre-existing idea (eg let's close down a few ERs) or a delusion (e.g. lean, we don't need a waiting room) and say it was recommended by consultants to stakeholders and the public.


All consulting is like that. Partners/MDs at consulting companies meet with the Board/CEO to get a rough idea of what they want/need, and quickly negotiate a consulting engagement contract to create a PowerPoint with all the evidence and analysis gathered that supports the CEO's initial idea.

This is the only reason why a 60+ slide PowerPoint deck can cost several million dollars.


Right, the more appropriate analogy is "parallel construction"...


Also a big reason McKinsey and BCG exist: to provide cover for business plans intended by management, protecting them from shareholder lawsuits. My friend did a sojourn at McKinsey, and 6 months of his life went into producing PowerPoints and memos backing up an expansion to the APAC region. It was already happening, but he was providing all manner of business justification for board meetings and whatnot.


This phenomenon is true to varying degrees in academic medicine (maybe all of academia) as well - personally have seen excellent data and methods disregarded when they don't confirm existing agendas. The choice for the researcher can become one of burning out trying to do good work and getting nowhere, or acquiesce and only present data that is uncontroversial. Huge existential threat to knowledge advancement.


"Data launderer" would be a good job title...


The data is not laundered. Preconceived ideas and biases are laundered and given scientific sounding justifications.


Concept Confirmer, Bias Booster?


Context Provider.


Affirmation Artificer?


Assessment Assurance, for double the bang.


Cherry Picker


Good role for all Perky Cheekers, surely!



Decision laundering. Take in dirty decisions and produce clean ones.


Most data is very dirty.


That would imply the data is clean when they are finished. GIGO


This isn't just "Data Scientist" but scientist as well. The more a finding is in contradiction, either with existing scientific consensus or even with just popular culture, the more the science is criticized. I've seen unequal criticism based on how much people wanted the results to be true/false and even after responding to the criticism I've seen people just ignore science they don't like.

The skepticism isn't a problem; the unequal application of it, the potential to harm careers, and the chilling effect as people wise up to how best to meet their own personal goals are.


> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

The SNAFU principle: communication is only possible between equals. When a hierarchical divide exists, the subordinate will tell the superior what he wants to hear.


> The SNAFU principle: communication is only possible between equals. When a hierarchical divide exists, the subordinate will tell the superior what he wants to hear.

Sadly true. As humorously depicted here[0]:

   In the beginning was the DEMO Project. And the Project was
   without form. And darkness was upon the staff members 
   thereof. So they spake unto their Division Head, saying, "It 
   is a crock of shit, and it stinks."

   And the Division Head spake unto his Department Head, 
   saying, "It is a crock of excrement and none may abide the 
   odor thereof." Now, the Department Head spake unto his 
   Directorate Head, saying, "It is a container of excrement, 
   and is very strong, such that none may abide before it." And
   it came to pass that the Directorate Head spake unto
   the Assistant Technical Director, saying, "It is a vessel of
   fertilizer and none may abide by its strength."

   And the assistant Technical Director spake thus unto the
   Technical Director, saying, "It containeth that which aids 
   growth and it is very strong." And, Lo, the Technical 
   Director spake then unto the Captain, saying, "The powerful 
   new Project will help promote the growth of the 
   laboratories."

   And the Captain looked down upon the Project, and He saw 
   that it was Good!

[0] https://www.anvari.org/fortune/Best_Fortunes/820_in-the-begi...


Same goes for economists and the politicians who sponsor them, just as it did for the astrologers and their patron kings.


Economics is the go-to for conservative, status-quo maintaining arguments because there's a wealth of statistics and information available for "how things have always been", and precious little-to-none for "how things could be if..."

It's easier to poke holes in predictions of the future than in interpretations of the past, especially when the people making those decisions have likely reached their decision-making status through "how things have always been".


I think this depends a lot on the org. In a place I used to work we collected and analysed a lot of data which convinced management to significantly change the spec of the product and spend a lot more time and effort on testing, because the product was being used in unexpected ways.

I would say it was a very engineering driven org however, so if you could present compelling data it could go a long way.


Authoritarian types consider any information derived by science which is contrary to their position as invalid or irrelevant because facts challenge their authority and ability to exercise control.


Yes. I used to think the Church had an honest disagreement with Galileo about heliocentricity. When I grew up I realized the Church never cared about orbits at all; what they cared about was maintenance of the status quo.

And then when I got old, I realized, there is even a reason that some people want status quo... because they have usually been around long enough to see society fall apart into anarchy and mass murder, so in their mind, they are doing the right thing.


"The Church" wasn't then and isn't now a monolith of opinion.

A modern characterisation of "The Galileo Affair" would be that he was SWAT'ed by someone he was really really mean online to.

    Thus the whole "Galileo affair" starts as a conflict initiated by a secular Aristotelian philosopher, who, unable to silence Galileo by philosophical arguments, uses religion to achieve his aim.  [1]
and

    While delle Colombe was almost alone in arguing publicly against Galileo, there was a group of scholars and churchmen who supported his Aristotelian views. After Galileo referred disparagingly to delle Colombe as 'pippione' ('pigeon'), his close friend the painter Lodovico Cigoli coined the nickname 'Lega del Pippione' ('The Pigeon League') for delle Colombe's group.[2]
Galileo literally referred to delle Colombe (and friends) as Simplicio (simple minded) and worse in his highly popular Dialogue Concerning the Two Chief World Systems [3], and within a year or so the Pigeon League got their revenge, using their influence to have religious charges brought against Galileo.

The affair was complex since very early on Pope Urban VIII had been a patron to Galileo and had given him permission to publish on the Copernican theory .. this was very much a case of personal vendettas and internal politics rather than a straight up case of "The Church Versus Galileo".

[1] https://en.wikipedia.org/wiki/Galileo_affair#cite_note-Spell...

[2] https://en.wikipedia.org/wiki/Lodovico_delle_Colombe

[3] https://en.wikipedia.org/wiki/Dialogue_Concerning_the_Two_Ch...


The church was not upset about heliocentrism. They were upset that Galileo was attempting to reinterpret the words of the Bible in order to bolster his astronomical authority.


"[Therefore,] when God willed that at Joshua’s command the whole system of the world should rest and should remain for many hours in the same state, it sufficed to make the sun stand still. In this manner, by the stopping of the sun, the day could be lengthened on earth—which agrees exquisitely with the literal sense of the sacred text.”

- Galileo Galilei


>>I did several experiments, and noticed that whenever I produced analysis that was in line with what management expected - my analysis was praised and widely disseminated. Nobody would even question data completeness, quality, whatever. They would pick some flashy metric like a percentage and run around with it.

>> Whenever my analysis contradicted - there was so much scrutiny in numbers, data quality, etc, and even after answering all questions and concerns - analysis would be tossed away as non-actionable/useless/etc.

It's a good sign at the company that I run, anytime our analysts/data scientists come up with metrics that say we're killing it, or that our ideas should bear a ton of fruit, the kneejerk reaction is to be extremely skeptical of the results. Usually they're still right.

When the data scientists say we're fucking something up, we tend to pay a lot more attention.

Only the paranoid survive, after all.


An interesting essay that echoes these same sentiments:

https://ryxcommar.com/2022/11/27/goodbye-data-science/


Oh, the experiment didn't go as expected? Rerun 5 more times with minor tweaks. It's definitely not p-hacking ;).
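
For a rough sense of how much those reruns matter, here is a back-of-the-envelope sketch. It assumes the runs are independent, that there is no real effect, and that each run is judged at a 0.05 threshold; reruns with "minor tweaks" are not truly independent, so treat the number as an illustration rather than an exact rate. The function names are made up for this example.

```python
import random

def false_positive_rate(runs, alpha=0.05):
    """Chance that at least one of `runs` independent null tests looks significant."""
    return 1 - (1 - alpha) ** runs

def simulate(runs, trials=20_000, alpha=0.05, seed=0):
    """Monte Carlo check of the formula above."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis each run's p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(runs)):
            hits += 1
    return hits / trials

# The original experiment plus 5 reruns is 6 looks at the data.
print(f"1 run:  {false_positive_rate(1):.3f}")
print(f"6 runs: {false_positive_rate(6):.3f} (simulated: {simulate(6):.3f})")
```

With one run the false positive chance is 5%; with six it is 1 - 0.95^6, or roughly 26%. Keep rerunning until something "works" and a spurious win becomes more likely than most people expect.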


I’ve been there, we wanted to release a feature, it kept coming back with issues that made it perform much worse than control, after 5 or so iterations with bug fixes it came back positive.

It took a lot of analysis and time to clarify to higher ups that we weren’t just p-hacking, but at least they were concerned about that.


Sadly, it's the same with academia and funding sources.


I agree with you. I call Data Scientists "soothsayers for the Pharaoh".


> Data Scientist's job is to launder management's intuition using quantitative methods :)

https://www.youtube.com/watch?v=kAichhoZrKs


I wonder what a data scientist could really find out about executive (over?) compensation, employee compensation, working from home, office cubicle size and layout, or tool expenditure for employees vs productivity.


How would you measure productivity at scale?


Well, obviously you ask the CEO.


> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

Also someone to blame if it doesn't work out


As a data scientist at a large corporate I find this is often the push… but I resist every time and tell people what they don’t want to hear. Maybe I’m playing this whole corporate ladder thing wrong :/


"Data Scientists exist not to uncover insights ..."

That was a plot point in Dirk Gently's Holistic Detective Agency (1987), though the observation much pre-dates this.


How is this really different from any other aspect of life? Very few people really like to be told counter information, and it is always easier when providing data that aligns with the current groupthink. Doesn't matter if it is business, politics, or really anything. Being the outlier trying to change the direction of things is a struggle.


110% my experience working in data science.

I found it incredibly stressful to discover and provide analyses (even experimental results) that weren't expected, or that contradicted prior beliefs. The findings were always very harshly scrutinized, and typically led to tons of pointless extra work to 'understand what is going on'.


That's like a portrait artist that finds success by painting people more beautiful than they really are vs a starving one that paints them true to life due to sense of artistic integrity. Reminds me how Garth Brooks started doing metal after becoming a country music star.


Not just data scientists.

My friend runs a successful market research agency and she says she gets called in when management have decided they need to make a change but need evidence to sell it to the shareholders and staff.


So Big Data isn’t dead. It’s just found its place in society….


Sounds like a good way to weed out middle management ;)


Indeed, confirmation bias happens to almost everyone.


Bingo. Welcome to the world of office politics!


Reminds me of the book "Bullshit Jobs"


True - aka “are right a lot”


Government should do it, that way we can be sure it is honest and correct.


I mean, that makes sense does it not? If you're confirming something people already had a hunch about, why would they challenge it? And if it does go against their belief, they are going to want to make sure the data is correct before they change the course of the ship.


Agree with some of what you've said, but disagree with a lot:

> Most corporate decision making is highly political; what's best for the business is just one parameter in a complex equation.

100% Individual humans are emotional creatures with their own wants and needs, and it's important to understand how organizational incentives drive decision making.

> Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven".

This has not been my experience, though. The more common thing I've seen is that, sometimes data is boring and doesn't really show much actionable insight, but as everyone wants to justify their job, I've seen data scientists come up with really questionable conclusions that fell apart on further inspection (call it "p-hacking the enterprise").

Plus, a lot of the data in these data warehouses is messy. Oftentimes data scientists are siloed at the end of the process, and then you get "garbage in/garbage out" results, where there is some bug in data tracking that isn't understood until it's too late. Much better, in my opinion, to have data engineers and data scientists work much more closely with product engineering teams up front so they can help ensure the data they collect is accurate.


What “data driven” normally looks like: business people ask questions that are not answerable, but the DS team often goes off and tries to answer them anyway. I’ve been in so many meetings, in multiple organizations, where a senior leader asks a question like, “why did this number go up? We need an owner to deep dive this.” Then, someone disappears for a few hours or days and comes back with some nice narrative that more or less might make some sense, and is directionally consistent. Then, best case, they go “hmm, ok” and worst case, they say it’s an interesting view and we need to add it to the tracking. Then, move on to a new question + action item, rinse and repeat. And then, 6 months or a year later, some brave soul goes, “geeze, there are way too many things to look at here, can we streamline this reporting to focus on the handful of most important metrics, and then assign owners to review the others and surface anything interesting?” And then all those things people asked for over time get stuffed into an appendix or deleted, and the whole thing starts all over again.


I find a lot of organizations don't have the discipline to harness whatever power their data may have. Sure collect everything, but god forbid you have any sort of data governance, or spend a single resource minute of time manually tagging or organizing or validating it. Then they try to make shitty ML models or products out of it, but don't care if the models actually work or not, just that they have AI now. Then a year later when the model has provided no value they are like, oh well big data is worthless I guess.


Palantir, you have to have Palantir.

Oh, and a bunch of data scientists with zero domain knowledge for whatever data they are analyzing, preferably with PhDs in maths, but some ML background will do. And agile, because of course all those Palantir dashboards can only be developed using agile.

Once all is said and done, zero insight was created but a whole lot of consultants, contractors and project managers have been paid handsomely, while some higher ups can now put "implemented agile and big data at X" on their LinkedIn profiles.


I'm one of those PhDs with zero domain knowledge analyzing data and I share the sentiment.

Most of my analyses provide very little value because they are sort of common sense to people with domain knowledge. When I ask people what could be more useful, one of two things usually happens: 1) it's impossible due to data and/or infrastructure limitations, or 2) what they ask for turns out to be nonsensical on further analysis (like asking for the average of something that follows a very fat-tailed distribution with a few observations dominating the phenomenon. Of course it's usually impossible to explain this to people).

The more I think about this, the more I think that in truly data powered companies, both the decision making and data analysis have to be carried out by more or less the same people. The organizational hierarchies have to be much flatter. Essentially the employees will have to be some kind of "secret agents" who have both the skills and the mandate to steer the company in the direction they see fit. I sort of see this already happening in the FAANG companies where, or so I hear, it's very difficult to get hired, the staff count is quite small compared to traditional companies and the senior engineers have a lot of power in the company.

Using math PhDs or Palantir or whatever as a sort of modularized black box for "insights" while giving them no real skin-in-the-game does not work.


I can confirm this way of working with data from my time at Amazon operations. We used data all the time, everywhere and for everything. But we did not have data scientists in our time, we did it ourselves. Quite peculiar, but so damn efficient and effective. I kind of miss that. It also showed that most of the data analytics stuff I know, Six Sigma, is just plain overkill for a lot of practical applications.

My favorite example is the WW2 bomber diagram shown to illustrate survivorship bias. Sure, working from data and first principles one could identify the vulnerable spots of the bomber. Or one could ask the designers or have an engineer, heck even a contemporary field mechanic, take a look at the actual drawings of the plane. And reach the same conclusion, faster, with a ton of additional insight and improvement ideas that can actually be implemented...


Size isn't the real problem, it's time.

Are you going to take the time / money to set up a warehouse, get all the data into it with an ETL product, set up dbt or some other transformation layer, set up a BI tool and build the reports and dashboards, etc.?

Regardless of the size of your data, you still need to get it in one place and model it in a way it's actually usable.


Exactly. It isn't just time to set up all the data in a way that makes the right query possible. It is also having queries fast enough to be able to run a vast number of them in order to find what you are looking for (or even things you were not looking for).

https://didgets.substack.com/p/data-science-and-serendipity


Is it queries on live data or data that's been moved usually?


It’s moving the data around that is slow… and expensive. Getting the data into the data warehouse, then getting it to the processors then moving it around to filter and transform.

Getting your data to the cloud is expensive, but then you can’t do anything with it because distributing it to process in multiple stages is too expensive and you’re already paying so much to keep all that useless data.


Can't we just give it to that one IT guy down in the basement?


Hey, I used to be that guy (and still am).


Agree on the size not being the issue. I transitioned from a data engineering manager to a data product manager at a new company. You know how much data a typical customer generates? Less than 1TB a year.

I told my VP that the engineering foul-ups in the current product are easily fixable. Standard tooling and patterns exist to re-architect and solve the bottlenecks. What is much harder is finding a data architect to make sense of the complex data and make sure there is good value for our customers.

Guess what position I don't have on the team, and won't have due to budget issues.


Is that what's happening at Amazon as well? They seem to be losing track of the "Customer Obsession" schtick more and more.


"customer obsession" was always at the mercy of the real obsession: "making money hand over fist". The former will ALWAYS lose out to the latter given enough cycles.


Yeah, "customer obsession" really just means "market share / growth obsession" which is a means to (eventually) making monopoly profits. Which Amazon seems to have achieved.


All the amazon corporate values are derived from making profit. Two-pizza teams? More like "three slices isn't frugal", one or two should be enough for you.


One may follow the other, but not vice versa.

It's a pretty strong argument to say that Microsoft under Gates was technically obsessed, but that really faltered under Ballmer.

Microsoft continued to win profits, but they made major strategic missteps that cost them revenue.

Amazon feels like it's going down the same path: empowering the tree-gazers without remembering that the forest also matters.


You can also “coast upwards” for quite a while. As an example, we now pay Microsoft less for the entire Microsoft 365 suite (including email hosting, etc.) than we used to pay for periodic upgrades to Office, but all the machines are now Macs. They make more from us in one way, but less in total dollars.


Exactly. I can't imagine some accountant hasn't come up with a way to quantify this.

{Revenue attributable to previous R&D} (aka coast) vs {Revenue attributable to current R&D} (aka acceleration)


They are encouraging their customers to have a bias toward action. Away from that asshole Bezos.


Their mission hasn't changed. They're still obsessed with customers, just not the way you think.


It's absolutely this. "Decision-based evidence-making" is what I've seen it called.


"We make decisions based on data, so let's use mine."


Take random words that Brent Spiner has said and mash them into a video.


The other thing to consider is that data simply has nothing of value. Part of the marketing of big data is the almost fairy tale belief in "insights" existing in any data set if you just look hard enough.


Correct, much data has no value. The cost for storing the data, maintaining the data [in the day-and-age of privacy requirements especially], and combing through the data is often much greater than the value obtained from the data itself.

The expertise we need in the industry is people who understand applications in-and-out and make great decisions on what data is worth keeping for present and future applications. And what data is needed to be kept, but only in aggregates (or anonymized, which reduces costs of maintenance)


Wait, are you telling me GIS data of deer poop distribution in my backyard has no practical value?

But I have thousands of data points, thousands, I tell ya!


"data" is usually just twisted to make leadership look good or justify what decision they wanted to make anyway. Analytics data is sliced at arbitrary time periods to make growth in whatever metric look good, certain subsets are just removed, etc.

doesn't help that most of this data goes through multiple layers of BS where each person is putting it through filters to make themselves look better. And a good chunk of people don't have enough understanding of stats to understand when they are being tricked


I think the recent industry layoffs reflect that in part.

Though I think there’s a better way: executive data science, which is what I do at Zapier. The key is that I’ve built up a huge amount of econometric and economic / business skills that I apply to effect good change in collaboration with company leaders. It allows me to work with Execs using sensible analysis. I improve growth and output by helping us catch errors of assumption before they go into production and cost growth / bad surprises. I also help the executives gain alignment around good information. This multiplies their departments’ output by allowing them to work better together, more in concert. That helps avoid issues with data being bent to decisions.

They typically carry a lot of hard-won valuable domain knowledge (that I combine with my economic-statistic knowledge and skills for rigor).

It’s my job to ensure Execs start with good sensible information regarding objectives. They usually ask fantastic questions and share a lot of great analysis of their own.

There are times I learn about what I might call controversial implications. This is typical of innovation using technology. It’s in these moments that I feel I create the most value by highlighting the trade offs I believe we face / potential regret.


That sounds amazing, what does your day to day consist of?


I get to apply my causal inference and measurement skills to a lot of compelling situations in collaboration with some really intelligent and skilled people. I truly feel like I get to perform technically excellent work, and I feel very privileged in that way.

My day to day is mostly economic / econometric analysis, conceptualization and formulation of technology, and writing, but also occasional meetings to gather information, feedback, answer questions, teach and collaborate with others.

I try to effectively “sample” the org by interacting with people up and down the org, so I can assist Execs in incorporating directors', managers' and ICs' knowledge / pool good information / systematize planning, etc.


> Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven".

I worked in a "data-driven" company of 5,000+ employees, with 4 data scientists who were split into two separate non-collaborating teams. In effect, they were so under-resourced they got nothing done.


I've found this to be true in the (albeit small) # of samples I've run up against.

There's an agenda by some level of management, and they use "data" to forward their agenda, or disregard it due to "unexplainability" (a legitimate concern for BD/ML) if it disagrees with the agenda.


I would argue that insights from statistical models “[are] just one parameter in a complex equation.” No doubt PR from political issues can help the bottom line of a company. Can also hurt it. Just to be clear I’m not arguing political choices are inherently good or bad.


Replace “data scientists” with “software engineers” and you have another accurate insight. They don’t want to listen to us about how to write software or derive value from data.


> a huge amount of thought

Which itself both takes time and is wildly unpredictable - neither of which plays well with today's Taylorist management schemes.


I don't want you to tell me what the data says, I want to tell you what the data says and you go find data that confirms it.

kthnxbye


It doesn't matter what the data says if it's telling you to go in a direction that is unethical.


No data is better than Bad data


Or even worse, bad understanding of data


deleted- length


of course it won't and it's really ironic that you reply with the next hype


I've made anecdotal observations similar to this over the last 10 years. I work in AgTech. A big push for a while here has been "more and more data". Sensor-the-heck out of your farm, and We'll Tell You Things(tm).

Most of what we as an industry are able to tell growers is stuff they already know or suspect. There is the occasional surprise or "Aha" moment where some correlation becomes apparent, but the thing about these is that once they've been observed and understood, the value of ongoing observation drops rapidly.

A great example of this is soil moisture sensors. Every farmer that puts these in goes geek-crazy for the first year or so. It's so cool to see charts that illustrate the effect of their irrigation efforts. They may even learn a little and make some adjustments. But once those adjustments and knowledge have been applied, it's not like they really need this ongoing telemetry as much anymore. They'll check periodically (maybe) to continue to validate their new assumptions, but 3 years later, the probes are often forgotten and left to rot, or reduced in count.


I've observed the same in manufacturing … and fitness trackers a la FitBit.

There's initial value from training yourself on what something looks/feels like … but diminishing returns after that. Whether there is more value to be found doesn't seem to matter.

Factories would sensor up, go nuts with data, find one or two major insights, tire of data, and then just continue operating how they were before … but with a few new operational tools in their quiver.

Same is true of fitness trackers: you excitedly get one, learn how much you really are sitting(!), adjust your patterns, time passes … then one day you realize you haven't put it on for a week. It stays in the drawer.

Not unless they're threatened with ruin will people make changes to the standard way of doing things. This is actually … not bad! Continuity is important, and this is kind of a subconscious gating function to prevent deviation from a proven way of working. So, the change has to be so compelling or so pressing that they're forced to. Not a bad thing.

While we think things change overnight in this world, they generally take awhile … stay patient … it's worth it.


I went on a diet a few years ago. I obsessively recorded every food I ate in MyFitnessPal. To this day, I know roughly how many calories pretty much everything I eat is. So, I've learned from the process and don't need the process as much any more. (I'm kidding about that - it's easy to underestimate how much you eat, and an extra 200 cal a day adds up over the years.)


I did something similar in my early 20’s. 15 years later the weight is gone and hasn’t come back. It is pretty nice being able to eyeball my plate of food at a BBQ knowing it will not add any flesh to my mid section.

I definitely did not need a multi-PB storage array to reach that goal. Or did I? I am sure someone out there knows roughly the human brain storage capacity in MB.


This is interesting, and I think you're on to something. I got a fitbit when I got serious about running because I had no idea what it felt like to run in zone 2. I can read about it, but that only gets you so far. While running I could actively check my heartrate and adjust. A year into the hobby and now I don't bring my fitbit on probably 50% of my runs. I know how fast I need to go.


Whereas a Garmin is a useful ongoing tool for me as a runner. I have got through about six watches. I don't use gimmicks like step counters though, turned it off. "Congratulations you completed 10,000 steps!" - well duh, I ran 10 miles this morning.


if you're running 10 miles per morning, even half that, on the regular you're an extreme outlier. the vast, vast majority of folks don't get anywhere near that, and a step counter that gamifies making them move is a good thing.


I used to track sooo much health and fitness data... Then I realized it mostly wasn't actionable, or at least, I wasn't altering my decisions based on it. The answer was always, "more training." So I stopped.


I think they are only useful to runners, where weekly distance, speed and heart rate are massively useful. These are all very accurate things that can be captured, unlike weight resistance (swinging a kettlebell, dead lifting).


Your comment made me take my Fitbit out of the drawer again :p At least I can have some fun numbers and graphs about my health to look at when I want to.


Classic paper on soil moisture sensors (from 2010!) -- the title says it all:

"Mate, we don't need a chip to tell us the soil's dry"

https://doi.org/10.1145/1858171.1858211


I tend to think the problem is the "random digging for correlations" part.

Having tons of data is a Good Thing, so long as you can afford the marginal cost of gathering and managing all that data so that it's ready at hand when you need it later.

It's how you use the data that makes all the difference. If you're facing an issue you don't understand at all, don't go digging for random correlations in your mountain of data to find an explanation.

Think like a scientist: you need a valid hypothesis first! Once you have a hypothesis about what your issue might plausibly be, then you make a prediction: "If I'm right, I suspect our Foobar data will show very low values of Xyzzy around 3AM every weekday night". Only then do you go look at that specific data to confirm or refute the hypothesis. If you don't get a confirmation, you need to go back to hypothesizing and predicting before you look again. You can't prove causation by merely correlating data.


> It's how you use the data that makes all the difference. If you're facing an issue you don't understand at all, don't go digging for random correlations in your mountain of data to find an explanation.

Absolutely. But in my experience, there's this massive trend across the tech world that flat out rejects the value of domain/subject matter expertise. Instead, all you need is an engineer who can throw some ML at the uncurated mountain of data your organization has collected. Little to no value is placed on the resources that can frame an actionable hypothesis, even though the entire value proposition arises from this exercise!

Maybe I'm just jaded. I end up wasting a lot of time trying to re-direct data scientists and engineers down more appropriate pathways than if the problem they're solving was just brought to my attention earlier. Sorry, I understand you spent two weeks shoe-horning dataset X into our analysis system for your work, but it's invalid for the question you're asking - use dataset Y instead, and you'll have an answer in an hour or two.


> But in my experience, there's this massive trend across the tech world that flat out rejects the value of domain/subject matter expertise. Instead, all you need is an engineer who can throw some ML at the uncurated mountain of data your organization has collected. Little to no value is placed on the resources that can frame an actionable hypothesis, even though the entire value proposition arises from this exercise!

Sounds like the data scientists need to get together with the MBAs and they can do companies where nobody needs to actually know what they're doing.


No, you're not. The next time I have to listen to some idiotic take that domain knowledge doesn't matter, I might shoot myself


This!!!! The number of data scientists I come across who don't get this. If you're not knowledgeable enough to form a reasonable hypothesis about the data, you have no business touching data


Even better... Once you have ~20 hypotheses, you'll get one right!
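
A back-of-the-envelope sketch of why that works. It assumes the usual 5% significance threshold, independent tests, and that none of the 20 hypotheses is actually true; none of those numbers come from the thread beyond the "~20".

```python
# Assumptions: 5% significance threshold, independent tests,
# and every one of the hypotheses is in fact false.
ALPHA = 0.05
N_HYPOTHESES = 20


def prob_at_least_one_false_positive(n: int, alpha: float) -> float:
    """Chance that at least one of n independent null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** n


p_any = prob_at_least_one_false_positive(N_HYPOTHESES, ALPHA)
expected_false_positives = N_HYPOTHESES * ALPHA

# A Bonferroni correction is one common guard: divide the threshold by the test count.
bonferroni_alpha = ALPHA / N_HYPOTHESES

print(f"P(at least one 'hit') = {p_any:.2f}")
print(f"expected false positives = {expected_false_positives:.1f}")
print(f"Bonferroni threshold = {bonferroni_alpha:.4f}")
```

Under those assumptions you expect about one spurious "confirmation" per 20 tries, which is the joke.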


Fine-grained measurement is useful when you have options for fine-grained action.

You don't need a chip to tell you that the soil is dry, but if you can use that chip to regulate drip irrigation that can apply substantially different flow to different plants, then you can get a not-too-much, not-too-little watering even if you have a big variation in conditions.

You don't need a big analysis to acknowledge that everybody knows that a particular competitor has lower or higher prices and adjust your pricing; but doing that continuously on a per-product basis does require data and analysis.


Agreed. But how many executives will agree to take these fine-grained actions to achieve value from the data? How many data teams are able to build up a strong-enough argument to convince them?

I've worked on many product-led-growth initiatives in the software industry. The software industry is probably the biggest 'believer in data' there is -- many scientific-forward minds who understand the value. However, even in the software industry, it's really hard to convince folks that if you make 5 improvements that net 1% conversion gain each, you can dramatically improve revenue.
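
For scale, a rough sketch of what five 1% gains add up to. How big the effect is depends on what "1%" means, so both readings are shown; the 2% baseline conversion rate is a made-up number for illustration, not from any real product.

```python
GAIN = 0.01
STEPS = 5

# Reading 1 (assumption): each improvement is a 1% relative lift, and lifts compound.
relative_lift = (1 + GAIN) ** STEPS - 1

# Reading 2 (assumption): each improvement adds 1 percentage point
# to a hypothetical 2% baseline conversion rate.
BASELINE = 0.02
point_lift = (BASELINE + STEPS * GAIN) / BASELINE - 1

print(f"relative reading: +{relative_lift:.1%}")
print(f"percentage-point reading: +{point_lift:.0%}")
```

The relative reading compounds to a bit over 5%; the percentage-point reading on a low baseline is a multiple of the original conversion rate.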


Most of the time this story is true, but think of it this way: the person who was using the system was an expert on the subject. If you can replace the expert with a person just looking at a graph from time to time to know whether the soil needs irrigating, it's a different thing. Most of the data or ML tools show us something that the client, as an expert, already knows, but the true power of these tools is to give them to a non-expert user and get roughly the same level of proficiency.


I've been telling folks, storing everything all the time is wasteful, a better alternative is:

1. Keep the raw full data for short period of time, at most 1 month.

2. Downsample what you need for longer period of time (5-10% of the full data).

3. Aggregate your metrics on a yearly basis to save money and compute costs.
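
The three steps above could be sketched roughly like this. The function name, the (timestamp, value) event shape and the choice to build yearly aggregates from the full data before sampling are all my assumptions, not a description of any particular system.

```python
import random
from collections import defaultdict
from datetime import datetime, timedelta

RAW_RETENTION = timedelta(days=30)  # step 1: "at most 1 month"
SAMPLE_RATE = 0.10                  # step 2: upper end of "5-10%"


def apply_retention(events, now, rng):
    """Split (timestamp, value) events into raw, sampled, and yearly totals."""
    raw, sampled = [], []
    yearly = defaultdict(lambda: {"count": 0, "total": 0.0})
    for ts, value in events:
        # Step 3: every event feeds the yearly aggregate before anything is dropped.
        bucket = yearly[ts.year]
        bucket["count"] += 1
        bucket["total"] += value
        if now - ts <= RAW_RETENTION:
            raw.append((ts, value))      # step 1: keep recent data in full
        elif rng.random() < SAMPLE_RATE:
            sampled.append((ts, value))  # step 2: keep a random sample of older data
    return raw, sampled, dict(yearly)


# Hypothetical usage: one event per day for the last 400 days.
now = datetime(2023, 2, 7)
events = [(now - timedelta(days=d), 1.0) for d in range(400)]
raw, sampled, yearly = apply_retention(events, now, random.Random(0))
print(len(raw), len(sampled), {year: b["count"] for year, b in yearly.items()})
```

Aggregating before sampling keeps the yearly totals exact; aggregating from the sample instead would be cheaper but only approximate.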


Yeah, there is perhaps a data usefulness duck curve. First data useful for specific immediate problems, not a lot of use for a while after that, then 15 or 20 years later, the big picture trends start to provide value for big decision making.

Not many orgs keep their data that long, though. Or even think about the future that far.


This analysis reminds me of the big interest in the use of hyperspectral imaging for agriculture. The idea was the greater spectral resolution (greater than Landsat) would result in more interesting information. Agriculture was one of the applications. But, once you did find the interesting stuff, you no longer needed a hyperspectral sensor. You could just look at one spot with a much lower cost sensor.

So hyperspectral, like big data, is useful up front. But in the end, much simpler tools and algorithms will solve the problem on a continuing basis.


Oh, those soil moisture sensors, they are so fascinating.

I spent a number of exciting years developing a high frequency soil impedance scanner and finally understood why I was doing it. To confirm the obvious :)


Interesting. Sounds as if what's really needed isn't so much collecting and analysing lots of data, but an alarm that's triggered when observations deviate from a set of assumptions. Observations that confirm some definition of "normalcy" -- as most observations would -- can be discarded.
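
Something like this minimal sketch, where the "set of assumptions" is reduced to an expected range. The class name, thresholds and readings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExpectedRange:
    """The 'assumptions': readings inside [low, high] count as normal."""
    low: float
    high: float


def filter_deviations(readings, expected):
    """Return (index, reading) pairs outside the expected range; discard the rest."""
    return [
        (i, r)
        for i, r in enumerate(readings)
        if not expected.low <= r <= expected.high
    ]


# Hypothetical soil moisture readings (volumetric fraction).
soil_moisture = [0.31, 0.30, 0.29, 0.12, 0.30, 0.45]
alerts = filter_deviations(soil_moisture, ExpectedRange(low=0.20, high=0.40))
print(alerts)
```

Only the two out-of-range readings survive; everything that confirms "normal" is dropped at the source.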


I think there's a problem at the heart of the matter, specifically the idea that the act of measurement is in itself powerful, when in point of fact this isn't universally the case. As the old adage goes: "garbage in, garbage out." Even more troubling, there is a physical limit to our ability to model what we measure. Take the retina, it has around a million light receptors and even if you assumed they only have two valid states then you're left with around 10^300,000 bits of information to process, so good luck with that. The same thing applies to whatever firms are measuring and what they think is conveying relevant information, as they'll have similarly exponential increases if they don't filter out the vast majority of irrelevant data points and states.


> it has around a million light receptors and even if you assumed they only have two valid states then you're left with around 10^300,000 bits of information to process...

That would only be a million bits (1 Mb). You're counting potential states, not bits.
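
A quick check of that arithmetic, as a minimal sketch: n two-state receptors carry n bits, while the number of possible joint states is 2^n. The variable names are mine.

```python
import math

RECEPTORS = 1_000_000     # "around a million light receptors"
STATES_PER_RECEPTOR = 2   # "only two valid states"

# Information content: one bit per two-state receptor.
bits = RECEPTORS * math.log2(STATES_PER_RECEPTOR)

# Number of joint states is 2**RECEPTORS; compute its base-10 exponent
# without building the huge integer.
log10_states = RECEPTORS * math.log10(STATES_PER_RECEPTOR)

print(f"bits = {bits:,.0f}")
print(f"possible states = about 10^{log10_states:,.0f}")
```

So the "10^300,000" figure is roughly the count of possible states (2^1,000,000), not the number of bits.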


Very interesting. Left AgTech last year but had similar experiences, even worse where often the single most prominent use-case was to follow some painful necessary documentation of ag inputs (chemical, seeds, fertiliser) to get subsidies. Real inputs from data? Nah!


But isn't this the essence of industrialization and automation? Measure, adjust process; repeat until feedback loop is stable - document and keep doing the thing that works, over and over?

If you want Toyota style continuous improvement you would need to improve in new areas of the process / new metrics, most of the time?


On the flip side, there's some great action coming from data insights. Look at Strella Biotech - they're putting sensors in sealed warehouses to detect spoilage for certain vegetables and fruits. That's something that can have great returns with just a few IoT devices and a novel sensor.


Or they could put on their work boots and go walk around the field and kick a few dirt clods, or science forbid! put a hand in the soil to check the moisture content.


> goes geek-crazy for the first year or so

The problem is that they don't stay geek-crazy?


While I get that they're sometimes useful to trigger debate, I don't really subscribe to very bold statements.

We are drowning in data, it's all around us. Information overload is real. Data enables most of our daily digital experiences, from operational data to insights in the form of user facing analytics. Data systems are the backbone of the digital life.

It's an ocean and it's all about the vessel you pick to navigate it. I don't believe that the vessel should dictate the size of the ocean, it's simply constrained by its capabilities. The trick is to pick the right vessel for the job, whether you want to go fast, go far or fish for insights (ok, I need to stop pushing on this metaphor).

This visionary paper from Michael Stonebraker (2005) predicted it quite accurately and I think is still relevant: https://cs.brown.edu/~ugur/fits_all.pdf

Databases come in various flavours and the "trends" are simply a reflection of what the current era needs

Disclaimer: I work at ClickHouse


100% agree. One of the biggest assets we had at <driver and rider marketplace app> was the data we collected. We built models on it that would determine how markets were run and whether drivers and passengers were safe. These were key features that enabled us to bring a quality service to customers (over ye ol' taxi). The same applied to the autonomous cars, bikes, and scooters. We used data to improve placement of vehicles to help us anticipate and meet demand. It was insane how much data was used to build these models.

To say big data is dead sounds to me like someone desperate for eyeballs.

I do think there is a huge opportunity for DuckDB - running analytics on 'not quite big data' is a market that has always existed and is arguably growing. I've seen way too many people trying to use Postgres for analyzing 10 Billion row tables and people booting up an EMR cluster to hit the same 10 Billion rows. There is a huge sweet spot for DuckDB here where you can grab a slice of the data you are interested in, take it home and slice and dice it as you please on your local computer. I did this just this weekend on DuckDB _and_ ClickHouse!

Disclaimer: I work at a company that is entirely based on ClickHouse.


Didn't know that Posthog is based on CH these days. Interesting!


Check the list of companies using ClickHouse: https://clickhouse.com/docs/en/introduction/adopters/


Really neat that you scour job postings to learn useful intelligence about companies using your product. I do this too :)

I'm curious how you have this set up. Is it currently a manual process or you use social monitoring tools to help you find mentions of ClickHouse in the wild?



Thanks for the reply :-) but your link is only for tracking mentions on the HN website.

I was asking about how they are able to track mentions, across the web, of companies using ClickHouse. This type of info is usually listed in the tech stack section of job descriptions (and these links tend to expire once the position is filled).


I guess the article title is a "bold statement" but maybe the biggest insight in there is that people don't think hard enough about throwing old data away, and it hurts them. This is a liferaft for drowning in data and is more "bold" organizationally, as it actually takes a certain kind of courage to realize you should just throw stuff away instead of succumb to the false comfort that "hey you never know when you might need it".

Weirdly there's a similar thing that can happen to codebases, specifically unit tests and test fixtures that outlive any of their original programmers: nobody understands what's actually being tested, and before each release the team loses days/weeks hammering to "fix the test". The only solution is to throw it away, but good luck getting most teams to ever do that, because of the false comfort they get -- even though that fixture is now just testing itself and not protecting you from any actual bugs.

I mean how often does Netflix need to look at viewing habits from 2015? Summarize and throw it away.


I am baffled by this comment.

Throwing out unit tests? If you make a change and it fails a test, then you fix the bug or fix the test. I can't even imagine in what universe it's a good idea to throw away a test if it covers code in use. In what universe are unit tests "false comfort"? And if "nobody understands what's actually being tested" then you've got huge problems with your development practices.

Similarly, viewing habits from 2015 are tremendously important. There may be a show they're releasing soon that is most similar to a title released in 2015, and those stats will provide the best model. "Summarize" requires knowing how data will be used in the future, but will likely throw away what you need. Not to mention how useful and profitable vast quantities of data are for ML training.

Storing data is incredibly cheap. I'm actually curious where this desire to throw away old data comes from? I've literally never encountered it before, and it flies in the face of everything I've ever learned. The only context I know it from is data retention policies, but that's solely to limit legal liability.


Unit tests are only potentially of value if the code is changing. And 90% of code never changes. And 99% of unit tests never fail. Almost all of the value of unit tests comes at the time of writing (a tiny percentage of) them.

After that, they become a liability that slows down builds and makes changes brittle and codebases sclerotic.

A few good unit tests are a lot better than a bunch of bad ones. And even from your statement we can tell a much more pernicious risk — the false beliefs that code coverage measures whether code is tested and that a code coverage percentage is a mark of quality or safety in its own right.


The unit test story is indeed bizarre. Done right unit tests should test the unit, and you'll never hit these problems.

The villains here were monstrous test fixtures instead of mocks, "testing the fixture" instead of testing the code. Both were agency trading systems so "platforms" of a sort that needed significant refactoring to mock properly, so instead tests had to inject essentially fake concrete services.

Somehow I joined teams twice in my career that were trapped under this (who both indeed had "huge problems with their development practices") as their only coverage. The only way out is to write all new unit tests.


Mocks are the personification of bad data. The only meaningful measurement derived from tests with mocks is how bad the architecture is.


I don't know what you're criticizing here. I was contrasting "mocks" and "fixtures" in the context of unit tests as ways to instrument services depended on by the code under test.

A "mock" in this paradigm is some kind of testing technology that allows you to directly instrument return values for function calls on the dependent service, whereas a "fixture" is some concrete test-only thing you coded up to use in your tests.

If a fixture just acts as a dummy return-value provider, no problem (but you probably should have used a mocking solution). The problem that arises is fixture code that simulates some or all of the production service code, and/or (even worse) allowing modification of production code to allow use as a test fixture. This is the way to madness.
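
To illustrate the distinction, a minimal sketch using Python's unittest.mock. The pricing service, `order_cost` and `FixedPriceFixture` are hypothetical names made up for the example, not anything from the systems I mentioned.

```python
from unittest.mock import Mock


def order_cost(pricing_service, symbol, quantity):
    """Code under test: depends on an external pricing service (hypothetical)."""
    return pricing_service.get_price(symbol) * quantity


# Mock: instrument the return value directly, with no service logic at all.
mock_service = Mock()
mock_service.get_price.return_value = 10.0
mock_total = order_cost(mock_service, "ABC", 3)
mock_service.get_price.assert_called_once_with("ABC")


# Fixture used purely as a dummy return-value provider: still fine.
class FixedPriceFixture:
    def get_price(self, symbol):
        return 10.0


fixture_total = order_cost(FixedPriceFixture(), "ABC", 3)

print(mock_total, fixture_total)
```

The trouble starts when something like `FixedPriceFixture` grows its own pricing rules, caches and state to imitate production; at that point the tests are exercising the fixture rather than `order_cost`.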


The heading is definitely “clickbait-ey” but the quality of the content was worth it. I probably would have missed the article without the headline. And I am already applying the insights gained.


This posting was great. Highly recommended reading through. It gets really good when the author hits "Data is a Liability".

> An alternate definition of Big Data is “when the cost of keeping data around is less than the cost of figuring out what to throw away.”

This is exactly it. It's way too hard to go through and make decisions about what to throw away. In many respects, companies are the ultimate hoarders and can't fathom throwing any data away, Just In Case.

Really appreciated the post overall. Very insightful.

As an anecdote to this article, when business folks have come up to me and asked about storing their data in a Big Data facility, I have never found the justification to recommend it. Like, if your data can fit into RAM, what exactly are we talking about Big Data for?


  In a larger sense, it's a challenge to throw away stuff, just as it's difficult to trim big data.  

  As I reach retirement, our attic, bookshelves, and cabinets must be trimmed -- and each item requires attention and a decision.
  
  Some things in the attic are obvious liabilities (what to do with a mercury barometer? A radium dial pocket watch? Old electronics?)  Disposing of other stuff requires time, insight, and a sense of the future (should we keep those fingerpainted scribbles from when the kids were 3?  How about those cheesy trophies from chess club? Computer books from the 1970's? Betamax home movies? Record albums?)


Where is this excerpt from? Did you write it yourself?


For what it's worth: I throw things away when I haven't touched them in a year.


> if your data can fit into RAM, what exactly are we talking about Big Data for?

That's a fantastic point, and I keep mentioning the COST paper to anyone who cares:

https://www.usenix.org/system/files/conference/hotos15/hotos...


COVID was proof The Right Data is better than Big Data. All those data sources to measure how many sick people we have and it turns out we just need one: Wastewater.


Or another way to look at it - if we make a data lake that collects everyone’s shit we might find something useful in it!


There is literally a post on the front page on ChatGPT, and Microsoft and Google are preparing to duke it out starting in the _next 2 days_ over big-data generated 'chat' results.

Big data was never going to be useful to even medium size enterprises, unless anyone can get public access to PBs of data, but that doesn't mean big data is dead. ChatGPT is literally changing how school will test their students, for a start.

Maybe what the author is trying to say is 'small-scale big data is dead, but big data chugs on.'


>ChatGPT is literally changing how school will test their students, for a start.

Sure, instead of schools checking for plagiarism from other students' papers using turnitin.com, they'll check for plagiarism using ChatGPT tools that scan for known output from their industrial-scale amalgamation of plagiarized materials. Big whoop.


It appears that it's hard to detect AI generated content. E.g. true detection rates are only around 25% and there are also techniques to further mask output [1].

[1] https://www.nbcnews.com/tech/innovation/chatgpt-can-help-foo...


You only need enough to flag suspect content, and then the teacher calls the student in for a quick oral exam - the fakers will flounder, the reals will pass.


Even if you understand the material well, doing assigned tasks takes a lot of time. Especially if it's free form text. And "I want to do something more interesting right now" is at least as powerful a motivator to cheat assignments as "I don't know how to do this".


Aye, if I mostly know a subject why bother to take the time to write 10 pages when I can outsource that to a bot, and if they question me I can do a 20 min oral interview and save myself the trouble?

seems like a no-brainer in the long term.


I mean, that's certainly part of it. But you're also seeing very fast restructuring of individual courses (because programs overall will definitely take years to restructure because higher education moves so damned slow) to account for these tools.

In the small institution I am currently working with, the English courses, in one week, integrated chatgpt as a tool for students to work with. It's part of the collaborative idea building and development process now for every student enrolled in creative writing and writing analysis classes, and that happened in one week. I cannot stress enough how unbelievably fast that is for higher ed. That's faster than light speed.

And we're not even that well resourced. I have to imagine there are other examples where it's more than just running through a bot to scan for known outputs.


Sounds like they simply panicked and threw something together with little thought or preparation, what, right in the middle of an actual course? And they want to charge kids for this kind of 'expert instruction'? I'd be pissed as a student.


What a weird assumption you made there. In what way does what I wrote sound panicked? Because it was a week? Yes it was fast, but it was a massive effort of the entire English faculty.

It's integrated into existing assignments, modifying processes that are super well established already. It was like integrating a new person into the class. Also, it was before semester, so the students literally saw nothing weird; that was another strange assumption for you to make.

Just a tip, and don't read tone in this statement, but don't assume things. 9/10 times you're going to be incorrect. It's much better to ask questions, instead of making statements with question marks at the end of them.


I simply don’t buy that they had anywhere close to enough time to integrate this into an existing curriculum in a way that would meet the high standards paying students should have for a university education. I don’t have to ask how an English department became experts in using a just released AI in the classroom in a week because I don’t think that’s what they actually achieved.


Again, stop putting words in my mouth, please. I never said they became experts. I said they integrated it into existing processes. I said that they were doing something other than just scanning for chatgpt hits like plagiarism checkers.

>It's part of the collaborative idea building and development process now for every student enrolled in creative writing and writing analysis classes

>It's integrated into existing assignments, modifying processes that are super well established already. It was like integrating a new person into the class.

I don't honestly know how else to say that. I legitimately do not know how to help you understand what I'm saying.


Not who you are replying to. I found your example fascinating - would you be able to share one or two concrete examples of this integration you mentioned? I would like a low-level peek or two into how the teaching landscape is changing in light of the rise of LLMs like chatGPT.


Sure - I'll use one of the creative writing classes.

In the past, the class would be centered around ideas and themes the class came up with together during the first week of the semester. They would then read and discuss short stories from various authors centered around that theme, preferably from different eras and/or cultures. From there, they would work in pairs/small groups to flesh out original ideas they came up with for their own stories based on the themes and styles from the first month of reading. Then they would work individually to write the stories. Finally, they would come back together to edit and work through that process as a group, with a reading and discussion in the last week or two.

Now, the class works alone the first day or two to discuss themes with chatgpt, to identify relevant and appropriate literature (if possible), and to flesh out initial ideas for the readings and discussion topics. Then we come back together for group work to figure out themes for the semester and possible readings, like it used to be, but primed with whatever information they had already discussed with Chatgpt. From there they work through the readings, using chatgpt to bounce ideas for discussion off of before coming to class. Then they work in teams, like before, to flesh out their own stories, but again, using chatgpt as if it were another member of the group. They also have to track their question/comment and response, to evaluate their own thought processes and look for weak links in their reasoning and logic. Students are then free to use the software to edit their writing before coming together with their creative assignments.

We haven't made it past the step of reading, as it's still the first month of the semester. But, the discussion for the first section has been much more in-depth and (to use a word that is impossible to quantify) vibrant. The students had already aired the ideas they thought may be dumb, and would therefore be less willing to voice in a public setting; this allows them to really dive further into whatever is in each story, and connect dots between stories that generally took a week or two. Because they have another 'person' to talk to whenever they want, and however much they want, they tend to really get into the work. Further, and unexpectedly (and anecdotally sadly) it has allowed a couple of students I know personally who would not have been willing to participate in public discussion (anxiety disorder and TBI) a stronger voice, because they know their ideas are fleshed out already, and it provides them with a 'script' of ideas they know are novel and valuable for the class. In other words, if they were able to get the idea from Chatgpt, they knew they had more work to do to build that thought, because that is really just a baseline.

I'm interested to see how the actual writing process goes. Chatgpt is okay at creative writing, but not at the level that we expect to see. Some faculty expressed concerns that students would just have the software do the work for them, but I'm not really bothered by that. First off, if someone can figure out how to make a career out of publishing chatgpt prompts, well, good for them. Second, if the software is able to write better than the students are, they don't really belong in this class in particular (graduate level writing class).

Anyway, like I said earlier in the thread - we're really just treating this as another 'person' in the class, but who is available to everyone, all the time. It's beneficial for the brainstorming sections, but not for the hard skills, from what I can tell. I am most excited to watch them evaluate their own thought processes when working with chatgpt to flesh out their ideas. In the past, this has been a barrier, because every single person I’ve ever met never gives a partner 100% of their thoughts all the time. They always hold back. My theory is that, because it’s a dead-end conversation, they will be more willing to push into new topics, and really think hard about what they’re trying to say. We’ll see if that stands.


I agree a lot with your analysis - thanks for sharing your insights.


It sounds like they’re chasing hype, like every idiotic modern US movement to “reform” education.


> don't assume things. 9/10 times you're going to be incorrect

Isn't that... an assumption?


No, it's an assertion. It's like the bastard cousin of an assumption, in that it's only incorrect 8.67 times out of 10.


It's an assertion of an assumption, unless you've gone and measured.


Wow, instead of learning and being creative, just write out whatever the generic safe whitewashed chatbot says?

Why pay for the college?


Is it crazy to think that instead of stepping up in the war against AI we instead try to figure out a way to teach kids assuming they will use AIs?


Are we trying to produce adults who are able to think critically and creatively, and who reach their full intellectual potential, or are we trying to produce adults who can push a few buttons and blindly believe what the machine tells them?


While likely not how it would work out in practice, you would hope that with better tools would also come higher standards. If you expect more complex, more thorough, and/or less error-prone output from students using AI then you don’t necessarily have to lower how much critical and creative insight they need to have. Like the difference in a test that does and doesn’t allow calculators, you always have to fit the assignments to the tools that are used for them.


> In fact, [writing] will introduce forgetfulness into the soul of those who learn it: they will not practice using their memory because they will put their trust in writing, which is external and depends on signs that belong to others, instead of trying to remember from the inside, completely on their own.

- attributed to Socrates by Plato. c.399-347 BCE. “Phaedrus.”


Pretty sure there's a fallacy named after this whole "hey this is just exactly like before so we have nothing to be concerned about".


The point is that all technology is a tool. Whether it be writing, calculators, or various narrow AI software. We can either bemoan the loss of a now-less-useful skill (memorization, long division, longform writing), or learn how to use these tools to better achieve our goals.


Technologies develop for different purposes and have different effects. Modern digital technology has a very inhumane origin story, as noted in “New Dark Age” by James Bridle, and other works. https://jamesbridle.com/books/new-dark-age


The scene from the film Idiocracy where the main character is being triaged in a doctor's office comes to mind.


The short story MANNA about AI directed folks and the future that creates comes to mind:

https://marshallbrain.com/manna1


Definitely the second one


The average human hates changes of things they've grown used to.

People are very attached to school being just like it was when they went.


As long as it pushes the students into mental effort I have no problem with changes in education. The world changes, it’s normal to have changes. But there’s nothing wrong to compare against prior models, sometimes we revisit abandoned ideas, artworks, models etc


Isn't the problem that kids (like all of us) are lazy and would rather use tools than their brain, rather than people trying to keep school "just like it was"?


Learn your times tables and then learn to use a calculator.

If you don’t do it that way, you’ll always be fooled by whatever the screen tells you.


That's how we teach today, but that doesn't mean it's the only way. As a counter example, in chess, AI makes a great learning partner, and from beginners to world champions the level of proficiency has increased immensely in the last 30 years due to better chess engines.


All math homework through the high school level is now as simple as figuring out how to describe it to ChatGPT (or maybe ChatGPT 2.0 for particularly tricky examples). Paper-writing is now a matter of figuring out how to rephrase LLM output in your own words to get around any watermarking or pattern detection.


Wolfram alpha has been around for math cheats (and people like me who just needed a more visual representation to learn) for a while now. Including proof of work.


Years before wolfram alpha, we had TI-89s with computer algebra systems for cheating your way through highschool math.


Oh yeah, it's why most of my classes restricted us to TI-83s. The TI-89 was restricted in schools to basically calc and above, and the TI-92 was just banned. Lol


In my school TI-89s were unrestricted, but I think that was mostly because teachers were only trained with 83s and assumed the 89 had equivalent capabilities. The SAT permitting the 89 probably had something to do with it too, since the 92 was banned (because qwerty as I understand it.)


Maybe this should give you a hint that said work is completely pointless and merely exists to waste students' time.


You can't figure out how to describe a math problem correctly to ChatGPT without already knowing the solution.


what if teachers ask chatgpt how best to test students despite the existence of a tool like chatgpt enabling cheating


> industrial-scale amalgamation of plagiarized materials

More Big Data!


Or just require students to use software that keeps track of version history or require spyware installed.


Go back to longhand. Even if they are plagiarizing, they might learn something from the exercise of rewriting by hand.


That's solved by putting a pen on your 3d printer.


Or a second-hand Cricut...


That is what the author said. From the article: "Big Data is real, but most people may not need to worry about it."


> ChatGPT is literally changing how school will test their students, for a start.

Here's a novel idea: test students using pen and paper?


Or, preferably, admit that testing wasn't a good idea to begin with and focus on optimizing children for learning, not test-taking.


I don't even think we're just talking about children either. Test taking in academia (university and above) could stand a much needed fresh look.

I am hopeful that a change happens in academia to prepare students for jobs, which is why they are going to school in the first place. Yes, students need to learn how to "think", but really they are wanting to get the technical skills to perform their duties more than anything.

We have bestowed too much credence in traditional academia not useful to the average person or average job. College is a "game" for most students, and they put up with going through the motions of testing, etc. for the sake of the diploma at the end.

I hope we're going to enter a new era of what college means for those looking to get something different out of it.


At some point you need to see what people know to measure their progress and help them.


Teachers assign better scores to papers with better penmanship. I forget how strong the effect was, but using a keyboard does help equalize some biases.


Why equalize that bias?


Because there are a plethora of disabilities which make neat penmanship difficult.


Scan the hand-written work back into a digital format and present the results to the teachers for evaluation in their preferred typeface.


Keyboards and screens in an exam hall then?


Typewriters for everyone!


In Italy we test with pen and paper and the quality of Italian schools is abysmal. The point is not the test, it is the quality of the teachers. They are the only hope to form good humans, not some standardized test.


I kind of doubt they trained chatgpt on petabytes of application logs and web server logs. Is keeping all of this crap even useful for more than a small amount of time at this scale?

Actual good information will always be useful, most of this "big data" seems to be the equivalent of recording background static.


Big Data drives the most profitable and society bending changes of all time, just to serve us better Ads.


Okay, Google as a company and as a product is definitely in the top 1%, or top 0.0001% where big data drives profitable and bending changes ;-)


Yes, this occurred to me as well. The counter-narrative here is in fact that the story of the last 2-3 years has been that breakthroughs in AI have come about mostly by scaling up network sizes and training data sets by 5 orders of magnitude or so.

I guess the takeaway however is still that regular businesses really just can't play in this game and should not be assuming they have big data until that fact asserts itself out of necessity rather than the other way around.


That's a completely different topic. "Data" is obviously a pretty generic term and "large sets of data" are going to be more and more relevant to the world in general. What he's talking about is the Big Data trend in industry specifically around Business Intelligence (BI). That is, collecting as much data as possible on your users to optimize your product experience and profits. Tracking clicks, purchases, form dropoffs, email opens, ad impressions. It's mostly going to be first-party data (ie what did they do with our own products and content).

ChatGPT and the like are not going to get much use from that kind of data and instead are looking at a giant corpus of text and images scraped from a variety of public sources to infer what humans might think sounds smart. It's possible the two worlds will meet, but that's probably not what's going to be announced this week.


I wonder how long until training today’s ChatGPT will cost $1000 of AWS compute. 10 years?

At that point, does it keep scaling or is there an S curve where 100x more data and compute only leads to a 2x improvement?


> At that point, does it keep scaling or is there an S curve where 100x more data and compute only leads to a 2x improvement?

Careful with the scales. 2x improvement could be interpreted from 80% human performance to 160% human performance. Or going from 10% error rate to 5% (which again crosses into superhuman territory on some tasks). Those last few bits are the critical ones.
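A small sketch of why the scale matters, using the two readings of "2x" mentioned above. The function names are made up for illustration; the numbers are the ones from the comment.

```python
def double_relative_score(percent_of_human):
    """Reading 1: '2x' means doubling a score measured against human performance."""
    return 2 * percent_of_human


def halve_error_rate(error_percent):
    """Reading 2: '2x' means cutting the error rate in half."""
    return error_percent / 2


score_after = double_relative_score(80.0)   # 80% of human -> 160% of human
error_after = halve_error_rate(10.0)        # 10% error -> 5% error

# On an accuracy scale the second reading is only a 5-point move (90% -> 95%),
# even though it is the same "2x" on the error scale.
accuracy_gain_points = (100.0 - error_after) - (100.0 - 10.0)
```

The same factor of two looks dramatic on one scale and modest on another, so a claim of diminishing returns depends on which one is being measured.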


We would need to see incredible advances in energy efficiency for that to happen.


We are! Don't even have to reach for fusion potentially being commercial technology to show it. Solar is already approaching $0.03/kilowatt hour and likely to be half of that by the end of the decade. Energy getting very cheap coupled with computing capacity continuing to go way up is going to enable lots of interesting new technologies beyond LLMs


I think you're underestimating the scale by a few orders of magnitude.


> Maybe what the author is trying to say is 'small-scale big data is dead, but big data chugs on.'

That's pretty much exactly what the author says in the article.


OpenAI and Google are clearly in the 1% as TFA describes.


MotherDuck has been making the rounds with a big funding announcement [1], and a lot of posts like this one. As a life-long data industry person, I agree with nearly all of what Jordan and Ryan are saying. It all tracks with my personal experience on both the customer and vendor side of "Big Data".

That being said, what's the product? The website says "Commercializing DuckDB", but that doesn't give much of an idea of what they're offering. DuckDB is already super easy to use out of the box, so what's their value-add? It's still a super young company, so I'm sure all that is being figured out as we speak, but if any MotherDuckers are on here, I'd love to hear more about the actual thing that you're building.

[1]: https://techcrunch.com/2022/11/15/motherduck-secures-investm...


We're being a bit hand-wavy with the offering while we're in "build" mode, because we don't want to sell vaporware. DuckDB is easy to use out of the box, but so is Postgres, and there are plenty of folks building interesting cloud services using Postgres, from Aurora to Neon. And as many people will point out, DuckDB is not a data platform on its own.

For a preview of what we're doing, on the technical side, a couple of our engineers gave a talk at DuckCon last week in Brussels, it is on youtube here: https://www.youtube.com/watch?v=tNNaG7e8_n8

(for context I'm the author of this blog post and co-founder of MotherDuck)


Deliberately speculating so someone will correct it: I'd guess they'll make a bunch of enterprise tools to do things like: enable access and synch the data in a way which complies with various policy, encrypt/tokenize/hide certain columns etc, monitor queries, ensure data is encrypted at rest, stuff like that.

Assuming the above is true: I'll bet the reason they aren't so loud about exactly what they are doing is they want to get a head start on it. In theory anyone can build this stuff around DuckDB. From a marketing perspective the clever thing to do would be drive up usage of DuckDB while they build out all this functionality and then the minute corporates start seeing problems with their people using it (compliance etc), they have the solutions.


I'd wager you're right. All the "boring" stuff that's actually very complicated/difficult, and without which no large enterprise will adopt a technology.


Especially since enterprise companies hate the idea of shifting large amounts of highly sensitive company data onto commonly lost and misplaced work laptops.

If you're going to do that you better have your security and governance on point.


Shoot me a note and I'm happy to fill you in!

(PM at MotherDuck)


Is there a reason you can't post it here?


Would love to! How do I get in touch? My contact info is in my profile.


done!


If you are willing to fill me in, I'd love to hear what you guys are up to. My username is my gmail username too.


> DuckDB is already super easy to use out of the box, so what's their value-add?

I think this is the analytics equivalent of edge computing, instead of one big cluster crunching numbers.

1. User requests bunch of analytics

2. Server assembles a duckdb file

3. Sends this down to the user's laptop

4. User runs local queries on the duckfile

5. Go to step 1 for more analytics
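The loop above can be sketched roughly as follows. This is a guess at the shape of the workflow, not MotherDuck's actual design. To keep the sketch runnable with only the standard library, it uses Python's `sqlite3` as a stand-in for a single-file DuckDB database, and the `sales` table, its rows, and the helper names are all hypothetical.

```python
import os
import sqlite3
import tempfile


def build_extract(path, rows):
    """Server side (step 2): assemble a single-file database extract."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    con.commit()
    con.close()


def query_local(path, sql):
    """Client side (step 4): run a query against the local copy of the file."""
    con = sqlite3.connect(path)
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()


# Hypothetical data standing in for whatever the user asked for in step 1.
rows = [("north", 10.0), ("south", 5.0), ("north", 2.5)]

with tempfile.TemporaryDirectory() as tmp:
    extract = os.path.join(tmp, "extract.db")
    build_extract(extract, rows)
    # Step 3 (shipping the file to the laptop) is skipped: the file is already local.
    totals = query_local(
        extract,
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
    )
```

Once the file is on the laptop, every follow-up query in step 4 runs without a round trip to the server. Only a request for data outside the extract sends the user back to step 1.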


I agree with many of the points here.

My cheap no-name old laptop SSD writes with 170MB/s.

A customer has a name, address, email and order. Let's say 200 bytes for each. That means I can write 844000 new customers per second, far outside my personal marketing reach.

My disk is 240GB, which means I can store data for 1.2 billion customers. It'll take a while until I become that successful.
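The back-of-the-envelope numbers can be checked in a few lines. This sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and reads "200 bytes" as the size of a whole customer record; both are assumptions, since the comment does not say.

```python
WRITE_SPEED_BYTES_PER_S = 170 * 10**6  # 170 MB/s, decimal megabytes (assumption)
DISK_BYTES = 240 * 10**9               # 240 GB, decimal gigabytes (assumption)
RECORD_BYTES = 200                     # one customer record, as estimated above

# How many records the SSD can absorb per second at full sequential write speed.
customers_per_second = WRITE_SPEED_BYTES_PER_S // RECORD_BYTES

# How many records fit on the disk, ignoring filesystem and index overhead.
customers_on_disk = DISK_BYTES // RECORD_BYTES
```

Under these assumptions the write rate comes out at 850,000 records per second, in the same ballpark as the figure quoted above, and the capacity at 1.2 billion records. The exact values shift with the unit convention and the measured throughput.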


Presumably the "order" you mention is a primary key to another table, likely one that references the individual items that make up that order, so the data will be much larger than you estimate.

It will grow larger still if you include web logs from your e-commerce site and event data from your mobile app so that you can correlate these orders with items that customers considered but ultimately didn't buy. How will your laptop and SSD perform when you then build a user-item matrix to generate product recommendations for each of those 1.2 billion customers?

While plenty of organizations unnecessarily use Big Data tools to store and analyze relatively small amounts of data, there are plenty of customers with enough data to require them. I've seen plenty of them firsthand.


There are functionally less than 1000 organizations that currently require distributed compute for data analysis. You can get off the shelf AWS units with 1000 cores, terabytes of ram and storage, etc. The cost of compute has decreased faster than the amount of data we have to store and process. What we used to do with spark jobs we can do with python on a single box.


Let's assume your completely made-up 1000 organisations claim is true.

Right now I work for one of them: a global investment bank.

Within that organisation we have at least 100+ Spark clusters across the organisation doing distributed compute. And at least in our teams we have tight SLAs where a simple Python script simply can't deliver the results quick enough. Those jobs underpin 10s of billions of dollars in revenue and so for us money is not important, performance is.

So 1000 x 100 = 100,000 teams, all of whom I speak for, disagree with you.


Disagree with what? I never said _you_ are a dummy for using distributed compute. There are many good applications for distributed compute. I used spark and flink at a big tech job. The stack worked well for some things, and for others it was a hammer looking for a nail. What you do not see is that for every team that you work with and consider a peer group to you, there are 100 teams that really do not need distributed compute, because they have an org wide infra budget of <3M dollars and a total addressable data lake of less than 1TB, but they are implementing very expensive distributed compute solutions recommended by either a Deloitte consultant or a very junior engineer. Should an IB with an infra budget in the 100M+ zone use distributed compute solutions? Absolutely. There just aren't that many of these orgs.


This is not true. Any column store database (bigquery, Redshift, snowflake) implements distributed compute behind the scenes. When an analyst or business intelligence person has a query return in 3 seconds instead of 15 seconds, it's actually huge. Not just in the aggregate amount of time saved, but in creating a quick feedback loop for testing hypotheses. This is especially true considering that most analyst type people look at data as aggregates across some dimension (e.g. sales per month, unique visitors per region, etc...)

These types of questions are orders of magnitude faster with a distributed backend.


Yup.

I was just playing with some data from our manufacturing system, about 30 GB. I pulled the data to my laptop (very expensive Apple one) and while it fits on my disk just fine, it took about 15 minutes to download.

I imported it to ClickHouse which took a while due to figuring out whatever compression and LowCardinality() and so on. I ran a query and it took ClickHouse about 15 seconds. DuckDB pointed to the parquet files on my SSD took 19 seconds to do the same. Our big data tool took 2 seconds, while working with data directly in cloud storage.

Now of course this is entirely unfair - the big data thingie has over twenty times more CPUs than my laptop, and cloud storage is also quite fast when accessed from many machines at once. If I ran ClickHouse or DuckDB on a 100-CPU machine with a terabyte of RAM it might have still turned out faster.

But this experiment (I was thinking of using some of the new fancy tech to serve interactive applications with less latency) made me realize that big data is still a thing. This was a sample - one building from one site, which we have quite a few of.


I'd love to understand the shape of this data and some of the types of queries you're performing. It would be very helpful as we build our product here at motherduck.

I have no doubt that there are situations where the cloud will be faster, especially when provisioned for max usage [which many companies do not]. However, there are a lot of these situations even where the local machine can supplement the cloud resources [think re decisions a query planner can make].

Feel free to reach out at ryan at motherduck if you want to chat more.


> You can get off the shelf AWS units with 1000 cores, terabytes of ram and storage, etc.

Hold your horses... the beefiest servers that are in production today, unless you count custom-made stuff go to somewhere between 128 and 256 cores per board. These are hugely expensive. Also, I don't know if you can rent those from Amazon.

Typical, affordable servers range between 4..16 cores. Doesn't matter if you buy them yourself, or you rent them from Amazon. It's much cheaper to command a fleet of affordable servers than to deal with a high-end few. This is both because the one-time price of buying is quite different and because with smaller individual servers you have a fighting chance to scale your application with demand. Especially this is true in case of Amazon as you could theoretically buy spot instances and by so doing you'd share the (financial) load with other Amazon's customers.

Now... storage. Well, you see, in Amazon you can get very expensive storage that's guaranteed to be "directly" attached to the CPU you rent, the so-called ephemeral storage. This is the storage that's included with the VM image you use. It's very hard to get a lot of it. I couldn't find the numbers for Amazon, instead, I know that Azure tops out at 2 TB. In principle, this kind of storage cannot exceed a single disk, so, think Amazon probably offers the same 2 TB, maybe 4. But, again, it's cheaper to have a bunch of EBS's attached... but then you'll have to have more of them as the latency will suffer, and in order to compensate for that you would try to increase throughput, perhaps.

Also, think that, in practice, you'd want to have a RAID, probably RAID5, and this means you need upwards from 3 disks. Also, if you are using something like a relational database, you'd most likely want to put the OS on a single device, the database data on a RAID and database journal on a yet another device, and, probably, you'd want that device to be something like persistent memory / optane / something from higher-tier disks with dedicated power supply. And all this is not due to size, but due to different contingencies you need to have in order to prevent huge data loss... Now, add to this backups and snapshots, perhaps replication in 2-3 different geographical areas if you are running an international business... and that's quite a bill to foot.

There are similar problems with memory, since there can only be so many legs on memory bus and only so many pieces of memory you can attach to a single CPU, and if you also want a lot of storage, then, similarly, there can be only so many individual storage devices attached and so on.

Bottom line... even to reproduce the performance of your laptop in the cloud you would probably end up with some distributed solution, and you would still struggle with latency.


Azure has the LS series of VMs [1] which can have up to ten 1.92TB disks attached directly to the CPU using NVMe. We use these for spilling big data to disk during high-performance computation, rather than single-machine persistence, so we also don't bother with RAID replication in the first place.

Though it is a bit disappointing that while Microsoft advertises this as "good for distributed persistent stores", there are no obvious SLAs that I could rely on for actually trusting such a cluster with my data persistence.

[1]: https://learn.microsoft.com/en-us/azure/virtual-machines/las...


Well, attaching 10 PCIe devices is going to give your CPU a very hard time if all of them are to be used. The speed of copying from memory or between devices will become a bottleneck. Another problem is that on such a machine you will also need a huge amount of memory to allow for copying to work. And, if you want this to work well, you'd need some high-end hardware to actually be able to pull that off. In such a system, your CPU will prevent you from exploiting the possible benefits of parallelization. It seems beefy, but it's entirely possible that a distributed solution you could build at a fraction of the cost would perform just as well.

This situation may not be reflected in Azure pricing (the calculator gives 7.68 $/h for L80as_v3) since if MS has such hardware, it would be a waste for it to stand idle. They'd be incentivized to rent it out even at a discount (their main profit is from traffic anyways). So, you may not be getting an adequate reading of the situation, if you are trying to judge it by the price (rather than cost). But, this is only the price of the VM, I'm scared to think about how much you'd pay if you actually utilize it to its full potential.

Also, since it claims to have 80 vCPUs, well... it's either a very expensive server, or it's, again, a distributed system, where you simply don't see the distributed part. I haven't dealt with such hardware firsthand, but we have in our DC a Gigaio PCIe TOR switch which would allow you to have that much memory (in principle, we don't use it like that) in a single VM. That thing with the rest of the hardware setup costs some six-digit number of dollars. I imagine something similar must exist for CPU sharing / aggregation.


Ha! On a side note (from the page you linked):

> The high throughput and IOPS of the local disk makes the Lasv3-series VMs ideal for NoSQL stores such as Apache Cassandra and MongoDB.

This is cringe-worthy. Cassandra is an abysmal quality product when it comes to performing I/O. It cannot saturate the system at all... I mean, for example, if you take old-reliable PostgreSQL or MySQL, then with a lot of effort you may get them to dedicate up to 30% CPU time to I/O. Where the reason for relatively low utilization (compared to direct writes to disk) is the need to synchronize that's not well-aligned with how the disk may want to deal with destaging.

Cassandra is in a class of its own when it comes to I/O. You'd be happy to hit 2-3% CPU utilization in the same context where PostgreSQL would hit 30%. I have no idea what it's doing to cause such poor performance, but if I had to guess, some application logic... making some expensive calculations sequentially with I/O, or just waiting in mutexes...

So, yeah... someone who wanted Cassandra to perform well would probably need that kind of a beefy machine :D But whether that's sound advice -- I don't know.


Citations please? That's a pretty bold statement to make in the face of observed reality.


Even if this is off by two orders of magnitude and it's only 100,000 companies that need distributed compute, that means that almost all companies just need a single large computer.

Looking at the distribution of companies by employee count and assuming that data scales with employee count (dangerous assumption, but probably true enough on average), that means that companies don't need distributed compute until they get several hundred employees. [0]

[0] https://www.statista.com/statistics/487741/number-of-firms-i...



This is such a lazy response.

I/O performance is just one of many characteristics that impact performance and from experience the one you least need to worry about. RAID 0 across multiple high-end NVME drives with OS file caching is going to be more than fast enough for most use cases.

The issue is running out of CPU performance and being able to seamlessly scale up/down compute with live running workloads.


A large computer is radically CPU overprovisioned for most workloads.


But we aren't talking about most workloads.


But ... we are ... basically by definition. Vanishingly few projects actually need cloud scale infrastructure.

And, to address your previous statement: one beefy server is actually pretty scalable. Soft threads spin up in microseconds to serve incoming requests, communication between threads is blazing fast, caching is simpler on one machine, etc. You don't even have to worry too much about scaling, the CPU just throttles itself when there is no load.

And every once in a while you just upgrade to the next gen beefy machine.


Don't forget the cool JS library you included to track mouse movements so you can optimize your UI to make sure Important Money Making Things are easily clickable.

That's 8.4 hojillion megabytes per second right there.


That's still well within reach of a 1U server with some RAM and a bunch of NVMe drives.


> 170MB/s

That is not random access speed. For random access my relatively high-performance SSD only does 42MB/s reading and 80MB/s writing.


Indeed there's probably some caching going on.


One of my "computers are really fucking fast" experiments, almost a decade ago, was when I was trying to do a histogram plot of a function that I was 98% sure was terribly broken. It was expected to give a uniform distribution so I figured I'll just plot a bunch of values into a 2d space and then convert it to a greyscale image.

At first I tried to puzzle out a good sampling strategy to make sure I didn't bias the output, then on a whim I tried 2^32 samples and went to lunch. It took something like a half an hour to do 4 billion samples. Took me a couple times to figure out how to squeeze 4k megapixels into a graph so I ran it a few more times, but the results showed a very distinct banding pattern that confirmed that the problem was every bit as bad as I suspected, which was a blocking issue for our release. A couple of hours well spent, running through an 'intractable problem' that really wasn't.
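For anyone wanting to try the same brute-force check, here is a minimal sketch of the idea. It is not the original code: pairing consecutive outputs into (x, y) points, the `histogram_2d` and `to_pgm` helpers, and the use of Python's own `random` as the function under test are all assumptions made for illustration.

```python
import random

def histogram_2d(sample, n_samples, size=64):
    """Bin consecutive pairs of outputs from sample() (floats in [0, 1)) into a size x size grid."""
    grid = [[0] * size for _ in range(size)]
    for _ in range(n_samples):
        x = min(int(sample() * size), size - 1)
        y = min(int(sample() * size), size - 1)
        grid[y][x] += 1
    return grid

def to_pgm(grid):
    """Render the grid as a plain-text PGM (greyscale) image string."""
    peak = max(max(row) for row in grid) or 1
    lines = ["P2", f"{len(grid[0])} {len(grid)}", "255"]
    for row in grid:
        # Scale each cell count to 0..255 relative to the fullest cell.
        lines.append(" ".join(str(255 * count // peak) for count in row))
    return "\n".join(lines) + "\n"

# Stand-in for the suspect function: a seeded stdlib generator.
rng = random.Random(0)
grid = histogram_2d(rng.random, 100_000, size=16)
image = to_pgm(grid)
```

A uniform source should give a roughly even grey image; banding or bright stripes would point at the kind of bias described above. Writing `image` to a `.pgm` file is enough to view it in most image viewers.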


It never sat well with me that none of the production services could leverage my local computation and storage power. I don't need to store my contacts on a remote server that could index my contacts when mixed with every other contact in a single table. That’s a blatantly oversimplified example but you get the gist.


Developing apps as local-mostly with remote being "just storage" might've been an interesting approach, but so much stuff moved from native apps to webshit, and browsers still don't even have decent data management.


Well said! I wonder if Web3 could solve such a problem (or a zero trust solution). Where you provide your service that can run in a special container


I don't see incentive to host a bunch of stranger's stuff on your machine. The moment you make it easy and "bulletproof", the bad kind of content nobody wants coming from their IP will come with it.


I see it all the time: people develop applications that will never ever get a database size of over 100GB and are using big data databases or distributed cloud databases. Often queries only hit a small subset of the data (one customer, one user). So you could easily fit everything into one SQL database.

Using any of the traditional SQL databases takes away a lot of complications. You can do transactions, you can query whatever you want, …

And if the database may get up to 1TB, still no problem with SQL. If you exceed that, you may need a professional ops team for your database and a few giant servers, but they should easily be able to go up to 10 TB, offload some queries to secondary servers, …


I think a lot of data tech has come full circle and is now mostly just relational databases. Our org is invested in Redshift, which lets us mostly pay as we go. The DB itself is just a Postgres facade on scalable storage with some native connectors to file stores and third parties. After rolling over our stack like three times, we're now just dumping tons of raw data into staging tables, then creating views on top of them. It's 97% raw SQL with a smattering of Python for clunky extractions. And we're now true believers in ELT vs ETL.
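To make the "raw staging tables plus views" pattern concrete, here is a small sketch. It uses SQLite from the Python standard library as a stand-in for Redshift, and the table, column, and view names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load step: dump the raw data as-is, everything as text, no cleanup yet.
conn.execute(
    "CREATE TABLE staging_orders (raw_customer TEXT, raw_amount TEXT, raw_ts TEXT)"
)
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?)",
    [
        (" alice ", "10.50", "2023-02-01"),
        ("BOB", "3", "2023-02-01"),
        ("alice", "4.50", "2023-02-02"),
    ],
)

# Transform step: the cleanup lives in a view on top of the raw table.
conn.execute(
    """
    CREATE VIEW orders_clean AS
    SELECT lower(trim(raw_customer)) AS customer,
           CAST(raw_amount AS REAL)  AS amount,
           date(raw_ts)              AS day
    FROM staging_orders
    """
)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders_clean GROUP BY customer ORDER BY customer"
).fetchall()
```

Because the transformation is just a view, fixing a cleanup rule means redefining the view rather than re-running an extraction job, which is roughly the appeal of ELT over ETL.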


Redshift with S3 storage is no different to Spark SQL with S3 storage.

Both are distributed compute. Except that Spark allows you to mix/match code with SQL.


I think a key driver of this is not having to use SQL. I like DynamoDB and EdgeDB because I can use a more modern and reasonable language to interact with the database.


That’s a good point, I also think that there should be some modern alternative to SQL. I really like how you can query databases with LinqPad (C#) and how it renders the result into a nested table tree. All relations are clickable/expandable, so if you find something interesting in your result set, you can just expand additional rows from other tables. In the background it just creates SQL via an ORM; more than once I've more or less copy-pasted that generated SQL into a view.

But linqpad is not useful if you don’t get the pro version, only then you get code completion. So it’s not really the answer to the problem.


It's really difficult to do any kind of analysis without relational queries. The standard way you do this is to have an app datastore in DDB, and an ETL job that pipes your data into some data warehouse env.


EdgeDB works really well for relational queries, it's a graph native query language that renders into Postgresql. Check it out: https://www.edgedb.com/


Less-than-a-terabyte datasets being common had me awestruck. I, singular post-doctoral scientist noobermin[0], have processed terabytes of data at a time on HPC systems. Sure, a lot of it was garbage and I had to wade through it, but no one paid me millions to do it, I just did it to publish the papers. Sure, I needed the system which cost someone a lot of money, I suppose. But, I considered myself a small fry compared to some of the things others did on the system, particularly, hydrodynamics modellers. Moreover, I know I can probably process 100GB datasets on my own home PC, which isn't too impressive, it would just take longer (say a day or so instead of an hour or a few minutes). And this is with idk, Python scripts using MPI. Yes, MPI because I'm a computational scientist and that's what HPC systems use, nothing fancy and likely the "legacy systems" he railed against in his pitches, but it worked.

I'm just awestruck, I could tell anyone that "large data" isn't really a bottleneck, but making sense of it is the very difficult part. My mentors kept pushing me to mention the sheer size of the datasets I process in talks because it sounds impressive, and I do do so, but I always knew it didn't matter because the interpretation and analysis is the hard part, not just the "sheer size."

[0] not going to use my real name


> I'm just awestruck, I could tell anyone that "large data" isn't really a bottleneck, but making sense of it is the very difficult part

Especially now that 1TB datasets fit in memory on off-the-shelf servers and 100GB fits in memory on consumer hardware. You need a lot of data to run into real technical challenges that can't be solved by throwing a couple hundred dollars a month (amortized cost) at hardware. And often you can get by with much, much less than even that.


Not to mention if you're physically storing it (not in the cloud) 1TB can fit on your pinky finger


That's a different category of big data. I worked for a big pharma and they were building their big data department with Spark and friends. I was quite surprised that their biggest dataset had something like 200 GB.

At the same time, though, there was a lot of DNA sequencing data, we were designing CRISPR probes etc. But Spark and Hadoop aren't really that helpful in this area, so the Big Data team wasn't involved in those.


I think it does depend on the problem. Genetic stuff always seemed not easily parallelizable like my field (physics simulation) is. That said, the culture here is that MPI works and thus cray still builds computers that work better with it, so we use MPI so it works...etc etc.


> Genetic stuff always seemed not easily parallelizable

Not really. Processing one sample may take a few hours but if you have hundreds or more samples, it's an obvious axis for independent parallelization.

The cool kids use Nextflow or CWL these days. It's something like `make` - it remembers what you've already computed and what the dependencies are - but it uses a batch engine like SGE/Condor/AWS Batch to actually execute the jobs.
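As a toy illustration of the "one job per sample" axis, here is a sketch using only the Python standard library. The `gc_content` computation, the sample names, and the thread pool are assumptions for the example; a real pipeline would hand each sample to a process pool or a batch engine, as described above.

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def gc_content(sequence: str) -> float:
    """Toy per-sample computation: fraction of G/C bases in one sequence."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)

def process_samples(samples: dict[str, str], executor: Executor) -> dict[str, float]:
    """Submit one independent job per sample and collect the results by name."""
    futures = {name: executor.submit(gc_content, seq) for name, seq in samples.items()}
    return {name: future.result() for name, future in futures.items()}

# Hypothetical samples; no job depends on any other job's output.
samples = {"s1": "GGCC", "s2": "ATAT", "s3": "GATC"}
with ThreadPoolExecutor(max_workers=3) as pool:
    results = process_samples(samples, pool)
```

Since `process_samples` only relies on the `Executor` interface, swapping in `ProcessPoolExecutor` for CPU-bound work needs no change to the per-sample function.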


Big Data was whatever someone couldn't handle in a spreadsheet or on their laptop using R.

This paper is 8 years old and it was somewhat obvious then.

Scalability! But at what COST? https://www.usenix.org/system/files/conference/hotos15/hotos...

A big single machine can handle 98% of people's data reduction needs. This has always been true. Just because your laptop only has 16GB doesn't mean you need a Hadoop (or Spark, or Snowflake) cluster.

And it was always in the best interest of the BD vendors and Cloud vendors to say, "collect it all" and analyze on/or using our platform.

The future of data analysis is doing it at the point of use and incorporating it into your system directly. Your actionable insights should be ON your grafana dashboard seconds after the event occurred.


My experience with "Big Data" is it was something that couldn't be handled in a spreadsheet or on their laptop using R because it was so inefficiently coded.

I got sucked into "weekly key metric takes over 14 hours to run on our multi-node kubernetes cluster" a while back. I'm not sure how many nodes it actually used, nor did I really care.

Digging into it, the python code ingested about ~50GB of various files, made well over a dozen copies of everything, leaving the whole thing extremely memory starved. I replaced almost all of the program with some "grep | sed | awk | sed | grep" abomination that stripped about 98% of the unnecessary info first and it ran in under 2 minutes on my laptop. I probably should have tightened it up more but I was more than happy to wash my hands of the whole thing by that point.

Instead of improving the code, they just kept tossing more compute at it. Still heard all kinds of grumbling about os.system('grep | sed | awk | sed | grep') not being "pythonic" and "bad practice"; but not enough that they actually bothered to fix it.
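For what it's worth, the same "strip the unnecessary info first" idea can be written in plain Python without shelling out. This is only a sketch: the log format, the `needle` filter, and the `count_status_codes` helper are made up for illustration, not taken from the job described above.

```python
import io
from collections import Counter

def count_status_codes(lines, needle="GET /api/"):
    """Stream lines once, keep only matching ones, and count the last field."""
    counts = Counter()
    for line in lines:
        if needle not in line:   # the grep step: drop irrelevant lines immediately
            continue
        fields = line.split()    # the awk step: pick out a single column
        counts[fields[-1]] += 1
    return counts

# Hypothetical input; any iterable of lines works, including an open file.
log = io.StringIO(
    "GET /api/users 200\n"
    "GET /static/app.js 200\n"
    "GET /api/orders 500\n"
    "GET /api/users 200\n"
)
counts = count_status_codes(log)
```

Memory use stays proportional to the number of distinct keys rather than to the input size, because no copy of the file is ever held in memory.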


That is one of the selling points of Hadoop, you can write garbage code and scale your way out of any problem, turning the $$$ knob up to more nodes.


Yeah, that's why I got involved (I was infrastructure at the time) - how can we throw more hardware at it as the kubernetes setup they had wasn't cutting it.

One of the "data scientists" point blank said in a meeting "My time is too valuable to be spent optimizing the code, I should be solving problems. We can always just buy more hardware".

Admittedly the last little bit of analysis was pretty cool, but >>99% of that runtime was massaging all of the data into a format that allowed the last step to happen.


Snowflake too.

Inefficient sql? Crank the virtual warehouse.


Spending more money on compute also makes you and the team look more important, and the problems being tackled more challenging.


You can do petabytes of analysis with regular old BigQuery just as easily as you can analyze megabytes of data. This solves the scalability issue for a lot of companies, IMHO.


I agree, BQ is a gem on GCP. You pay for storage (or not, you can use federated queries) and don't pay anything when you aren't using it. The ability to dynamically scale reservations is pretty nice as well.


big data isn't big anymore.

1) 10 years ago, having access to 300tb of data that could sustain 10gigabytes/s of throughput would require something like two racks of disks with some SSD cache and junk.

2) people thought hadoop was a good idea

3) People assumed that everything could be solved with map:reduce

4) machine learning was much less of a thing.

5) people realised that postgres does virtually everything that mongo claimed it could.

6) people realised that cassandra was a very expensive way to make a write only database.

I gave a talk about using big data, and basically at the time the best definition I could come up with was "anything that's too big to reasonably fit in one computer. so think 4, 60 disk direct attached SAS boxes".

Most of the time people were chasing the stuff for the CV, rather than actually stopping to think if it was a good idea. (think k8s two years ago, chatGPT now, chat bots in 2020). Most businesses just wanted metrics, and instead of building metrics into the app, they decided to boil the ocean by parsing unstructured logs.

Not surprisingly it turned to shit pretty quick. Nowadays people are much better at building metrics generation directly into apps, so it's much easier to plot and correlate stuff.


what is your current explanation for why hadoop turned out NOT to be a good idea and everything couldn't be solved with map:reduce?


Detailed web event telemetry is where I have seen the "biggest" data, not application-generated data. Orders, customers, products will always be within reasonable limits. Generating 100s of events (and their associated properties) for every single page/app view to track impressions, clicks, scrolls, page-quality measurements can get you to billions of rows and TBs of data pretty quickly for a moderately popular site. Convincing technical leaders to delete old, unused data has been difficult; convincing product owners to instrument fewer events is even harder.


I love DuckDB and am cheering for MotherDuck, but I think bragging about how fast you can query small data is really no different than bragging about big data. In reality, big data's success is not about data volume. It's about enabling people to effectively collaborate on data and share a single source of truth.

I don't know much about MotherDuck's plans, but I hope they're focused on making it as easy to collaborate on "small data" as Snowflake/etc. have made it to collaborate on "big data".


Some of mongo's leveling off is the adoption of good jsonb columns in postgres.

mongo's got sharding out of the box - which is nice - but you have to get your key right or it will suck.

Also no one should want to host a mongo db - unless that's your business.


MongoDB grew revenue 52.8% in the previous financial year [1].

And if there is any levelling off it's going to be because of the move towards cloud managed options e.g. Snowflake, DocumentDB rather than because PostgreSQL decided to add JSONB support.

[1] https://www.macrotrends.net/stocks/charts/MDB/mongodb/revenu...


"Shards are the secret ingredient in the webscale sauce": https://www.youtube.com/watch?v=b2F-DItXtZs


This reminds me of a great blog post by Frank McSherry (Materialize, timely dataflow, etc) talking about how using the right tools on a laptop could beat out a bunch of these JVM distributed querying tools because... data locality basically.

https://github.com/frankmcsherry/blog/blob/master/posts/2015...


Big Data is dead? Seems well and alive to me. If you're not a big company with big customers, it never affected you to begin with.


Big Data is far from dead. On the contrary, people (on most daily projects) are more mindful now wrt all Big Data liabilities and benefits (infrastructure cost vs. what you get from it) thanks to the experience of the failed ones. But many analytics companies are thriving.

Also, using BigQuery as a metric of how Big Data is used is, IMHO, wrong. Real analytics companies usually have custom solutions because BigQuery is too expensive for any serious usage unless you are Google.


I am writing an essay series on this topic: last-mile analytics and how an abundance of data must be ultimately converted into (measurably correct) action.

If anyone wants to follow along, the series is here!

https://alexpetralia.com/2023/01/19/working-with-data-from-s...


That looks like a huge undertaking, but kudos for taking the time. I'll be following along. Totally agree that all data should be tied to the business value that it's driving.

Unfortunately, I've found that many data teams focus more on making the data clean and available. They never drive the conversation about what actions are being taken with the data. That leads to them being treated as cost centers. Wrote a similar post about my perspective on it - https://bytesdataaction.substack.com/p/transform-your-data-t...

I'd love to chat about the space more with you if you're interested! Email in bio.


This sounds like "sane planning, sensible tomorrow." Book for Al Gore


Not that big data is dead, more like real time data is coming to life, but you need the old stuff around to make a buck or two… Well, that's my view. LLMs and transformer model techniques are making data more relevant than ever. If you are a business, well, you are in for a “now real” digital transformation.

Making data the centerpiece of your business could mean that the effectiveness of your business processes increases by several orders of magnitude. Funny thing is, you will not use someone else’s model, unless you are building a chatbot to infer; you will need to build your own model and train it on your own business processes to be successful.

Consider a bank, here is my prediction of expected outcomes:

Enhanced Customer Experience: The system can act as a virtual banking assistant, providing customers with instant access to their account information, real-time transactions, and balance updates. The system can also answer customer inquiries and provide relevant information, improving the overall customer experience.

Improved Fraud Detection: The system can monitor the bank's financial transactions in real-time and identify any potential fraud, helping the bank reduce its exposure to financial losses.

Automated Loan Processing: The system can analyze loan applications, credit scores, and other relevant data to approve or reject loan applications in real-time, reducing the time and effort required for manual loan processing.

Personalized Marketing: The system can analyze customer behavior, transaction history, and demographic information to provide personalized marketing and cross-selling opportunities, increasing the bank's revenue and customer loyalty.

Real-Time Insights: The system can provide real-time insights into the bank's financial performance, customer behavior, and market trends, enabling the bank to make informed decisions and respond to market changes quickly.

What is interesting to me is, this is just the beginning of what could be…


Yeah, I've noticed more applications just need to focus on making sense of raw information really quickly, but usually don't need an archive to make decisions.

There are lots more interesting things that can happen with "big streaming" than with "big data". Like, cybersecurity is evolving toward monitoring and reacting to what everyone's machine has been doing in the last 15 minutes, instead of having a huge database of hashes you trust. But not a ton of things really utilize what happened, say, 10 years ago on people's machines.

There's definitely some things that can use massive archives of old data, but I have found far, far fewer things that would benefit from it, and often that comes with some very big maintenance hassles. Most of the time, you can just set data retention to 30 days and be done.


I assume you've never actually worked at a bank.

They've been working to implement your ideas for decades and none of it requires LLMs or any machine learning techniques. Basic old ETL is more than sufficient.

The issue is that (a) the calculations they need to perform are complex and take time to run, (b) there are financial regulations that weave their way through those systems, and (c) there is a lot of legacy code, especially in the core ledger system, which "just works" and which people are reluctant to touch.

That said depending on your bank you can get real-time account activity, loan approvals in < 5 minutes etc.


Well, that is an understatement. I do agree with you that banks have been trying to fix decades-old applications.

But in this process, you don’t need ETL, nor all the process and development work, to accomplish these ideas. Conceptually the idea builds itself (it learns how to treat the data), quite revealing and near real time. Provided you account for security and privacy, you basically shift your input into the data stream and, using natural language, get the data output you need, not clunky apps.

Imagine I just log in and say:

me> How much do I have?

bank> You have 100$

me> Please send 50$ to 1003

bank> Are you sure? Please add your security code to confirm

bla bla…

All this with little intervention.

Banks spend hundreds of man hours developing a lacking application while delivering a very poor customer experience. They spend millions on running decades-old applications because it's so expensive to change them… and thus the circle continues…

I’m really excited to see databases disappear conceptually, data entry, mostly all of that just disappear… I will ask my chatbot for a statement, to give me personal investment advice, and to classify all my purchases and see where my wife has been spending all my money, all from the comfort of my phone.

It’s a brave new world we are waking up to, and that to me is exciting. And coming from having helped several major banks build their infrastructure, it’s just a boost to talk about something fresh, no more hypervisors, core counts, DB licenses, etc. Ok, I’ll concede it’s pretty much the same old, just the mnemonics will be different… How many GPUs, how quickly can you spin up a container, how fast is your S3 datastore… oh wait, there is that circle again… >:D


So you're not actually talking about back-end systems but about the front-end.

In that case, chat-bots have existed for years and consumers largely don't like them.

In your scenario you can transfer money in a few clicks rather than having to write out an entire conversation.


> Are you in the big data one percent?

Exactly, and I'd go further.

Are you in the perf/scale/data one percent?

So many people worry about scaling when in reality 99% of web apps will never reach above 100reqs/s.

I've been in web dev for 20+ years. Only once when working for a big international corporate client I had to worry about traffic spikes. And that was just for one of their multiple web apps.


Who has ever believed those claims? There's a common saying "garbage in, garbage out" about what happens with all those fancy models if the data quality is not high. That's really independent of dataset size. There's no magic insight you get because your dataset is bigger. You need a quality analyst to handle your data, irrespective of its size.

Also, who thought their company would cease to function because surely they will hit google-scale dataset-sizes in the near future? Impossible for most except the biggest of the biggest


It is amusing that in 2005, "VLDB" (precursor term to "big data") was defined in Wikipedia to be "larger than 1TB".. after reading through the post and the author's experience.. it would appear that this was not actually a completely terrible estimate, although there are larger and smaller: https://en.wikipedia.org/w/index.php?title=Very_large_databa...

The current version of that article states: "There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data.[5] That said, VLDB issues may start to appear when 1 TB is approached,[8][9] and are more than likely to have appeared as 30 TB or so is exceeded.[10]" https://en.wikipedia.org/wiki/Very_large_database


Not dead, just complying with the Gartner cycle for hypes.

There is probably a rational, well thought out classification of different types of data bigness, as in CERN-big, Google-big, MegaBank-big, down to wordpress-log big and on the basis of that one would probably find that different designs are indispensable, address different pain points and cannot really "die". Hype has a more erratic lifecycle than real needs


This is an excellent summary, but it glosses over part of the problem (perhaps because the author has an obvious, and often quite good solution, namely DuckDB).

The implicit problem is that even if the dataset fits in memory, the software processing that data often uses more RAM than the machine has. And unlike using too much CPU, which just slows you down, using too much memory means your process is either dead or so slow it may as well be. It's _really easy_ to use way too much memory with e.g. Pandas. And there's three ways to approach this:

* As mentioned in the article, throw more money at the problem with cloud VMs. This gets expensive at scale, and can be a pain, and (unless you pursue the next two solutions) is in some sense a workaround.

* Better data processing tools: Use a smart enough tool that it can use efficient query planning and streaming algorithms to limit data usage. There's DuckDB, obviously, and Polars; here's a writeup I did showing how Polars uses much less memory than Pandas for the same query: https://pythonspeed.com/articles/polars-memory-pandas/

* Better visibility/observability: Make it easier to actually see where memory usage is coming from, so that the problems can be fixed. It's often very difficult to get good visibility here, partially because the tooling for performance and memory is often biased towards web apps, that have different requirements than data processing. In particular, the bottleneck is _peak_ memory, which requires a particular kind of memory profiling.

In the Python world, relevant memory profilers are pretty new. The most popular open source one at this point is Memray (https://bloomberg.github.io/memray/), but I also maintain Fil (https://pythonspeed.com/fil/). Both can give you visibility into sources of memory usage that was previously painfully difficult to get. On the commercial side, I'm working on https://sciagraph.com, which does memory and also performance profiling for Python data processing applications, and is designed to support running in development but also in production.
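To show what "peak memory" means in practice, here is a minimal sketch using `tracemalloc` from the standard library rather than any of the profilers named above. The `peak_bytes` helper and the two toy functions are assumptions for the example, and `tracemalloc` only sees allocations made through Python's memory allocators, so it is a rough stand-in for a dedicated profiler.

```python
import tracemalloc

def peak_bytes(func, *args):
    """Return (result, peak traced allocation in bytes) for one call of func."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def total_with_copy(n):
    values = list(range(n))   # materializes all n integers at once
    return sum(values)

def total_streaming(n):
    return sum(range(n))      # consumes the values one at a time

sum_copy, peak_copy = peak_bytes(total_with_copy, 200_000)
sum_stream, peak_stream = peak_bytes(total_streaming, 200_000)
```

Both functions return the same total, but the first one's peak is dominated by the temporary list. That is the pattern that tends to kill data processing jobs: the answer is small, the intermediate copies are not.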


The title might be (intentional) hyperbole, but the observations are more or less in line with what I experienced through a few Big Data initiatives over the years in different enterprise environments (although I have reservations about the 1%er comment). To me, Big Data was never about how "big" the data was, but more about the tools/systems/practices needed to overcome the limitations of the previous generation. From that perspective, yes, the "monolith" may be having a "comeback" for now due to the improvement of underlying single-node performance. But I do think data size will keep growing, and everything needed to make Big Data work will still be there when the pendulum swings back to where a single node can't handle it anymore.


I feel like big data has rarely lived in most organizations. My own experience working in large orgs largely supports the point that collected data is rarely queried. But this is rarely due to a lack of interest, it is mostly because a) nobody really has a great overview over what even is collected b) even if you know/assume something is collected, you usually have no idea where c) if you find the data, there is a decent chance that it is in some sort of weird format that requires a ton of processing to be usable.

This has been - to varying extents - my own experience working in large organizations that don't have tech as their core business.

Although there are some successful data analysis projects, the potential of the collected data remains largely underutilized.


> Most data is rarely queried

Right on point. In the past I have been obsessed with big data, looking for insights. Then I realized that a medium-sized specific data set is always better than a gargantuan general big data monster. There are so many applications in my field where only outliers matter anyway, and everything is very "centralized" to a few relevant observations. So the only thing about big data is that you maybe throw away 99.9% of the data right away and then you have some observations that you actually care about. There is soooo much data out there that is just noise, and so little that I actually care about. And that's why I still end up hand collecting stuff every now and then.


I believe we are living in the "emotional era", so data is being ignored and 'feelings' come first when making decisions or creating processes. This is happening not only in companies but in our current society in general.


Perhaps I'm somewhat cynical, but I believe this is a feature of the human condition, not an attribute of our age in particular. Reason and analysis are tools that are used to justify what we already believe.


Agreed! The so-called "Age of Reason" was the anomaly, and probably not that much more reasonable than our own time.


I think there's absolutely a place for this. I often think of the old Henry Ford quote about people wanting faster horses. Data and analytics are great for optimization, but sometimes you need to trust your gut and give people something they didn't ask for to have a breakthrough.


Customers pay a data analytics vendor to tackle a bunch of their [low quality, big size] data.

If you have no tangible capability to do the above, asking the customer "ARE YOU IN THE BIG DATA ONE PERCENT?" will be the quickest way out of the door.


I love DuckDB's simplicity and think it will solve many problems. Still, transitioning from a local single file DB to concurrent updates and serving it online will be different. I'm curious about what MotherDuck will come up with to solve DuckDB at scale.

I love use cases like the Rill Data (https://youtube.com/watch?v=XvP2-dJ4nVM), where you can suddenly run analytics with a single cmd line prompt and see your data just instantly visualized. Such use cases are only possible because of the "small" data approach that DuckDB tries.


Looks at 15 hr Spark job (running since this morning)

Sighs...


Confirmation bias exists almost everywhere. Confirmation bias among senior management is especially dangerous, as decisions are based not on data and facts but on anecdotes and hunches/feelings, with a high probability of going wrong. This is precisely where data scientists play a significant role, by providing recommendations and presenting facts based on hard data and mathematical models, in order to ensure that senior management decisions are based on facts/data, and not on anecdotes and hunches. Furthermore, a data driven organisation must have a supporting culture, where data driven decisions are given precedence, and data scientists (data messengers) must be empowered to present facts as is, no matter whether these facts are aligned or not with the basic assumptions and biases held by the senior management team. Creating such a supporting organization culture is extremely important but definitely not easy. Culture is one of the factors that makes the difference between success and failure in a data driven organisation.


To the extent "Big Data" originally and is still often claimed to mean "data beyond what fits on a single [process/RAM/disk/etc]", it's always been strange to me how much it's identified with analytics pipelines doing largely trivial transformations producing ultra-expensive "BI" pablum.

Yes, thank goodness that part is dead. But meanwhile - we've still got more actual data than ever to store, and ever-tighter deadlines on finding and delivering it. If we can get back to that and let the PySpark bootcampers fade away, maybe things can get a little better for once.

In other words:

Even when querying giant tables, you rarely end up needing to process very much data. Modern analytical databases can do column projection to read only a subset of fields, and partition pruning to read only a narrow date range. They can often go even further with segment elimination to exploit locality in the data via clustering or automatic micro partitioning. Other tricks like computing over compressed data, projection, and predicate pushdown are ways that you can do less IO at query time. And less IO turns into less computation that needs to be done, which turns into lower costs and latency.

Big data is "dead" because data engineers (the programming ones, not the analysts-in-all-but-title) spent a ton of effort building DBs with new techniques that scale better than before, with other storage patterns than before. Someone still has to write and maintain those! And it would be even better if those tools and techniques could escape the half dozen major data cloud companies and be more directly accessible to the average small team.
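As a rough illustration of the column projection and partition pruning described in the quoted passage, here is a toy sketch in plain Python. The table layout, the `partitions` dict, and the `scan` function are all hypothetical; real analytical databases do this at the storage-format and query-planner level, so this only shows why less data ends up being touched.

```python
from datetime import date

# Hypothetical table: partitioned by day, stored column by column.
# Each partition maps column name -> list of values.
partitions = {
    date(2023, 2, 1): {"user": ["a", "b"], "amount": [10, 20], "payload": ["x" * 1000, "y" * 1000]},
    date(2023, 2, 2): {"user": ["a", "c"], "amount": [5, 7], "payload": ["z" * 1000, "w" * 1000]},
    date(2023, 2, 3): {"user": ["b", "c"], "amount": [1, 2], "payload": ["q" * 1000, "r" * 1000]},
}


def scan(partitions, columns, start, end):
    """Return rows for the requested columns and date range, plus the count of values touched."""
    rows, values_read = [], 0
    for day, cols in partitions.items():
        if not (start <= day <= end):
            continue  # partition pruning: skip whole partitions outside the date range
        selected = [cols[name] for name in columns]  # column projection: ignore other columns
        values_read += sum(len(c) for c in selected)
        rows.extend(zip(*selected))
    return rows, values_read


rows, values_read = scan(partitions, ["amount"], date(2023, 2, 2), date(2023, 2, 2))
total_values = sum(len(c) for cols in partitions.values() for c in cols.values())
print(sum(r[0] for r in rows), values_read, total_values)
```

The query sums one column for one day, so it touches 2 of the 18 stored values and never looks at the wide `payload` column.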


"90% of queries processed less than 100 MB of data. [in big query]"

I think there is a problem when someone with such proclaimed knowledge of the sector gets to this, and similar, pieces of data, and does not attribute it to pricing. Could it be that queries are small because BigQuery pricing for analysis, as confusing as these models are, is based on the amount of data?[0]

Because the other line of reasoning is that a big chunk of that 90% of professionals being paid to do their jobs, do NOT take into account pricing of the tool and are using it for small data, instead of thinking that people are using the best tool with the lowest price, because there's plenty of options to process and analyse data right now in the cloud.

On the "business have low amount of data", that matches my experience as well. At first I thought I was simply dealing with smaller sized companies, but it's a trend of doing big data projects for data that'd fit a pendrive.

[0] https://cloud.google.com/bigquery/pricing#analysis_pricing_m...


Sampling has proven extremely useful. Pi can be approximated with it, and nuclear bombs were designed using statistical methods. Flame graphs based on stack samples are used to optimize servers. Government does planning with it. Management does its thing by wandering around.

It usually does not take many data points for an actionable insight and most actions then will invalidate small details in old data anyhow. Better to start every round with fresh eyes.
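The pi example mentioned above is the classic Monte Carlo estimate, sketched here under the usual assumption of uniform random points in the unit square. The function name `estimate_pi` and the fixed seed are illustrative choices.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points that land inside the quarter circle."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area is pi/4 of the unit square, so scale the fraction by 4.
    return 4.0 * inside / n_samples


for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The error shrinks roughly with the square root of the sample count, which is the sense in which a modest sample is often enough for a decision.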


This entire post reads like "you probably don't actually have big data".

What about these blockchains that have to keep data around forever, with high throughput, and need to expose it quickly? Are you saying they should delete parts of the data in the chain?

Seriously, I've spent my career working on big data systems, and while the answer is sometimes "yes you need to delete your data", I don't think that's going to always work.


And what about these blockchains? The full history of Bitcoin blockchain is less than 500gb, so for any analysis just getting a machine with a terabyte of RAM is both simpler and cheaper (once you include dev+ops time) than doing any horizontal scaling across multiple machines with "Big Data" approaches.

"You probably don't actually have big data" is a very valid point, not that many organizations do - most businesses haven't generated enough actionable data in their lifetime to need more than a single beefy machine without ever deleting data.


Bitcoin is notoriously slow. I don't think it's a good example of a high-throughput system. There are chains out there with 100x the number of transactions per second than that of Bitcoin. https://realtps.net/


Seems like just yesterday, every business magazine's cover story was about "big data." Wonder what the next batch of business buzz words will be?


Pretty funny to see this when every other headline on this site is about how large language models are about to revolutionize dentistry, beekeeping, etc.


Perhaps this is true for business data (though I'm skeptical of the claims), but, for example, for security data, this isn't true at all. Collecting cloud, identity, SaaS, and network logs/data can easily exceed hundreds of terabytes. A big reason why we're building Matano as a data lake for security.

It seems an odd pitch in general to say, hey my product specifically performs poorly on large datasets.


On the contrary, identifying what your product is explicitly not aiming to do is extremely helpful. "Big" adds a lot of complexity and pain, most people don't do that, our product avoids the complexity and pain and is the best choice for most people. Seems like a good, simple pitch, and all it requires is the humility to say that your solution isn't the best for some use cases.


Sounds like you're in the "Big Data One-Percenter" category described at the very bottom of the article.


To add to the “the real issue is…” pile:

Most orgs collect the data that is easy to collect, and they are extremely lucky if that happens to be the data that enables the insights they desire. When the data they really need looks too hard to get, the org tries to compensate by collecting more of the easy stuff, and hoping that if blood can’t be squeezed out of a stone, maybe it can be squeezed out of 100bn stones.


I remember the big data craze. People had very little data and low quality at that so they had a data problem before they had a big data one!


Yes! This!!!

Volume != Quality


> Customer data sizes followed a power-law distribution. The largest customer had double the storage of the next largest customer, the next largest customer had half of that, etc

I’m no statistician, but I’m like 99% sure that’s an exponential, not a power law

There’s a world of difference. The point of an exponential is that you can ignore big things. The point of a power law is that you can’t.
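A small sketch of the distinction, assuming the quote means each customer has half the storage of the one ranked above it. The function names, the exponent `alpha=1.0`, and the choice of 1000 customers are illustrative assumptions, not figures from the article.

```python
def halving_sizes(largest, n):
    """Size by rank when each customer has half of the previous one (exponential in rank)."""
    return [largest / 2 ** k for k in range(n)]


def power_law_sizes(largest, n, alpha=1.0):
    """Size by rank when size falls off as rank ** -alpha (power law in rank)."""
    return [largest / rank ** alpha for rank in range(1, n + 1)]


def tail_share(sizes, top=10):
    """Fraction of total storage held by everyone outside the top `top` customers."""
    return sum(sizes[top:]) / sum(sizes)


halving = halving_sizes(1.0, 1000)
power = power_law_sizes(1.0, 1000)
print(f"halving: {tail_share(halving):.4f} of storage outside the top 10")
print(f"power law: {tail_share(power):.4f} of storage outside the top 10")
```

With halving, almost nothing sits outside the top ten, while with a rank exponent of 1 most of the storage does, so the two shapes lead to very different conclusions about whose data matters.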


>> Customer data sizes followed a power-law distribution. The largest customer had double the storage of the next largest customer, the next largest customer had half of that, etc

> I’m no statistician, but I’m like 99% sure that’s an exponential, not a power law

I'm no expert either, but it seems correct. The power law distribution has each X value in an X/Y series decreasing by a specific factor, like this: https://en.wikipedia.org/wiki/Power_law#/media/File:Long_tai...

The exponential has each X value increasing by a specific factor, like this: https://en.wikipedia.org/wiki/Exponential_function#/media/Fi...


With all the LLM craziness, this is just the beginning. How else are they going to train all those models? I'm not an expert, just imho.


We need to re-think how to make data _useful_. The fact that the value hasn't materialized after decades of attempts, billions of dollars, and lots of tools and technology points to the fact that our core assumptions and patterns are wrong.

This post doesn't go far enough. It challenges the assumption that everyone's data is "big data" or that every company's data will eventually grow to be big data. I agree that "big data" was the wrong model. We also need to challenge that all data should be stored in one place (warehouse, lake, lakehouse). We need to challenge that one tool can be used for every data need. We need to challenge how we build systems both from a technology and people standpoint. We need to embrace that the problems and needs of companies _are always changing_.

We are living with conceptual inertia. Many of our patterns are an evolution from the 70's and 80's and the first relational databases. It's time to rethink how we "do data" from first principles.


The problem is that no tool alone can make data useful. It requires human ingenuity to come up with a theory, gather the required data, then test and verify the theory.

We've gotten to a point where the first and last step get skipped. Business leaders see other companies doing interesting things with data, so the answer must be "gather all the data"! Internal teams end up focused on gathering the data without the context of how it might be used.

We need to train data teams to not focus on the data as the product. Instead, they should be responsible for driving business actions. Gathering and cleaning the data should just be a byproduct of that activity.


I'm quite surprised with data sizes mentioned in the article, and wondering if I'm missing something...We are a very small 2yo company, handling route optimization and delivery management / field service. Even with our very small number of customers, their relatively small sizes (e.g. number of "tasks" per day), being very early in development in terms of data that we collect - our database containing just customer data for 2 years is ~100GB. Which I previously considered small, and if we collected useful user metrics, had more elaborate analytics, location tracking history etc, I would expect it to be at least 3x.

We don't use any "BigData" products yet, as there wasn't any need for them, even when we provide full search and relatively nice and rich set of analytics over all the data. Yet, based on the article, we're way above most of the companies relying heavily on such tools. Confusing.


Another problem with "BigData": hiring and the tendency of the ecosystem to "sustain" itself (like any system). As a company hires traditional BigData Architects, Developers, Data Scientists, Engineers, etc. it will naturally have a tendency to choose the traditional BigData technology and solutions like BigQuery, Spark, storing everything in HDFS, etc.

A trick I saw is companies hiring experienced jack-of-all-trades back-end engineers into Data teams. A lot of things get migrated from Spark to Postgres, from Kafka to REST API calls, and keep working fine and become generally more responsive.

I'm on the same page as the author here: traditional BigData tech has its place and its uses, but before choosing it companies (CTOs, architects) should carefully consider if it is necessary, especially considering the cost of it and the risk of locking themselves down in a very specialized domain.


Tableau's "Medium Data" April Fools Day ad from several years ago still rings amazingly true.


So what I'm hearing is it's not the size of your data that matters, it's how you use it?


From about 2008/2009/2010 or so on there was perhaps an over-emphasis on specialized tools for the mass acquisition of streams of data. Maybe in large part due to the explosion of $$ in ad-tech. Some people had legitimately insane click/impression streams -- I worked at a couple companies like that. Development of DBs based on LSM trees or other write-specialized storage structures became important. Existing relational databases weren't particularly well built for this stuff. This was part of, but not the whole story with the whole NoSQL thing. People were willing to go completely denormalized in order to gain some advantage or ability here. It helped that much of the data looked at was of perhaps little structural complexity.

In the meantime SSD storage took off, so the IOPS from a stock drive have skyrocketed, business domains for large data sets have broadened beyond click/impression streams, and the challenge now is not "can I store all this data" it's "WTH do I do with it?"

Regardless of quantity of data, structuring and analysis and querying of said data remains paramount. The challenge for anybody working with data is to represent and extract knowledge. I remain convinced that logic -- first order logic and its offshoot in the relational model -- remains the best tool for reasoning about knowledge. Codd's prognostications on data from the 1970s are still profound.

I think we're in a space now where we can turn our attention to knowledge management, not just accumulating streams of unstructured data. The challenge in a business is to discover and capture the rules and relationships in data. SQL is an existing but poor tool for this, based on some of the concepts in the relational model but tossing them together in a relatively uncomposable and awkward way (though it remains better than the dog's breakfast of "NoSQL" alternatives that were tossed together for a while there.)

My employer is working in this space, I think they have a really good product: https://relational.ai/


My presentation from FOSDEM 2023 is very sympathetic to the "Big data is dead" statement: https://www.youtube.com/watch?v=JlcI2Vfz_uk

It is about using modern tools (ClickHouse) for data engineering without the fluff - when you can take whatever dataset or data stream and make what you need without the need for complex infrastructure.

Nevertheless, the statement "big data is dead" is short-sighted, and I don't entirely follow this opinion.

For example, here is one of ClickHouse's use cases:

> Main cluster is 110PB nvme storage, 100k+ cpu cores, 800TB ram. The uncompressed data size on the main cluster is 1EB.

And when you have this sort of data for realtime processing, no other technology can help you.


We regularly run audits on over 12 years of customer order histories. This requires scanning about 40TB of data and growing. They used to jump through hoops on the Oracle cluster just to get data out for one customer. We pushed all of the order history into S3 Parquet using Spark and I can query this in about 20 seconds using Spark or Presto. It's now streamed through Kafka and Spark Structured Streaming so it's up to date in about 3 minutes. The click-bait-y title notwithstanding, I get that not all data is 'big' and DuckDB (and DataFusion, Polars, etc) is probably great for certain use cases, but what I work on every day can't be done on a single machine.


I mean, you can almost fit all that data on one SSD. Micron's latest are 30TB. Commodity servers are available with 24 NVMe drive bays. At 7 GB/s read, across 24 drives, you can scan 40 TB in 240 seconds.

Such a server could readily fit 384 threads with dual EPYC and would be available with enough RAM to keep more than 10% of that data in cache.

Your workload absolutely fits the definition of "fits on one machine".

I am not suggesting here that you should put it on one machine. You definitely could, though.
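The scan-time estimate above can be checked with a few lines of arithmetic. This assumes decimal units (1 TB = 1000 GB), perfectly parallel sequential reads across all drives, and no CPU or bus bottleneck, all of which are simplifications; `scan_seconds` is just an illustrative helper.

```python
def scan_seconds(data_tb, drives, gb_per_s_per_drive):
    """Time to read data_tb terabytes once, with all drives streaming in parallel."""
    data_gb = data_tb * 1000  # decimal units: 1 TB = 1000 GB
    aggregate_gb_per_s = drives * gb_per_s_per_drive
    return data_gb / aggregate_gb_per_s


seconds = scan_seconds(data_tb=40, drives=24, gb_per_s_per_drive=7)
print(f"{seconds:.0f} seconds for a full scan")
```

Aggregate bandwidth is 24 x 7 = 168 GB/s, which gives roughly 238 seconds, in line with the 240 seconds quoted.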


To be honest, I slightly disagree about data size. I think the big data is there to be had, the real story is that data science itself has not panned out to provide the business value that people asserted would come from it. Data volumes haven't risen more because in the end, it turns out most of the things businesses need to know are easily ascertainable from much smaller data and their ability to action even these smaller very obvious things is already saturated.

It doesn't help that we've shifted into a climate where hoarding data comes with a huge regulatory and compliance price tag, not to mention risk. But if the value was there we would do it, so this is not the primary driver.


It's kinda weird to read this. The whole argument is "we didn't have databases that could handle the sizes and use cases emerging, we worked on the problem for 20 years and now it's no biggie".

Mission accomplished more than big data is dead IMHO.


Long live Big Model, I guess? Instead of independent data warehouses, we are now moving towards a few centralized companies using supercomputers in physical data centers. The "winner takes all" effect will only increase as the trend goes on.


Congratulations are in order to the proud dad, Big Data, on the birth of little data.

Well, obviously, the realization that management is largely emotion driven and hardly data driven is a prelude to the CEO-AI still in the making.

Of course this still has a face to it: a CEO who speaks and talks as the voice commands, but does not do the part that even humans who think they are good at it are bad at, namely decision making. The ground truth is there ("corporate history"), going back to the merchants of Sumer. Let's learn that lesson, pack it into a decision tree, and wrap that bundle with ChatGPT smooth talking.


The database was the key technology in the 2001-2011 decade: it allowed companies to store massive amounts of data in an organized way, so that they could provide basic functionality (search, monitoring) to users. Statistical learning has been the key "technology" of 2011-today: it allowed companies, which had stored massive amounts of data, to feed predictions back to users. I think AR/Computer Graphics will be the key technology of the next decade: it will allow users to interact directly and seamlessly with the insights produced by ML systems, and possibly feed back information.


nosql is dead, client side SPAs are dead. Nice to see the complexity pendulum swinging back to the correct side again. Curious what the merchants of complexity will reach for next. Are applets going to be the new hot thing?


Great post and really resonates with my experience. Good to have some confirmation that most organizations aren't using their large swaths of data.

Although I don't think most organizations are blaming lack of actionable insights on the data size. It's the lack of prioritizing data usage over data accessibility. We need to be teaching data people business levers and teaching business people data levers.

Data should be a byproduct of an actionable idea that you want to execute. It shouldn't exist until you have that experiment in mind.


Big Data lives on in LLMs.


Agree. And one thing I noticed is that tools like Apache Spark have become the de facto standard for any data engineering work even when the data size does not require it. The result is that many jobs are much harder to maintain and often slower (due to all the shuffling) than running on a single node.


Somewhere along the line people were tricked into thinking that logging was data, and that we needed to turn up every trace log to 11 on every production system.

Logs are where data goes to die.


My personal definition of Big Data has always been when you gather/store data without having a planned use for it. Do we need this data? Don't know, let's just store it for now.

The article does allude to this definition when it states that "Most data is rarely queried". We have become data hoarders. Technology has made it easy (and relatively cheap) to store data, but the ideas of what to do with this data have not scaled in comparison.


"Among customers who were using the service heavily, the median data storage size was much less than 100 GB"

Eye-opening. Especially when combined with a recent quote from Satya Nadella, "First, as we saw customers accelerate their digital spend during the pandemic, we’re now seeing them optimize their digital spend to do more with less."

Conclusion: SaaS is easy to drop off in downturns. Just as easy as it is to buy initially.


Why would I use DuckDB instead of Clickhouse or similar? Is it just because I want to have the database embedded in my app and not connect to a server?


One great reason to use DuckDB was when ClickHouse took up too much memory on Parquet files.

https://github.com/ClickHouse/ClickHouse/issues/45741#issuec... helps with that though.

Also, clickhouse-local exists https://clickhouse.com/blog/extracting-converting-querying-l... as a thing.

But, yes, when I think of DuckDB...I think embedded use cases...I'm also not a power user.

I also think of this very much as a 'horses for courses' or 'different strokes, different folks' sort of scenario. There is, naturally, overlap because 'analytical data.' But also, there is naturally overlap with R and this giant scary mess of data-munging Perl code I maintain for a side project.

The DuckDB team, the MotherDuck team, the ClickHouse team...we all want your experience interacting with data to be amazing. In some scenarios, ClickHouse is better. In some scenarios, DuckDB. I'm biased (as I work for ClickHouse in DevRel), but I <3 ClickHouse.

Try both. Pick the one that is best for you. Then...you know...tell the other(s) why so that we all can get better at what we do.


Thanks but I’m looking for specific use cases. Like I get SQLite. And I get Clickhouse. But I just don’t get why I’d use DuckDB specifically. I’m sure it’s awesome and super useful but I have a gap in my understanding.


I agree with a lot of the sentiments of the MotherDuck people, but boy are they loud and proud for someone who never delivered anything more than blogposts and vague promise to somehow exploit the MIT licensed DuckDB.

Meanwhile, for example, boilingdata.com seems to have already done that - by using AWS Lambda + DuckDB as a distributed compute engine, which I can't decide if it's awesome, deranged or both.


We (the DuckDB team) are very happy working together with MotherDuck in a close partnership [1].

[1] https://duckdblabs.com/news/2022/11/15/motherduck-partnershi...


Unlike quantum which cracks computationally complex algorithms, BigData was just about costs.

SSDs were limited in capacity and still expensive.

Parallelizing work with MapReduce allowed using cheap fault-prone commodity hardware and disks.

If you’re dealing with terabytes rather than petabytes of data, you probably don’t need BigData


"Very often when a data warehousing customer moves from an environment where they didn’t have separation of storage and compute into one where they do have it, their storage usage grows tremendously..."

Can someone explain why this is the case? Is it due to more replications or maintaining more indices?


Well then if businesses do not require data, then the "AI world" might need some. So changing practice to be a machine learning engineer might not seem too bad.


Well, we have less than 2 TB of data, and although we are running MySQL on a large instance with ~120 GB of RAM, it's extremely slow when dealing with big tables (like a 25 GB table) and that's why we need "big data" tools like BigQuery.


It's always kind of amazed me how closely Big Data was followed by the KonMari method and it really seems like the nerds were not paying attention to that at all. Or just happy to take a paycheck from people who weren't paying attention.

Hoarding is not a winning strategy.


I think server hardware solved the big data issue. The stuff we have now can blitz through data in the blink of an eye. For national governments like our own, mainframes still have a place. For me personally, I don't even talk about big data anymore.


People don't want to deal with having to rearchitect when their workload does not fit on a single instance. Yes, optimize for the small data case, but if you build a product that can handle only the small data case, you have a tough sell.


> The most surprising thing that I learned was that most of the people using “Big Query” don’t really have Big Data.

wow, ya think? Must have been eye opening to see all those customers with a few million rows thinking they had "Big data" huh?


It's not dead, it's just entered the plateau of productivity. Where people use it for whatever it's useful for and don't try to solve every problem with it just because it's the cool new thing.


On to the next hype theme(AI)!


Big data starts somewhere around a petabyte, maybe a bit lower than that. That's when you need some serious, dedicated systems. But as always everyone wants to (pretend to) do what the big players do.


First they came for the sacred microservices, now they are after Big Data. What. Is. Happening.

Don't get me wrong, I love it. It's about time people got off these stupid and shockingly expensive bandwagons.


Similar to the "we must have microservices so that we can scale" fad a lot of people thought they had big data even though their records easily fit on a single machine.


We're going to reach a point where we might say the same thing about large language models. Fine tuned LMs (based off of their large parents) are going to be the bread and butter.


Agree with this post.

Big data was vendor generated hype that convinced many engineers to confuse the size of their dataset with their, ahem, shoe size.

They didn’t do their employers any favors.


Big data is dead because executives are rewarded when decisions are reactionary and politically savvy, data doesn't enter the picture.


This is a bit ironic given that generative AI models like GPT-3 and Dall-E only work because they were trained on very large datasets.


The main goal of Big Data, as I see it, is to profile performance and metrics. Number of user registrations, number of converted users,...


I quite often say: If you need KPIs you're too far removed from how the company actually conducts business.


So the argument is you can do everything with an OLAP Database because we shrunk "Big Data" back inside RAM?

K, good luck!


Medium data is where it's at! diskframe.com and polars and arrows are good enough for most use cases!


The truth is that most "big data" problems aren't big and can often be solved with awk and xargs.


What about IoT?


Big Data got replaced by Big Parameters.


Parameters come from data.


On the contrary. Now that AI is here, big data is going to be more alive than ever.


I think the relevant phrase here is 'Selling your book'


This is what most of the companies are doing.


Long live big data!


Long live Big Meaning


Listen to “Reason”


Interesting take from a Googler.

Big data hype never felt to me like anything more than a hype campaign to help big tech research ML/AI.

Larry even rambled as much: https://arstechnica.com/information-technology/2013/05/larry...

It appears not all Googlers got the memo.

Everyone else is in the way of him solving big problems! Not like such work could not be distributed among technologists and researchers around the globe via the internet. Help Google do it!

I am leaning into “Deep Work” going forward; will slowly iterate on my own model creation and collaborate with like minded folks. I’m fucking done with intentionally empowering billionaire minority who convinced an ignorant political gerontocracy that minority is capable of magic.

Anyone prattling on with common tropes of “longtermism”; nation state nutters, religious, technocrats; are appealing to non-existent authority: they see a magical future for us! Give them your money to ensure it arrives! They have zero ability to ensure such outcomes and a lot of upside to making people believe such today.


Good riddance.



