Big data is dead

fdgsdfogijq · on Feb 7, 2023

"For more than a decade now, the fact that people have a hard time gaining actionable insights from their data has been blamed on its size."

The real issue is that business people usually ignore what the data says. Wading through data takes a huge amount of thought, which is in short supply. Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven". Most corporate decision making is highly political, the needs of/whats best for the business is just one parameter in a complex equation.

slt2021 · on Feb 7, 2023

I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

I did several experiments, and noticed that whenever I produced analysis that was in line with what management expected - my analysis was praised and widely disseminated. Nobody would even question data completeness, quality, whatever. They would pick some flashy metric like a percentage and run around with it.

Whenever my analysis contradicted - there was so much scrutiny in numbers, data quality, etc, and even after answering all questions and concerns - analysis would be tossed away as non-actionable/useless/etc.

if you want to succeed as a Data Scientist and be praised by management - you got to provide data analysis that supports managements ideas (however wrong or ineffective they might be).

Data Scientist's job is to launder management's intuition using quantitative methods :)

koolba · on Feb 7, 2023

> if you want to succeed as a Data Scientist and be praised by management - you got to provide data analysis that supports managements ideas (however wrong or ineffective they might be).

> Data Scientist's job is to launder management's intuition using quantitative methods :)

It’s no different than the days when grey bearded wisemen would read the stars and weave a tale about the great glory that awaits the king if he proceeds with whatever he already wants to do.

The beards might be a bit shorter or nonexistent, but the story hasn’t changed.

somenameforme · on Feb 8, 2023

If you [Croesus] go to war, a great empire will fall.

Guthur · on Feb 7, 2023

And the alternative is to use the data as bones, throw it up in the air and let it tell you what to do?

bronson · on Feb 7, 2023

Absolutely. If you don't like what K-Means is telling you, change a variable and re-run. (that's one great thing about business data: there's no shortage of variables! True, there's usually a shortage of independent variables, but fixing that is difficult and underfunded).

red-iron-pine · on Feb 7, 2023

And you'd better hope the bones actually say something useful.

I was the infra lead on a data lake project and got take part all the way to breaking down the data and turning into PowerBI reports. The result was "sell more" and to clients who marketing already identified, years ago, as whales.

There were some interesting other insights, esp. w/r/t to niche products that sold around weird dates (Easter, Memorial Day, 4th July -- but not obvious gift days like Valentines or X-Mas), but it led to a lot of "you're doing it wrong!" recriminations and follow up projects.

peatmoss · on Feb 7, 2023

> Data Scientist's job is to launder management's intuition using quantitative methods

Ouch. This is savage, but sadly correct in many cases.

HOWEVER, to play devil's advocate here, I've also seen corporate data scientists overstate the conclusions / generalizability of their analysis. I've also seen data scientists fall prey to believing that their analysis proves would should be done, rather than what is likely to happen.

The role of an executive or decision maker is to apply a normative lens to problems. The role of the data scientist / economist / whatever is to reduce the uncertainty that an action will have the desired effect.

antipaul · on Feb 8, 2023

Yep. At this point, I essentially don't trust any ML result that shows > 95% accuracy.

So often, those models proved to be over-fitted and not generalizable.

But too many decision makers simply can't properly judge such results.

mmsimanga · on Feb 8, 2023

Good point. Data is one aspect of making a decision. The other aspect is understanding the industry and environment. Often data scientists give just one variable needed to make a decision. In health care for example you need to factor in a whole host of legislation. You also need to factor aspects of the industry not reflected in the data. As an example doctors not wanting to use iPads is something you can't measure and can't force as company. Even though data analysis might suggest this is the way to go.

derefr · on Feb 8, 2023

> The role of an executive or decision maker is to apply a normative lens to problems. The role of the data scientist / economist / whatever is to reduce the uncertainty that an action will have the desired effect.

Where do business analysts fit into this dichotomy? Their whole job is to poke around in Tableau in order to surface high-ROI strategies for the business to pursue. (Where, in choosing which proposals to surface to management, they're effectively making 90% of the strategic decisions.)

Or how about corporate buyers in trading and retail companies?

Or quantitative investment managers?

peatmoss · on Feb 8, 2023

People who poke around in Tableau might not get a lot of respect in the hierarchy of DataFolk, but descriptive statistics and thoughtfully chosen visualizations can be immensely useful. Exploratory data analysis sometimes reveals patterns that are so obvious that to apply statistical inference is just vanity.

If understanding the data generating processes is the goal, I'd rather see some useful plots than wade through a technical description of some model whose assumptions were flagrantly violated.

selestify · on Feb 8, 2023

What does a “normative” lens mean?

astine · on Feb 8, 2023

As opposed to "postive". It's the old is-ought dichotomy https://en.m.wikipedia.org/wiki/Is%E2%80%93ought_problem.

Positive claims are about what is true. Normative claims are about what should be true, or rather what decisions we should make. Put another way, positive claims deal only with facts while normative claims deal also with values.

GP is saying that it's the data-scientist's job to give the executive the facts and it's the executive's job to decide what to do about the facts.

pjmorris · on Feb 7, 2023

> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

So, a synonym for 'consultant?' :)

happymellon · on Feb 7, 2023

My experience with consultants normally ended up with them asking why they are there and what report should they present to upper management.

I've always used them as "independent 3rd parties" who were listened to.

I_complete_me · on Feb 7, 2023

Here's my take on this not listening to the "expert":

A few years ago there was a problem with storm-water infiltration into my (elderly at the time) mother's property from her neighbor. I, being a dutiful son and a civil engineer, investigated it and came up with the probable cause, the likely effects of non-action and the most cost-effective solution. I presented it to my mother in the most layman-like terms that I could. She said she'd think about it – meaning she'd refer i.e. defer – to her daughters. In the meantime I had a very layman-like chat with my mother's carer and told her the situation in layman's terms. The carer listened and said that what I said I made total sense to her. Later on, one of my sisters accosted me and stated that it was completely obvious what the problem and the solution was – "even the carer could see it". Human foreheads don't have the real estate for where my eyebrows wanted to ascend.

My advice is to consider whether the message should be separated from the messenger somehow.

EricE · on Feb 7, 2023

"A prophet is not welcome in his hometown"

This has been going on for mellennia, unfortunately.

jacobn · on Feb 7, 2023

This sounds a lot like how my kids will listen to a teacher/coach, but not their parents...

rxhernandez · on Feb 7, 2023

Which is similar to how a lot of parents won't listen to their kids but will listen to the coach, teacher, or priest.

mcphage · on Feb 10, 2023

My Mother-in-Law was called by a tech support scammer. Her bank was unwilling to accept their charges, and the scammer wanted her to call the bank to tell them to accept them anyway. My Brother-in-Law was telling her "no, this is a scam, do not do this", but she was unwilling to listen. Eventually, he told her to call me, thinking if she wouldn't listen to her son, maybe she'd listen to her son-in-law. Which she did.

makeitdouble · on Feb 8, 2023

Parents can be listening to their kids 99% of the time it will be transparent and uneventful. When there’s confrontation/divergence in opinion, by definition it didn’t work out through the usual channel, and of course a third party weighting in the balance will have visible effects.

neuronic · on Feb 7, 2023

As a consultant with roots in backend dev, I fully understand the scrutiny that we receive because unfortunately, it is often very warranted... It feels a bit refreshing to read your comment and see someone articulate what I am trying to convey to my clients. I am a tool, and yes, this pun is intended.

lelanthran · on Feb 8, 2023

Hello

I want to make the move from development into consultancy and would appreciate hearing how you did it.

I cannot DM you, but if you have the time to type a small paragraph, I'm all ears.

If you don't want to make that public, you can directly email me (<HN-username>@gmail.com)

neuronic · on Feb 8, 2023

Sure, it is actually not very complicated in my case. I did backend development for a short while during and after university and then moved into IT consulting fairly quickly.

It was a LinkedIn recruiter message which I usually ignore. However, my SO did not (she is in IT as well) and convinced me to join a hiring event. I ended up liking it a lot and went through the hiring process. Soon, I started out on the most junior level and joined my first project with 3 very senior colleagues after a few weeks.

The learning curve was very steep both on the technical level and also regarding the consulting aspect - at first there was nothing I could 'consult' on due to lack of experience. This changed with growing experience, with the guidance of senior colleagues and my private efforts to gain skills and expertise.

Let me know if this was helpful!

zigman1 · on Feb 8, 2023

This almost reads like my trajectory so far, but I'm at the point where I can't really consult due to the lack of experience, but I did make a good impression so far. Can I ask you, into what efforts should I put my private time? More technical knowledge? Into very fine details, or brief insights into different areas? Any good resources?

Thanks a lot!

didgetmaster · on Feb 7, 2023

In the news business, if your story or opinion backs up the preconceived notions of the investigative reporter then you are a 'source' otherwise you are a 'conspiracy theorist'.

htrp · on Feb 7, 2023

All symptoms of the same problem..... you can hire McKinsey to confirm your priors, massage the data to confirm your priors, or anything in between.

stronglikedan · on Feb 7, 2023

A consultant with the data to back up their claims!

strbean · on Feb 7, 2023

If Data Scientists are essentially in-house management consultants, I wonder which is cheaper?

slt2021 · on Feb 7, 2023

This could be a reason why Data Scientist as a job title exploded in last years, every middle manager could afford one/two/few headcounts of data scientists to produce analysis that advances that middle manager's corporate agenda (more growth, empire building, expansion to certain de-novo areas, etc).

Recent tech layoffs is the other side of that growth, when cheap money is gone and company is forced to stick to core competencies and shutdown growth plans

eska · on Feb 7, 2023

This would match what psychologists say about humans in general: we feel first, then we use our brain to justify that feeling. We’re not rational beings.

diognesofsinope · on Feb 7, 2023

I think the answer is simpler: people care about their careers and their family first. Think, "If the data says something that gets in the way of my career well I don't care about the data."

Had the same problem when I was an economics researcher -- publication bias for what stakeholders want to hear (often the government) is rampant because that's where funding for the economics department mostly comes from.

inadequatespace · on Feb 19, 2023

It's only rational. The company certainly doesn't care about that individual first, as evidenced by e.g. its decision to lay them off when it doesn't think the individual is serving them, so why should the individual put the company first?

This is also known as The Iron Law of Institutions.

AnIdiotOnTheNet · on Feb 7, 2023

We totally are, it's just that rationality is a tool, not a guide. If you want to work out the truth, rationality will help you do that, but if instead you want to justify a decision you already made, well, it'll help you do that too.

galangalalgol · on Feb 7, 2023

Hypothesis don't come from rationality either, they result from well informed intuition. All of the formality of science is about tricking ourselves into discovering our intuition is wrong using a rational series of steps even when everything in our nature is to use that ability to reason to do the opposite.

Karellen · on Feb 7, 2023

"Man is not a rational animal; he is a rationalizing animal."

-- Robert A. Heinlein, Tunnel in the Sky, 1955

ArjenM · on Feb 7, 2023

A whole industry of emotional branding is thriving, systematically overloading our brains so it hurt to even think differently in the moment.

We are accepting a whole lot of assumptions every day.

trieste92 · on Feb 7, 2023

or is it that psychologists feel that we aren't rational and use reason to justify this?

it isn't clear to me how the grounds for realizaing the theory are reconciliable with its conclusion

_2uwr · on Feb 7, 2023

Thats because psychologists dont understand or choose to ignore how chemistry influences our personalities and emotions. An extremely simple example from the same medical/health profession is the use of SSRI's to make people feel happy. The legal system recognises how chemicals influence our feelings because of the laws that exists on illegal drugs or drink driving.

The definition of rational is being informed enough to know what said chemicals will do in the short term and long term in order to make an informed decision, but then I'm reminded we dont get taught any of the above unless we specialise at a Uni, so most people cant make any sort of informed decision.

h0p3 · on Feb 12, 2023

I'd like to know more about "[Your] life under [your] state employed 'parents' and at the hands of other employees of the state"

steveBK123 · on Feb 7, 2023

Yes. I worked in the data org of a moderately sized financial firms tech org. The tech org claimed to be hugely data driven. Was in the org mottos and all of that.

Nonetheless, the CTO went on a multi-year, 10s of millions of dollars, huge data tech stack & staffing reorg shake up... with really zero data points explaining the driver, or what we would measure to determine it was successful.

So it became a self referential decision that we are successful by doing what he decided, and we are doing it because he decided it.

zeagle · on Feb 7, 2023

Huh. I've not thought of it as laundering, but I think you've basically summarised consulting in healthcare. Pay to legitimize and push through a pre-existing idea (eg let's close down a few ERs) or a delusion (e.g. lean, we don't need a waiting room) and say it was recommended by consultants to stakeholders and the public.

slt2021 · on Feb 7, 2023

all consulting is like that, Partners/MDs at consulting companies meet with Board/CEOs to get rough idea of what they want/need, and quickly negotiates a consulting engagement contract to create PowerPoint with all the evidence and analysis gathered that supports CEO's initial idea.

This is the only reason why a 60+ PowerPoint slide deck can cost several millions dollars

saltcured · on Feb 7, 2023

Right, the more appropriate analogy is "parallel construction"...

monero-xmr · on Feb 7, 2023

Also a big reason McKinsey and BCG exists - provide cover for business plans intended by management to protect them from shareholder lawsuits. My friend did a sojourn at McKinsey and 6 months of his life was producing PowerPoints and memos backing up an expansion to AIPAC region. Was already happening but he was providing all manner of business justification for board meetings and whatnot.

out_of_step · on Feb 7, 2023

This phenomenon is true to varying degrees in academic medicine (maybe all of academia) as well - personally have seen excellent data and methods disregarded when they don't confirm existing agendas. The choice for the researcher can become one of burning out trying to do good work and getting nowhere, or acquiesce and only present data that is uncontroversial. Huge existential threat to knowledge advancement.

karmajunkie · on Feb 7, 2023

"Data launderer" would be a good job title...

wolf550e · on Feb 7, 2023

The data is not laundered. Preconceived ideas and biases are laundered and given scientific sounding justifications.

e12e · on Feb 7, 2023

Concept Confirmer, Bias Booster?

barbecue_sauce · on Feb 7, 2023

Context Provider.

e12e · on Feb 7, 2023

Affirmation Artificier?

MonkeyClub · on Feb 7, 2023

Assessment Assurance, for double the bang.

BLKNSLVR · on Feb 8, 2023

Cherry Picker

MonkeyClub · on Feb 8, 2023

Good role for all Perky Cheekers, surely!

yMEyUyNE1 · on Feb 8, 2023

Data Torturer? <https://en.wiktionary.org/wiki/if_you_torture_the_data_long_...>

ryanwaggoner · on Feb 7, 2023

Decision laundering. Take in dirty decisions and produce clean ones.

hermitcrab · on Feb 7, 2023

Most data is very dirty.

Phrenzy · on Feb 7, 2023

That would imply the data is clean when they are finished. GIGO

SkyBelow · on Feb 7, 2023

This isn't just "Data Scientist" but scientist as well. The more a finding is in contradiction, either with existing scientific consensus or even with just popular culture, the more the science is criticized. I've seen unequal criticism based on how much people wanted the results to be true/false and even after responding to the criticism I've seen people just ignore science they don't like.

The skepticism isn't a problem, the unequal application of it, the potential to harm careers, and the chilling effect as people wisen to how best meet their own personal goals is.

User23 · on Feb 8, 2023

> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

The SNAFU principle: communication is only possible between equals. When an hierarchical divide exists the subordinate will tell the superior what he wants to hear.

nobody9999 · on Feb 8, 2023

>The SNAFU principle: communication is only possible between equals. When an hierarchical divide exists the subordinate will tell the superior what he wants to hear.

Sadly true. As humorously depicted here[0]:

   In the beginning was the DEMO Project. And the Project was
   without form. And darkness was upon the staff members 
   thereof. So they spake unto their Division Head, saying, "It 
   is a crock of shit, and it stinks."

   And the Division Head spake unto his Department Head, 
   saying, "It is a crock of excrement and none may abide the 
   odor thereof." Now, the Department Head spake unto his 
   Directorate Head, saying, "It is a container of excrement, 
   and is very strong, such that none may abide before it." And
   it came to pass that the Directorate Head spake unto
   the Assistant Technical Director, saying, "It is a vessel of
   fertilizer and none may abide by its strength."

   And the assistant Technical Director spake thus unto the
   Technical Director, saying, "It containeth that which aids 
   growth and it is very strong." And, Lo, the Technical 
   Director spake then unto the Captain, saying, "The powerful 
   new Project will help promote the growth of the 
   laboratories."

   And the Captain looked down upon the Project, and He saw 
   that it was Good!

[0] https://www.anvari.org/fortune/Best_Fortunes/820_in-the-begi...

chongli · on Feb 7, 2023

Same goes for economists and the politicians who sponsor them, just as it did for the astrologers and their patron kings.

BLKNSLVR · on Feb 8, 2023

Economics is the go-to for conservative, status-quo maintaining arguments because there's a wealth of statistics and information available for "how things have always been", and precious little-to-none for "how things could be if..."

It's easier to poke holes in predictions of the future than in interpretations of the past, especially when the people making those decisions have likely reached their decision-making status through "how things have always been".

remus · on Feb 7, 2023

I think this depends a lot on the org. In a place I used to work we collected and analysed a lot of data which convinced management to significantly change the spec of the product and spend a lot more time and effort on testing, because the product was being used in unexpected ways.

I would say it was a very engineering driven org however, so if you could present compelling data it could go a long way.

HybridCurve · on Feb 8, 2023

Authoritarian types consider any information derived by science which is contrary to their position as invalid or irrelevant because facts challenge their authority and ability to exercise control.

fatneckbeard · on Feb 8, 2023

yes. I used to think the Church had a honest disagreement with Galileo about heliocentricity. When I grew up I realized the Church never cared about orbits at all, what they care about is maintenance of status quo.

And then when I got old, I realized, there is even a reason that some people want status quo... because they have usually been around long enough to see society fall apart into anarchy and mass murder, so in their mind, they are doing the right thing.

defrost · on Feb 8, 2023

"The Church" wasn't then and isn't now a monolith of opinion.

A modern characterisation of "The Galileo Affair" would be that he was SWAT'ed by someone he was really really mean online to.

    Thus the whole "Galileo affair" starts as a conflict initiated by a secular Aristotelian philosopher, who, unable to silence Galileo by philosophical arguments, uses religion to achieve his aim.  [1]

and

    While delle Colombe was almost alone in arguing publicly against Galileo, there was a group of scholars and churchmen who supported his Aristotelian views. After Galileo referred disparagingly to delle Colombe as 'pippione' ('pigeon'), his close friend the painter Lodovico Cigoli coined the nickname 'Lega del Pippione' ('The Pigeon League') for delle Colombe's group.[2]

Galileo literally refered to delle Colombe (and friends) as Simplicio (simple minded) and worse in his highly popular Dialogue Concerning the Two Chief World Systems [3] and within a year or so the Pigeon League got their revenge, using their influence to have religuous charges bought against Galileo.

The affair was complex since very early on Pope Urban VIII had been a patron to Galileo and had given him permission to publish on the Copernican theory .. this was very much a case of personal vendettas and internal politics rather than a straight up case of "The Church Versus Galileo".

[1] https://en.wikipedia.org/wiki/Galileo_affair#cite_note-Spell...

[2] https://en.wikipedia.org/wiki/Lodovico_delle_Colombe

[3] https://en.wikipedia.org/wiki/Dialogue_Concerning_the_Two_Ch...

andrewprock · on Feb 8, 2023

The church was not upset about heliocentrism. They were upset that Galileo was attempting to reinterpret the words of Bible in order to bolster his astronomical authority.

andrewprock · on Feb 8, 2023

"[Therefore,] when God willed that at Joshua’s command the whole system of the world should rest and should remain for many hours in the same state, it sufficed to make the sun stand still. In this manner, by the stopping of the sun, the day could be lengthened on earth—which agrees exquisitely with the literal sense of the sacred text.”

- Galileo Galilei

icelancer · on Feb 8, 2023

>>I did several experiments, and noticed that whenever I produced analysis that was in line with what management expected - my analysis was praised and widely disseminated. Nobody would even question data completeness, quality, whatever. They would pick some flashy metric like a percentage and run around with it.

>> Whenever my analysis contradicted - there was so much scrutiny in numbers, data quality, etc, and even after answering all questions and concerns - analysis would be tossed away as non-actionable/useless/etc.

It's a good sign at the company that I run, anytime our analysts/data scientists come up with metrics that say we're killing it, or that our ideas should bear a ton of fruit, the kneejerk reaction is to be extremely skeptical of the results. Usually they're still right.

When the data scientists say we're fucking something up, we tend to pay a lot more attention.

Only the paranoid survive, after all.

disqard · on Feb 7, 2023

An interesting essay that echoes these same sentiments:

https://ryxcommar.com/2022/11/27/goodbye-data-science/

adasdasdas · on Feb 7, 2023

Oh the experiment didn't go as expected? Rerun 5 more times with minor tweaks. It definitely not p-hacking ;).

data-ottawa · on Feb 7, 2023

I’ve been there, we wanted to release a feature, it kept coming back with issues that made it perform much worse than control, after 5 or so iterations with bug fixes it came back positive.

It took a lot of analysis and time to clarify to higher ups that we weren’t just P-hacking , but at least they were concerned about that.

tpoacher · on Feb 7, 2023

sadly same with academia and funding sources

prometheus76 · on Feb 7, 2023

I agree with you. I call Data Scientists "soothsayers for the Pharoah".

sidpatil · on Feb 7, 2023

> Data Scientist's job is to launder management's intuition using quantitative methods :)

https://www.youtube.com/watch?v=kAichhoZrKs

m463 · on Feb 7, 2023

I wonder what a data scientist could really find out about executive (over?) compensation. employee compensation. working from home. office cubicle size and layout. tool expenditure for employees vs productivity.

ProjectArcturis · on Feb 7, 2023

How would you measure productivity at scale?

anthonygd · on Feb 8, 2023

Well, obviously you ask the CEO.

sva_ · on Feb 7, 2023

> I used to joke that Data Scientists exist not to uncover insights or provide analysis, but merely to provide factoids that confirm senior management's prior beliefs.

Also someone to blame if it doesn't work out

timbaboon · on Feb 7, 2023

As a data scientist at a large corporate I find this is often the push… but I resist every time and tell people what they don’t want to hear. Maybe I’m playing this whole corporate ladder thing wrong :/

dredmorbius · on Feb 8, 2023

"Data Scientists exist not to uncover insights ..."

That was a plot point in Dirk Gentley's Holistic Detective Agency (1989), though the observation much pre-dates this.

dylan604 · on Feb 7, 2023

how is this really different from any other aspect of life? Very few people really like to be told counter information, and it is always easier when providing data that aligns with the current group think. Doesn't matter if it is business, politics, or really anything. Being the outlier trying to change the direction of things is a struggle.

dcl · on Feb 8, 2023

110% my experience working in data science.

I found it incredibly stressful to discover and provide analyses (even experimental results) that wasn't expected, or contradicted prior beliefs. The findings were always very harshly scrutinized, and typically lead to tons of pointless extra work to 'understand what is going on'.

dukeofdoom · on Feb 7, 2023

That's like a portrait artist that finds success by painting people more beautiful than they really are vs a starving one that paints them true to life due to sense of artistic integrity. Reminds me how Garth Brooks started doing metal after becoming a country music star.

rahoulb · on Feb 8, 2023

Not just data scientists.

My friend runs a successful market research agency and she says she gets called in when management have decided they need to make a change but need evidence to sell it to the shareholders and staff.

stevenally · on Feb 7, 2023

So Big Data isn’t dead. It’s just found it’s place in society….

throwaway15908 · on Feb 7, 2023

Sounds like a good way to weed out middle management ;)

terry_y_comb · on Feb 7, 2023

Indeed, confirmation bias happens to almost everyone.

deterministic · on Feb 8, 2023

Bingo. Welcome to the world of office politics!

xkcd1963 · on Feb 7, 2023

Reminds me of the book "Bullshit Jobs"

whoevercares · on Feb 7, 2023

True - aka “are right a lot”

TEP_Kim_Il_Sung · on Feb 8, 2023

Government should do it, that way we can be sure it is honest and correct.

alar44 · on Feb 7, 2023

I mean, that makes sense does it not? If you're confirming something people already had a hunch about, why would they challenge it? And if it does go against their belief, they are going to want to make sure the data is correct before they change the course of the ship.

hn_throwaway_99 · on Feb 7, 2023

Agree with some of what you've said, but disagree with a lot:

> Most corporate decision making is highly political, the needs of/whats best for the business is just one parameter in a complex equation.

100% Individual humans are emotional creatures with their own wants and needs, and it's important to understand how organizational incentives drive decision making.

> Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven".

This has not been my experience, though. The more common thing I've seen is that, sometimes data is boring and doesn't really show much actionable insight, but as everyone wants to justify their job, I've seen data scientists come up with really questionable conclusions that fell apart on further inspection (call it "p-hacking the enterprise").

Plus, a lot of this data in these data wearhouses is messy. Often times data scientists are siloed at the end of the process, but then you get "garbage in/garbage out" results, where there is some bug in data tracking that isn't understood until it's too late. Much better in my opinion to have data engineers and data scientists work much more closely with product engineering teams up front so they can help ensure the data they collect is accurate.

agentofoblivion · on Feb 8, 2023

What “data driven” normally looks like: business people ask questions that are not answerable, but the DS team often goes off and tries to answer them anyway. I’ve been in so many meetings, in multiple organizations, where a senior leader asks a question like, “why did this number go up? We need an owner to deep dive this.” Then, someone disappears for a few hours or days and comes back with some nice narrative that more or less might make some sense, and is directionally consistent. Then, best case, they go “hmm, ok” and worst case, they say it’s an interesting view and we need to add it to the tracking. Then, move on to a new question + action item, rinse and repeat. And then, 6 months or a year later, some brave soul goes, “geeze, there are way too many things to look at here, can we streamline this reporting to focus on the handful of most important metrics, and then assign owners to review the others and surface anything interesting?” And then all those things people asked for over time get stuffed into an appendix or deleted, and the whole thing starts all over again.

tracerbulletx · on Feb 7, 2023

I find a lot of organizations don't have the discipline to harness whatever power their data may have. Sure collect everything, but god forbid you have any sort of data governance, or spend a single resource minute of time manually tagging or organizing or validating it. Then they try to make shitty ML models or products out of it, but don't care if the models actually work or not, just that they have AI now. Then a year later when the model has provided no value they are like, oh well big data is worthless I guess.

hef19898 · on Feb 7, 2023

Palantir, you have to have Palantir.

Oh, and a bunch of data scientists with zero domain knowledge for whatever data they are analyzing, preferably with PhDs in maths, but some ML background will do. And agile, because of course all those Palantir dashboards can only be developed using agile.

Once all is said and done, zero insight was created but a whole lot of consultants, contractors and project managers have been paid handsomely, while some higher ups can now put "implemented agile and big data at X" on their LinkedIn profiles.

vjk800 · on Feb 8, 2023

I'm one of those PhDs with zero domain knowledge analyzing data and I share the sentiment.

Most of my analyses provide very little value because they are sort of common sense to people with domain knowledge. When I ask people what could be more useful, one of two things usually happen: 1) it's impossible due to data and/or infrastructure limitations, 2) what they ask turns out to be nonsensical in further analysis (like asking for average of something that follows a very fat tailed distribution with a few observations dominating the phenomenon. Of course it's usually impossible to explain this to people).

The more I think about this, the more I think that in truly data powered companies, both the decision making and data analysis have to be carried out by more or less the same people. The organizational hierarchies have to be much flatter. Essentially the employees will have to be some kind of "secret agents" who have both the skills and the mandate to steer the company in the direction they see fit. I sort of see this already happening in the FAANG companies where, or so I hear, it's very difficult to get hired, the staff count is quite small compared to traditional companies and the senior engineers have a lot of power in the company.

Using math PhDs or Palantir or whatever as a sort of modularized black box for "insights" while giving them no real skin-in-the-game does not work.

hef19898 · on Feb 8, 2023

I can confirm this way of worling with data from my time at Amazon operations. We used data all the time, everywhere and for everything. But we did not have data scientist in our time, we did it ourselves. Quite peculiar, but so damn effecient and effective. I kind of miss that. It also showed that most of the data analytics stuff I know, Six Sigma, is just plain overkill for a lot of practical applications.

My favorite example is the WW2 bomber diagram shown to illustrate survivorship bias. Sure, working from data and first principle one could identify the vulnerable spots of the bomber. Or one could asl the designers or have an engineer, heck even a contemporary field mechanic, take a look at the actual drawings of the plane. And reach the same conclusion, faster, with a ton of additional insight and improvement ideas that can actually be implemented...

mritchie712 · on Feb 7, 2023

Size isn't the real problem, it's time.

Are you going to take the time / money to set up a warehouse, get all the data into with an ETL product, set up dbt or some other transformation layer, set up a BI tool and build the reports and dashboards, etc.

Regardless the size of your data, you still need to get it in one place and model it in a way it's actually usable.

didgetmaster · on Feb 7, 2023

Exactly. It isn't just time to set up all the data in a way that makes the right query possible. It is also having queries fast enough to be able to run a vast number of them in order to find what you are looking for (or even things you were not looking for).

https://didgets.substack.com/p/data-science-and-serendipity

stainforth · on Feb 8, 2023

Is it queries on live data or data thats been moved usually?

fijiaarone · on Feb 8, 2023

It’s moving the data around that is slow… and expensive. Getting the data into the data warehouse, then getting it to the processors then moving it around to filter and transform.

Getting your data to the cloud is expensive, but then you can’t do anything with it because distributing it to process in multiple stages is too expensive and you’re already paying so much to keep all that useless data.

olyjohn · on Feb 7, 2023

Can't we just give it to that one IT guy down in the basement?

mritchie712 · on Feb 7, 2023

Hey, I used to be that guy (and still am).

snapetom · on Feb 7, 2023

Agree on the size not being the issue. I transitioned from a data engineering manager to a data product manager at a new company. You know how much data a typical customer generates? Less than 1TB a year.

I told my VP that the engineering foul ups in the current product are easily fixable. Standard tooling and patterns exist to re-architect and solve the bottlenecks. What is much harder is a data architect to make sense of the complex data and make sure there is good value for our customers.

Guess what position I don't have on the team, and won't have due to budget issues.

capableweb · on Feb 7, 2023

Is that what's happening at Amazon as well? As they seemingly is loosing more and more track of the "Customer Obsession" schtick.

fuzzylightbulb · on Feb 7, 2023

"customer obsession" was always at the mercy of the real obsession: "making money hand over fist". The former will ALWAYS lose out to the latter given enough cycles.

pphysch · on Feb 7, 2023

Yeah, "customer obsession" really just means "market share / growth obsession" which is a means to (eventually) making monopoly profits. Which Amazon seems to have achieved.

LarryMullins · on Feb 7, 2023

All the amazon corporate values are derived from making profit. Two-pizza teams? More like "three slices isn't frugal", one or two should be enough for you.

ethbr0 · on Feb 7, 2023

One may follow the other, but not vice versus.

It's a pretty strong argument to say that Microsoft under Gates was technically obsessed, but that really faltered under Balmer.

Microsoft continued to win profits, but they made major strategic missteps that cost the revenue.

Amazon feels like it's going down the same path: empowering the tree-gazers without remembering that the forest also matters.

bombcar · on Feb 8, 2023

You can also “coast upwards” for quite awhile - as an example we now pay Microsoft less than we were for periodic upgrades to Office for the entire Microsoft 365 suite (including email hosting, etc) but all the machines are now Macs. They make more from us in one way, but less in total dollars.

ethbr0 · on Feb 8, 2023

Exactly. I can't imagine some accountant hasn't come up with a way to quantify this.

{Revenue attributable to previous R&D} (aka coast) vs {Revenue attributable to current R&D} (aka acceleration)

aintgonnatakeit · on Feb 7, 2023

They are encouraging their customers to have a bias toward action. Away from that asshole Bezos.

bluedays · on Feb 7, 2023

Their mission hasn't changed. They're still obsessed with customers, just not the way you think.

revolvingocelot · on Feb 7, 2023

It's absolutely this. "Decision-based evidence-making" is what I've seen it called.

hgsgm · on Feb 7, 2023

"We make decisions based on data, so let's use mine."

midasuni · on Feb 7, 2023

Take random words that Brent Spiner has said and mash them into a video.

boh · on Feb 7, 2023

The other thing to consider is that data simply has nothing of value. Part of the marketing of big data is the almost fairy tale belief in "insights" existing in any data set if you just look hard enough.

ryguyrg · on Feb 7, 2023

Correct, much data has no value. The cost for storing the data, maintaining the data [in the day-and-age of privacy requirements especially], and combing through the data is often much greater than the value obtained from the data itself.

The expertise we need in the industry is people who understand applications in-and-out and make great decisions on what data is worth keeping for present and future applications. And what data is needed to be kept, but only in aggregates (or anonymized, which reduces costs of maintenance)

antisthenes · on Feb 8, 2023

Wait, are you telling me GIS data of deer poop distribution in my backyard has no practical value?

But I have thousands of data points, thousands, I tell ya!

ren_engineer · on Feb 7, 2023

"data" is usually just twisted to make leadership look good or justify what decision they wanted to make anyway. Analytics data is sliced at arbitrary time periods to make growth in whatever metric look good, certain subsets are just removed, etc.

doesn't help that most of this data goes through multiple layers of BS where each person is putting it through filters to make themselves look better. And a good chunk of people don't have enough understanding of stats to understand when they are being tricked

RA_Fisher · on Feb 8, 2023

I think the recent industry layoffs reflect that in part.

Though I think there’s a better way— that is executive data science, what I do at Zapier. The key is that I’ve built up a huge amount of econometric and economic / business skills that I apply to affect good change in collaboration with company leaders. It allows me to work with Execs using sensible analysis. I improve growth and output by helping us catch errors of assumption before they go into production and cost growth / bad surprises. I also help the executives gain alignment around good information. This multiplies their departments’ output by allowing them to work better together, more in concert. That helps avoid issues with data being bent to decisions.

They typically carry a lot of hard-won valuable domain knowledge (that I combine with my economic-statistic knowledge and skills for rigor).

It’s my job to ensure Execs start with good sensible information regarding objectives. They usually ask fantastic questions and share a lot of great analysis of their own.

There are times I learn about what I might call controversial implications. This is typical of innovation using technology. It’s in these moments that I feel I create the most value by highlighting the trade offs I believe we face / potential regret.

bigbluedots · on Feb 8, 2023

That sounds amazing, what does your day to day consist of?

RA_Fisher · on Feb 8, 2023

I get to apply my causal inference and measurement skills to a lot of compelling situations in collaboration with some really intelligent and skilled people. I truly feel like I get to perform technically excellent work, and I feel very privileged in that way.

My day to day is mostly economic / econometric analysis, conceptualization and formulation of technology, and writing, but also occasional meetings to gather information, feedback, answer questions, teach and collaborate with others.

I try to effectively “sample” the org by interacting with people up and down the org, so I can assist Execs in incorporating directors, managers and ICs knowledge / pool good information / systematize planning, etc.

Simon_O_Rourke · on Feb 8, 2023

> Data Scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven".

I worked in a "data-driven" company of 5,000+ employees, with 4 data scientists who were spilt into two separate non-collaborating teams. In effect, they were so under resourced they got nothing done.

michaelcampbell · on Feb 8, 2023

I've found this to be true in the (albeit small) # of samples I've run up against.

There's an agenda by some level of management, and they use "data" to forward their agenda, or disregard it due to "unexplainability" (a legitimate concern for BD/ML) if it disagrees with the agenda.

hpen · on Feb 8, 2023

I would argue that insights from statistical models “[are] just one parameter in a complex equation.” No doubt PR from political issues can help the bottom line of a company. Can also hurt it. Just to be clear I’m not arguing political choices are inherently good or bad.

gymbeaux · on Feb 7, 2023

Replace “data scientists” with “software engineers” and you have another accurate insight. They don’t want to listen to us about how to write software or derive value from data.

commandlinefan · on Feb 7, 2023

> a huge amount of thought

Which itself both takes time and is wildly unpredictable - neither of play well with today's Taylorist managements schemes.

Consultant32452 · on Feb 7, 2023

I don't want you to tell me what the data says, I want to tell you what the data says and you go find data that confirms it.

kthnxbye

bcrl · on Feb 8, 2023

It doesn't matter what the data says if it's telling you to go in a direction that is unethical.

sportstuff · on Feb 7, 2023

No data is better than Bad data

naijaboiler · on Feb 8, 2023

Or even worse, bad understanding of data

aicharades · on Feb 7, 2023

deleted- length

LeanderK · on Feb 7, 2023

of course it won't and it's really ironic that you reply with the next hype

travisgriggs · on Feb 7, 2023

I've made anecdotal observations similiar to this over the last 10 years. I work in AgTech. A big push for a while here has been "more and more more data". Sensor-the-heck out of your farm, and We'll Tell You Things(tm).

Most of what we as an industry are able to tell growers is stuff they already know or suspect. There is the occasional suprise or "Aha" moment where some correlation becomes apparent, but the thing about these is that once they've been observed and understood, the value of ongoing observation drops rapidly.

A great example of this is soil moisture sensors. Every farmer that puts these in goes geek-crazy for the first year or so. It's so cool to see charts that illustrate the effect of their irrigation efforts. They may even learn a little and make some adjustments. But once those adjustments and knowledge have been applied, it's not like they really need this ongoing telementry as much anymore. They'll check periodically (maybe) to continue to validate their new assumptions, but 3 years later, the probes are often forgotten and left to rot, or reduced in count.

gffrd · on Feb 7, 2023

I've observed the same in manufacturing … and fitness trackers a la FitBit.

There's initial value from training yourself on what something looks/feels like … but diminishing returns after that. Whether there is more value to be found doesn't seem to matter.

Factories would sensor up, go nuts with data, find one or two major insight, tire of data, and then just continue operating how they were before … but with a few new operational tools in their quiver.

Same is true of fitness trackers: you excitedly get one, learn how much you really are sitting(!), adjust your patterns, time passes … then one day you realize you haven't put it on for a week. It stays in the drawer.

Not unless they're threatened with ruin will people make changes to the standard way of doing things. This is actually … not bad! Continuity is important, and this is kind of a subconscious gating function to prevent deviation from a proven way of working. So, the change has to be so compelling or so pressing that they're forced to. Not a bad thing.

While we think things change overnight in this world, they generally take awhile … stay patient … it's worth it.

pradn · on Feb 7, 2023

I went on a diet a few years ago. I obsessively recorded every food I ate in MyFitnessPal. To this day, I know roughly how many calories pretty much everything I eat is. So, I've learned from the process and don't need the process as much any more. (I'm kidding about that - it's easy to underestimate how much you eat, and an extra 200 cal a day adds up over the years.)

anthomtb · on Feb 8, 2023

I did something similar in my early 20’s. 15 years later the weight is gone and hasn’t come back. It is pretty nice being able to eyeball my plate of food at a BBQ knowing it will not add any flesh to my mid section.

I definitely did not need a multi-PB storage array to reach that goal. Or did I? I am sure someone out there knows roughly the human brain storage capacity in MB.

kulahan · on Feb 7, 2023

This is interesting, and I think you're on to something. I got a fitbit when I got serious about running because I had no idea what it felt like to run in zone 2. I can read about it, but that only gets you so far. While running I could actively check my heartrate and adjust. A year into the hobby and now I don't bring my fitbit on probably 50% of my runs. I know how fast I need to go.

jossclimb · on Feb 8, 2023

Where as a Garmin is a useful ongoing tool for me a runner. I have got through about six watches. I don't use gimmicks like step counters though, turned it off. "Congratulations you completed 10,000 steps!" - well duh, I ran 10 miles this morning.

red-iron-pine · on Feb 8, 2023

if youre running 10 miles per morning, even half that, on the regular you're an extreme outlier. the vast, vast majority of folks don't get anywhere near that, and a step counter that gamifies making them move is a good thing.

carabiner · on Feb 7, 2023

I used to track sooo much health and fitness data... Then I realized it mostly wasn't actionable, or at least, I wasn't altering my decisions based on it. The answer was always, "more training." So I stopped.

jossclimb · on Feb 8, 2023

I think they are only useful to runners, where weekly distance, speed and heart rate are massively useful. These are all very accurate things that can be captured, unlike weight resistance (swinging a kettlebell, dead lifting).

raindropm · on Feb 8, 2023

Your comment made me took my Fitbit out of drawer again :p At least I can have some fun number and graph about my health to look at when I want to.

barathr · on Feb 7, 2023

Classic paper on soil moisture sensors (from 2010!) -- the title says it all:

"Mate, we don't need a chip to tell us the soil's dry"

https://doi.org/10.1145/1858171.1858211

ff317 · on Feb 7, 2023

I tend to think the problem is the "random digging for correlations" part.

Having tons of data is a Good Thing, so long as you can afford the marginal cost of gathering and managing all that data so that it's ready at hand when you need it later.

It's how you use the data that makes all the difference. If you're facing an issue you don't understand at all, don't go digging for random correlations in your mountain of data to find an explanation.

Think like a scientist: you need a valid hypothesis first! Once you have a hypothesis about what your issue might plausibly be, then you make a prediction: "If I'm right, I suspect our Foobar data will show very low values of Xyzzy around 3AM every weekday night". Only then do you go look at that specific data to confirm or refute the hypothesis. If you don't get a confirmation, you need to go back to hypothesizing and predicting before you look again. You can't prove causation by merely correlating data.

counters · on Feb 7, 2023

> It's how you use the data that makes all the difference. If you're facing an issue you don't understand at all, don't go digging for random correlations in your mountain of data to find an explanation.

Absolutely. But in my experience, there's this massive trend across the tech world that flat out rejects the value of domain/subject matter expertise. Instead, all you need is an engineer who can throw some ML at the uncurated mountain of data your organization has collected. Little to no value is placed on the resources that can frame an actionable hypothesis, even though the entire value proposition arises from this exercise!

Maybe I'm just jaded. I end up wasting a lot of time trying to re-direct data scientists and engineers down more appropriate pathways than if the problem they're solving was just brought to my attention earlier. Sorry, I understand you spent two weeks shoe-horning dataset X into our analysis system for your work, but it's invalid for the question you're asking - use dataset Y instead, and you'll have an answer in an hour or two.

tbrownaw · on Feb 8, 2023

> But in my experience, there's this massive trend across the tech world that flat out rejects the value of domain/subject matter expertise. Instead, all you need is an engineer who can throw some ML at the uncurated mountain of data your organization has collected. Little to no value is placed on the resources that can frame an actionable hypothesis, even though the entire value proposition arises from this exercise!

Sounds like the data scientists need to get together with the MBAs and they can do companies where nobody needs to actually know what they're doing.

naijaboiler · on Feb 8, 2023

No youre not. The next time i have to listen to some idiotic take that domain knowledge doesn't matter, i might shoot myself

naijaboiler · on Feb 8, 2023

This!!!! The amount of data scientists i come across who don't get this. If you're not knowledgeable enough to form reasonable hypothesis about the data, you have no business touching data

daveguy · on Feb 8, 2023

Even better... Once you have ~20 hypotheses, you'll get one right!

PeterisP · on Feb 7, 2023

Fine-grained measurement is useful when you have options for fine-grained action.

You don't need a chip to tell you that the soil is dry, but if you can use that chip to regulate drip irrigation that can apply substantially different flow to different plants, then you can get a not-too-much, not-too-little watering even if you have a big variation in conditions.

You don't need a big analysis to acknowledge that everybody knows that a particular competitor has lower or higher prices and adjust your pricing; but doing that continuously on a per-product basis does require data and analysis.

ryguyrg · on Feb 7, 2023

Agreed. But how many executives will agree to take these fine-grained actions to achieve value from the data? How many data teams are able to build up a strong-enough argument to convince them?

I've worked on many product-led-growth initiatives in the software industry. The software industry is probably the biggest 'believer in data' there is -- many scientific-forward minds who understand the value. However, even in the software industry, it's really hard to convince folks that if you make 5 improvements that net 1% conversion gain each, you can dramatically improve revenue.

chudi · on Feb 7, 2023

Most of the time this story is true, but think this way, the person that was using the system was an expert on the subject. If you can replace the expert with a person just looking at a graph from time to time to know if you have to irrigate the soils it's a different thing. Most of the data or ML tools show us something that the client as an expert already knows, but the true power of this tools is to give them to a non expert user and have roughly the same level of proficiency

didip · on Feb 7, 2023

I've been telling folks, storing everything all the time is wasteful, a better alternative is:

1. Keep the raw full data for short period of time, at most 1 month.

2. Downsample what you need for longer period of time (5-10% of the full data).

3. Aggregate your metrics on a yearly basis to save money and compute costs.

tuatoru · on Feb 7, 2023

Yeah, there is perhaps a data usefulness duck curve. First data useful for specific immediate problems, not a lot of use for a while after that, then 15 or 20 years later, the big picture trends start to provide value for big decision making.

Not many orgs keep their data that long, though. Or even think about the future that far.

jschveibinz · on Feb 7, 2023

This analysis reminds me of the big interest in the use of hyperspectral imaging for agriculture. The idea was the greater spectral resolution (greater than Landsat) would result in more interesting information. Agriculture was one of the applications. But, once you did find the interesting stuff, you no longer needed a hyperspectral sensor. You could just look at one spot with a much lower cost sensor.

So hyperspectral, like big data, is useful up front. But in the end, much simpler tools and algorithms will solve the problem on a continuing basis.

azubinski · on Feb 7, 2023

Oh, those soil moisture sensors, they are so fascinating.

I spent a number of exciting year developing a high frequency soil impedance scanner and finally understood why I was doing it. To confirm the obvious :)

pron · on Feb 7, 2023

Interesting. Sounds as if what's really needed isn't so much collecting and analysing lots of data, but an alarm that's triggered when observations deviate from a set of assumptions. Observations that confirm some definition of "normalcy" -- as most observations would -- can be discarded.

ladyattis · on Feb 7, 2023

I think there's a problem at the heart of the matter, specifically the idea that the act of measurement is in itself powerful when in point of fact that this isn't universally the case. As the old adage goes: "garbage in, garbage out." Even more troubling, there is a physical limit to our ability to model what we measure. Take the retina, it has around a million light receptors and even if you assumed they only have two valid states then you're left with around 10^300,000 bits of information to process, so good luck with that. Same thing applies to whatever firms are measuring and what they think is conveying relevant information as they'll have similarly exponential increases if they don't filter out the vast majority of irrelevant data points and states.

gmfawcett · on Feb 7, 2023

> it has around a million light receptors and even if you assumed they only have two valid states then you're left with around 10^300,000 bits of information to process...

That would only be a million bits (1 Mb). You're counting potential states, not bits.

esel2k · on Feb 7, 2023

Very interesting. Left AgTech last year but had similar experiences, even worse where often the single most prominent use-case was to follow some painful necessary documentation of ag inputs (chemical, seeds, fertiliser) to get subsidies. Real inputs from data? Nah!

e12e · on Feb 7, 2023

But isn't this the essence of industrialization and automation? Messure, adjust process; repeat until feedback loop is stable - document and keep doing the thing that works, over and over?

If you want Toyota style continuous improvement you would need to improve in new areas of the process / new metrics, most of the time?

calvinmorrison · on Feb 7, 2023

On the flip side, there's some great action coming from data insights. Look at Strella Biotech - they're putting sensors in sealed warehouses to detect spoilage for certain vegetables and fruits. That's something that can have great returns with just a few IoT devices and a novel sensor.

fijiaarone · on Feb 8, 2023

Or they could put on their work boots and go walk around the field and kick a few dirt clods, or science forbid! put a hand in the soil to check the moisture content.

0xdeadbeefbabe · on Feb 7, 2023

> goes geek-crazy for the first year or so

The problem is that they don't stay geek-crazy?

ryadh · on Feb 7, 2023

While I get that they're sometimes useful to trigger debate, I don't really subscribe to very bold statements.

We are drowning in data, it's all around us. Information overload is real. Data enables most of our daily digital experiences, from operational data to insights in the form of user facing analytics. Data systems are the backbone of the digital life.

It's is an ocean and it's all about the vessel you pick to navigate it. I don't believe that the vessel should dictates the size of the ocean, it's simply constrained by it's capabilities. The trick is to pick the right vessel for the job, whether you want to go fast, go far or fish for insights (ok, I need to stop pushing on this metaphor )

This visionary paper from Michael Stonebreaker (2005) predicted it quite accurately and I think is still relevant: https://cs.brown.edu/~ugur/fits_all.pdf

Databases come in various flavours and the "trends" are simply a reflection of what the current era needs

Disclaimer: I work at ClickHouse

fuziontech · on Feb 7, 2023

100% agree. One of the biggest assets we had at <driver and rider marketplace app> was the data we collected. We built models on it that would determine how markets were run and whether drivers and passengers were safe. These were key features that enabled us to bring a quality service to customers (over ye ol' taxi). The same applied to the autonomous cars, bikes, and scooters. We used data to improve placement of vehicles to help us anticipate and meet demand. It was insane how much data used to build these models.

To say big data is dead sounds to me like someone desperate for eyeballs.

I do think there is a huge opportunity for DuckDB - running analytics on 'not quite big data' is a market that has always existed and is arguably growing. I've seen way too many people trying to use Postgres for analyzing 10 Billion row tables and people booting up an EMR cluster to hit the same 10 Billion rows. There is a huge sweet spot for DuckDB here were you can grab a slice of the data you are interested in, take it home and slice and dice it as you please on your local computer. I did this just this weekend on DuckDB _and_ ClickHouse!

Disclaimer: I work at a company that is entirely based on ClickHouse.

vgt · on Feb 7, 2023

Didn't know that Posthog is based on CH these days. Interesting!

zX41ZdbW · on Feb 8, 2023

Check the list of companies using ClickHouse: https://clickhouse.com/docs/en/introduction/adopters/

ayewo · on Feb 8, 2023

Really neat that you scour job postings to learn useful intelligence about companies using your product. I do this too :)

I'm curious how you have this set up. Is it currently a manual process or you use social monitoring tools to help you find mentions of ClickHouse in the wild?

gingerwizard · on Feb 8, 2023

Just use ClickHouse :) https://sql.clickhouse.com/play?user=play#U0VMRUNUICogRlJPTS...

ayewo · on Feb 10, 2023

Thanks for the reply :-) but your link is only for tracking mentions on the HN website.

I was asking about how they are able to track mentions, across the web, of companies using ClickHouse. This type of info is usually listed in the tech stack section of job descriptions (and these links tend to expire once the position is filled).

spopejoy · on Feb 7, 2023

I guess the article title is a "bold statement" but maybe the biggest insight in there is that people don't think hard enough about throwing old data away, and it hurts them. This is a liferaft for drowning in data and is more "bold" organizationally, as it actually takes a certain kind of courage to realize you should just throw stuff away instead of succumb to the false comfort that "hey you never know when you might need it".

Weirdly there's a similar thing that can happen to codebases, specifically unit tests and test fixtures that outlive any of their original programmers, nobody understands what's actually being tested and before each release lose days/weeks hammering to "fix the test". The only solution is to throw it away, but good luck getting most teams to ever do that, because of the false comfort they get -- even though that fixture is now just testing itself and not protecting you from any actual bugs.

I mean how often does Netflix need to look a viewing habits from 2015? Summarize and throw it away.

crazygringo · on Feb 8, 2023

I am baffled by this comment.

Throwing out unit tests? If you make a change and it fails a test, then you fix the bug or fix the test. I can't even imagine in what universe it's a good idea to throw away a test if it covers code in use. In what universe are unit tests "false comfort"? And if "nobody understands what's actually being tested" then you've got huge problems with your development practices.

Similarly, viewing habits from 2015 are tremendously important. There may be a show they're releasing soon that is most similar to a title released in 2015, and those stats will provide the best model. "Summarize" requires knowing how data will be used in the future, but will likely throw away what you need. Not to mention how useful and profitable vast quantities of data are for ML training.

Storing data is incredibly cheap. I'm actually curious where this desire to throw away old data comes from? I've literally never encountered it before, and it flies in the face of everything I've ever learned. The only context I know it from is data retention policies, but that's solely to limit legal liability.

fijiaarone · on Feb 8, 2023

Unit tests are only potentially of value if the code is changing. And 90% of code never changes. And 99% of unit tests never fail. Almost all of the value of unit tests come at the time of writing (a tiny percentage of) them.

After that, they become a liability that slows down builds, makes changes brittle and code based schlerotic.

A few good unit tests are a lot better than a bunch of bad ones. And even from your statement we can tell a much more pernicious risk — the false beliefs that code coverage measures whether code is tested and that a code coverage percentage is a mark of quality or safety in its own right.

spopejoy · on Feb 8, 2023

The unit test story is indeed bizarre. Done right unit tests should test the unit, and you'll never hit these problems.

The villians here were monstrous test fixtures instead of mocks, "testing the fixture" instead of testing the code. Both were agency trading systems so "platforms" of a sort that needed significant refactoring to mock properly, so instead tests had to inject essentially fake concrete services.

Somehow I joined teams twice in my career that were trapped under this (who both indeed had "huge problems with their development practices") as their only coverage. The only way out is to write all new unit tests.

fijiaarone · on Feb 8, 2023

Mocks are the personification of bad data. The only meaningful measurement derived from tests with mocks is how bad the architecture is.

spopejoy · on Feb 13, 2023

I don't know what you're criticizing here. I was contrasting "mocks" and "fixtures" in the context of unit tests as ways to instrument services depended on by the code under test.

A "mock" in this paradigm is some kind of testing technology that allows you to directly instrument return values for function calls on the dependent service, whereas a "fixture" is some concrete test-only thing you coded up to use in your tests.

If a fixture just acts as a dummy return-value provider, no problem (but you probably should have used a mocking solution). The problem that arises is fixture code that simulates some or all of the production service code, and/or (even worse) allowing modification of production code to allow use as a test fixture. This is the way to madness.

shswkna · on Feb 8, 2023

The heading is definitely “clickbait-ey” but the quality of the content was worth it. I probably would have missed the article without the headline. And I am already applying the insights gained.

taftster · on Feb 7, 2023

This posting was great. Highly recommended reading through. It gets really good when the author hits "Data is a Liability".

> An alternate definition of Big Data is “when the cost of keeping data around is less than the cost of figuring out what to throw away.”

This is exactly it. It's way too hard to go through and make decisions about what to throw away. In many respects, companies are the ultimate hoarders and can't fathom throwing any data way, Just In Case.

Really appreciated the post overall. Very insightful.

As an anecdote to this article, when business folks have come up to me and asked about storing their data in a Big Data facility, I have never found the justification to recommend it. Like, if your data can fit into RAM, what exactly are we talking about Big Data for?

CliffStoll · on Feb 8, 2023

  In a larger sense, it's a challenge to throw away stuff, just as it's difficult to trim big data.  

  As I reach retirement, our attic, bookshelves, and cabinets must be trimmed -- and each item requires attention and a decision.
  
  Some things in the attic are obvious liabilities (what to do with a mercury barometer? A radium dial pocket watch? Old electronics?)  Disposing of other stuff requires time, insight, and a sense of the future (should we keep those fingerpainted scribbles from when the kids were 3?  How about those cheesy trophies from chess club? Computer books from the 1970's? Betamax home movies? Record albums?)

selestify · on Feb 8, 2023

Where is this excerpt from? Did you write it yourself?

roeles · on Feb 8, 2023

For what it's worth: I throw things away when I haven't touched them in a year.

disqard · on Feb 8, 2023

> if your data can fit into RAM, what exactly are we talking about Big Data for?

That's a fantastic point, and I keep mentioning the COST paper to anyone who cares:

https://www.usenix.org/system/files/conference/hotos15/hotos...

DriverDaily · on Feb 8, 2023

COVID was proof The Right Data is better than Big Data. All those data sources to measure how many sick people we have and it turns out we just need one: Wastewater.

bombcar · on Feb 8, 2023

Or another way to look at it - if we make a data lake that collects everyone’s shit we might find something useful in it!

guardiangod · on Feb 7, 2023

There is literally a post on front page on ChatGPT, and Microsoft and Google are preparing to duke it out starting in the _next 2 days_ over big-data generated 'chat' result.

Big data was never going to be useful to even medium size enterprises, unless anyone can get public access to PBs of data, but that doesn't mean big data is dead. ChatGPT is literally changing how school will test their students, for a start.

Maybe what the author is trying to say is 'small-scale big data is dead, but big data chugs on.'

miguelazo · on Feb 7, 2023

>ChatGPT is literally changing how school will test their students, for a start.

Sure, instead of schools checking for plagiarism from other students' papers using turnitin.com, they'll check for plagiarism using ChatGPT tools that scan for known output from their industrial-scale amalgamation of plagiarized materials. Big whoop.

maliker · on Feb 7, 2023

It appears that it's hard to detect AI generated content. E.g. true detection rates are only around 25% and there are also techniques to further mask output [1].

[1] https://www.nbcnews.com/tech/innovation/chatgpt-can-help-foo...

bombcar · on Feb 8, 2023

You only need enough to flag suspect content, and then the teacher calls the student in for a quick oral exam - the fakers will flounder, the reals will pass.

wongarsu · on Feb 8, 2023

Even if you understand the material well, doing assigned tasks takes a lot of time. Especially if it's free form text. And "I want to do something more interesting right now" is at least as powerful a motivator to cheat assignments as "I don't know how to do this".

red-iron-pine · on Feb 8, 2023

Aye, if I mostly know a subject why bother to take the time to write 10 pages when I can outsource that to a bot, and if they question me I can do a 20 min oral interview and save myself the trouble?

seems like a no-brainer in the long term.

Loughla · on Feb 7, 2023

I mean, that's certainly part of it. But you're also seeing very fast restructuring of individual courses (because programs overall will definitely take years to restructure because higher education moves so damned slow) to account for these tools.

In the small institution I am currently working with, the English courses, in one week, integrated chatgpt as a tool for students to work with. It's part of the collaborative idea building and development process now for every student enrolled in creative writing and writing analysis classes, and that happened in one week. I cannot stress enough how unbelievably fast that is for higher ed. That's faster than light speed.

And we're not even that well resourced. I have to imagine there are other examples where it's more than just running through a bot to scan for known outputs.

stcroixx · on Feb 7, 2023

Sounds like they simply panicked and threw something together with little thought or preparation, what, right in the middle of an actual course? And they want to charge kids for this kind of 'expert instruction'? I'd be pissed as a student.

Loughla · on Feb 7, 2023

What a weird assumption you made there. In what way does what I wrote sound panicked? Because it was a week? Yes it was fast, but it was a massive effort of the entire english faculty.

It's integrated into existing assignments, modifying processes that are super well established already. It was like integrating a new person into the class. Also, it was before semester, so the students literally saw nothing weird; that was another strange assumption for you to make.

Just a tip, and don't read tone in this statement, but don't assume things. 9/10 times you're going to be incorrect. It's much better to ask questions, instead of making statements with question marks at the end of them.