Hacker News

I've often heard the Big Data guys hype that there's no sampling in Big Data: you have the whole data set, so it's not exactly statistics.


I've heard this too and it's a great way to demonstrate you don't really know what statistics is :)

Statistics is not (just) opinion polling; there's a lot more to it than estimating observable properties of a population.

If you're trying to make decisions, predictions or estimates which involve any uncertainty at all (and in my experience big data almost always does), then it's definitely within the purview of statistics, even if you have data for the whole population.

Sources of uncertainty include trying to say anything at all about the future (do you have data on the future population? no, didn't think so...), trying to make predictions which generalise to new data, and trying to uncover underlying trends or patterns behind the data you see which aren't directly or fully observed.

Often people expect big data to be able to answer big numbers of questions, estimate big numbers of quantities, or fit big, powerful predictive models with lots of parameters. In these cases statistics can be particularly important to avoid reporting false positives and to make sure you can quantify how certain you are about your results and your predictions (amongst other reasons).
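
To make the false-positive point concrete, here is a minimal sketch in Python. It assumes every null hypothesis is true and that p-values are well calibrated (so they are uniform on [0, 1]); the function name and the choice of a Bonferroni correction are illustrative assumptions, not something from the comment above.

```python
import random

def count_false_positives(num_tests, alpha, seed=0):
    """Simulate num_tests hypothesis tests where every null hypothesis is true.

    Under a true null a calibrated p-value is uniform on [0, 1], so the
    p-values are drawn directly instead of simulating any data.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]
    # Naive approach: call anything below alpha a "discovery".
    naive = sum(p < alpha for p in p_values)
    # Bonferroni correction: divide the threshold by the number of tests.
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return naive, bonferroni

naive, corrected = count_false_positives(num_tests=10_000, alpha=0.05)
print(f"naive 'discoveries': {naive}, after Bonferroni: {corrected}")
```

With 10,000 questions and nothing real to find, the naive threshold still flags roughly 5% of them, while the corrected threshold flags almost none. Bonferroni is only the simplest correction; it is conservative, and others trade that off differently.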


Not to mention: having all the data, and comprehending all the rows on an individual level, are two very different things. Doubly so if the data is irregular (I'm currently doing fuzzy matching on really mangled street address data. ICK).

Once you hit millions of rows, it's not humanly possible to survey the data. All you can do is make assertions about the data's structure / the buckets it will fall into. You then try to disprove that assertion, or establish an error bound on it. You will never see all the data, only the results of assumptions you've made about it.
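
A rough sketch of the "make an assertion, then bound its error" idea, in Python. Everything here is hypothetical: the synthetic address rows, the `starts_with_number` assertion and the function name are made up for illustration, and the margin uses Hoeffding's inequality, which is one of several ways to get a bound. Sampling like this is mostly useful when checking a row is expensive (for example, a human has to look at it).

```python
import math
import random

def estimate_violation_rate(rows, assertion, sample_size, delta=0.05, seed=0):
    """Estimate how often `assertion` fails, from a random sample of rows.

    Returns (rate, margin). By Hoeffding's inequality the true violation
    rate lies within rate +/- margin with probability at least 1 - delta.
    """
    rng = random.Random(seed)
    # Sample with replacement so the draws are independent.
    sample = rng.choices(rows, k=sample_size)
    violations = sum(1 for row in sample if not assertion(row))
    rate = violations / sample_size
    margin = math.sqrt(math.log(2 / delta) / (2 * sample_size))
    return rate, margin

# Hypothetical data: every 50th row is "mangled" and has no street number.
rows = [f"{i} Main St" if i % 50 else "Main St ???" for i in range(200_000)]

def starts_with_number(row):
    # Hypothetical assertion about the data's structure.
    return row[:1].isdigit()

rate, margin = estimate_violation_rate(rows, starts_with_number, sample_size=10_000)
print(f"violation rate ~ {rate:.3f} +/- {margin:.3f}")
```

The Hoeffding margin is loose when the violation rate is far from 50%, so a tighter interval is usually available; the point is only that the claim about the data comes with a stated error bound.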


The refined pieces of information that people can look at to make decisions are called "statistics".


Presumably you want to draw an inference of some sort from the data. Otherwise what's the point of even looking at it?


From my distant memory, if your sample size is the pollution it's still statistics.


Population, I of course meant to say!


you mean Pig Data?



