Response to the ASA’s Statement on p-Values (errorstatistics.com)
66 points by leephillips on Jan 20, 2019 | 37 comments



Andrew Gelman, a pioneer of Bayesian statistical methodology (cited in the original article; he doesn't call himself a Bayesian in the philosophical sense), responded to the ASA statement as follows:

http://www.stat.columbia.edu/~gelman/research/published/asa_...

His conclusion there is that scientific institutions are not as accepting of systematic uncertainty as they need to be. Elsewhere he says that scientists who are not statisticians should concentrate on gathering quality data, because noise in data often leads to spurious point estimates:

https://statmodeling.stat.columbia.edu/2017/02/11/measuremen...

In the first link, Gelman talks of the "garden of forking paths": essentially a generalisation of the idea of p-hacking, recognising that even perfectly honest researchers will not conduct unbiased analyses, because of the myriad of parameters their estimates depend on. The solution is to move away from summarising results through point estimates and instead construct statistical models in which you can explore the space of possible analyses; there has been a revolution in the techniques for doing so through the application of Markov chain Monte Carlo (MCMC) methods to construct posterior distributions.
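To make that concrete, here is a minimal sketch of the MCMC idea (my own toy illustration, not Gelman's code): a random-walk Metropolis sampler drawing from the posterior of a coin's heads-probability. The data and step size are invented, and real analyses would use a library such as Stan or PyMC rather than a hand-rolled sampler.

    import math
    import random

    # Toy data: 7 heads in 10 flips (made-up numbers).
    heads, flips = 7, 10

    def log_posterior(p):
        """Log of the unnormalised posterior for heads-probability p under a uniform prior."""
        if not 0.0 < p < 1.0:
            return -math.inf
        return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)

    # Random-walk Metropolis: propose a nearby p, accept with the usual ratio.
    samples, p_current = [], 0.5
    for _ in range(20000):
        p_proposed = p_current + random.gauss(0.0, 0.1)
        log_ratio = log_posterior(p_proposed) - log_posterior(p_current)
        if random.random() < math.exp(min(0.0, log_ratio)):
            p_current = p_proposed
        samples.append(p_current)

    draws = samples[2000:]              # drop burn-in
    print(sum(draws) / len(draws))      # posterior mean, roughly 0.67

The point is that the output is a whole posterior distribution over the parameter, which you can interrogate in many ways, rather than a single point estimate.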


It seems clear to me that Popperian falsification is indeed the only way to separate the wheat from the chaff in theoryspace; however, generating theories generally requires inductive reasoning.

You observe data, you make an inference that leads to a theory (induction), you then subject that theory to falsificationist testing.

"It appears all the swans I've seen are white, therefore I posit all swans are white. Oh, wait, there's a black one, nevermind."

On the Bayesian vs Frequentist aspect... Falsification is what you should do to theories, not model parameters. If you have a coin and you're trying to figure out the probability of heads P(H), then you have your model of coin-flipping (Bernoulli process) and you're trying to estimate the model's parameter, so you do statistical inference given some sequence of coin flips.

It doesn't seem right to apply frequentist null hypothesis testing here, because you want to estimate the model parameter, not make some binary decision. What if you had some prior data you wanted to include? Or you observe new data in the future? This is exactly what Bayesian inference is set up for. And a lot of science is not about falsifying theories in theoryspace but about estimating model parameters in parameter space, in the case where we all agree on a particular model.
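For the coin example, a minimal sketch of that workflow (numbers invented) using the conjugate Beta-Bernoulli model: prior data enters as pseudo-counts in the prior, and any future flips update the same counts.

    from scipy.stats import beta

    # Prior: Beta(a, b). A flat prior is Beta(1, 1); earlier data (say 12 heads
    # and 8 tails seen before this experiment) can be folded in as pseudo-counts.
    a, b = 1 + 12, 1 + 8

    # New observations arrive; update the counts the same way.
    new_flips = [1, 0, 1, 1, 0, 1]      # 1 = heads (made-up data)
    a += sum(new_flips)
    b += len(new_flips) - sum(new_flips)

    posterior = beta(a, b)
    print("posterior mean of P(H):", posterior.mean())
    print("95% credible interval:", posterior.interval(0.95))

The output is an estimate of the parameter together with its uncertainty, which is what the question actually asks for.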

Moreover, a big advantage of Bayesian statistics is that it generally requires you to make your model and assumptions explicit, which makes the model much easier to scrutinize than a frequentist statistical test.


> It seems clear to me that Popperian falsification is indeed the only way to separate the wheat from the chaff in theoryspace; however, generating theories generally requires inductive reasoning.

A report analyzed the papers published in Nature over one year and found that very, very few of them actually met Popper's criteria for falsifiable hypotheses; in fact most started out with an exploratory aim and documented their findings. To adopt the Popperian idea that only falsifiable claims are science would therefore mean rejecting good science. One must also consider what opportunities would be missed if we forced good science to adopt strictly falsifiable hypotheses before research commences: every exploratory paper would need to be redefined (or rather, firstly defined) as a falsifiable hypothesis... in an attempt to please Popperians.

It is also a stretch to say that science must be empirical or carried out in a particular way to which falsifiability is congenial. There are good arguments that even philosophy[1] or mathematics can be considered sciences, and indeed they were (see Wissenschaft in Kant and Hegel for instance).

Furthermore, it's been argued that Popper's theory of falsification actually includes bad science (pseudoscience). From SEP[0]:

>Strictly speaking, his criterion excludes the possibility that there can be a pseudoscientific claim that is refutable. According to Larry Laudan (1983, 121), it “has the untoward consequence of countenancing as ‘scientific’ every crank claim which makes ascertainably false assertions”. Astrology, rightly taken by Popper as an unusually clear example of a pseudoscience, has in fact been tested and thoroughly refuted (Culver and Ianna 1988; Carlson 1985). Similarly, the major threats to the scientific status of psychoanalysis, another of his major targets, do not come from claims that it is untestable but from claims that it has been tested and failed the tests.

[0] https://plato.stanford.edu/entries/pseudo-science/#KarPop

[1] https://web.archive.org/web/20170621073301/http://www.philos...


Analyzing papers in a glam journal for a year isn’t a very good way to judge what counts as science in terms of what is publishable. (“Just because it’s in Nature doesn’t mean it’s wrong!”)

That said, I agree that most of science is about becoming less wrong (choosing the least worst among available models for how observations arose), and that type of model selection problem is what Bayesian approaches excel at.

“Considering the evidence that preceded this study (or trial, or series), what explanatory model does the weight of the evidence best support?”

This need not be a binary decision (witness Bayes model averaging) and it need not be static (Bayesian updating explicitly evolves from a prior, whether flat or subjective). But it better matches the way most people approach research and probability, imho. Extraordinary claims must be supported by extraordinary weights of evidence, and frequentist testing in a vacuum doesn’t enforce this intuition.
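As a toy illustration of that kind of weighing (my numbers, not from any real study): compare a "fair coin" model against a "coin with unknown bias" model by their marginal likelihoods. The ratio (a Bayes factor) expresses which explanation the data favour and by how much, rather than a binary accept/reject.

    import math
    from scipy.special import betaln

    heads, n = 61, 100                  # invented data: 61 heads in 100 flips

    # Model 1: fair coin, p = 0.5 exactly.
    log_m1 = math.log(math.comb(n, heads)) + n * math.log(0.5)

    # Model 2: unknown bias with a flat Beta(1, 1) prior on p.
    # Marginal likelihood: C(n, k) * B(k + 1, n - k + 1) / B(1, 1).
    log_m2 = math.log(math.comb(n, heads)) + betaln(heads + 1, n - heads + 1) - betaln(1, 1)

    print("Bayes factor (biased vs fair):", math.exp(log_m2 - log_m1))

With these made-up numbers the factor is modest (around 1.4), i.e. weak evidence for bias, which is itself informative: an extraordinary claim would need a far bigger weight of evidence than that.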

Disclaimer: I am a statistician and a part-time Bayesian. Not a zealot, and not a jihadist against subjective inference. I use empirical Bayes procedures when they improve results on an ongoing basis, and I avoid them when the effort is more than the expected benefit can justify. The tipping point has moved over time, as more, faster, and better tools have decreased the cost (effort) to obtain a posterior distribution over complicated models.


I don't think my position disagrees with yours. I make a distinction between theories in theoryspace and parameters in parameter space (given a model we all agree on). I define a theory as essentially a causal model and explanation of some phenomenon, whereas a model may or may not be causal.

Much of science (particularly the life sciences) is estimating model parameters, and not actually positing new theories. For example, if you're testing a small molecule drug to see if it lowers blood pressure, you want to know how much it affects blood pressure. This is a parameter estimation problem. We all agree that the molecule will get into the body and interact with other molecules, we just don't know how much these molecular interactions will affect blood pressure. Since Nature publishes a lot of life science stuff, it's no surprise that most of their papers are about parameter estimation given an implicitly agreed upon model.
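A sketch of what that framing looks like in practice (simulated, entirely made-up numbers): the question asked of the data is "how large is the effect, and how uncertain are we", not "yes or no".

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated trial: change in systolic blood pressure (mmHg); values invented.
    drug    = rng.normal(-8.0, 12.0, size=120)   # treated group
    placebo = rng.normal(-1.0, 12.0, size=120)   # control group

    diff = drug.mean() - placebo.mean()
    se = np.sqrt(drug.var(ddof=1) / len(drug) + placebo.var(ddof=1) / len(placebo))

    # Report the estimated effect with its uncertainty, not a binary verdict.
    print(f"estimated effect: {diff:.1f} mmHg "
          f"(95% CI {diff - 1.96 * se:.1f} to {diff + 1.96 * se:.1f})")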

This sort of parameter estimation is important and very different than positing new theories (e.g. quantum mechanics or germ theory). It doesn't make sense to talk about falsifiability of a parameter value. It only makes sense to talk about falsifying an entire causal theory.

"All swans are white." is a theory about swan-ness and should be falsified. "What proportion of swans are white?" is a parameter estimation problem and it doesn't make sense to talk about falsification.

Theories must be falsifiable and subject to falsification. I don't think this precludes gathering corroborating evidence for a theory and then updating your belief about that theory, but it is very easy to find corroborating evidence for theories so falsification is the most practical means of actually finding truth without fooling yourself.

i.e. it is very easy to posit two different theories that make the same predictions in some domain, therefore all evidence in that domain supports both theories. The only practical way to distinguish them is to find another domain where they make different predictions and falsify one or both.


Great points. It's one thing to say that a scientific theory should be falsifiable (as a necessary but not sufficient condition), but the Popperians lose me when they claim either (1) that the only valid scientific study is an attempt at falsification, (2) that corroboration (or in Popperian terms, a failed attempt at falsification) has no significance. Both of these tropes come up in this "response", where Ionides et al. use them to throw some pretty serious shade at the ASA:

> This approach to the logic of scientific progress, that data can serve to falsify scientific hypotheses but not to demonstrate their truth, was developed by Popper (1959) and has broad acceptance within the scientific community. In the words of Popper (1963), “It is easy to obtain confirmations, or verifications, for nearly every theory,” while, “Every genuine test of a theory is an attempt to falsify it, or to refute it. Testability is falsifiability.” The ASA’s statement appears to be contradicting the scientific method described by Einstein and Popper.

However, (1) completely ignores discovery science (something I have ranted about here previously), and implies fully accepting Popper's claim to have solved the centuries-old "problem of induction" [1]. Meanwhile, (2) seems to pop up pretty frequently even though I'm not sure Popper himself would have agreed, considering [2]:

> It is easy, [Popper] argues, to obtain evidence in favour of virtually any theory, and he consequently holds that such ‘corroboration’, as he terms it, should count scientifically only if it is the positive result of a genuinely ‘risky’ prediction

Ionides et al. conveniently left out the second half of that context. Hmm, if only there were some sort of way to quantify that "riskiness". Like some sort of theorem... maybe of the form P(H|E) = P(E|H)*P(H)/P(E)

[1] https://en.wikipedia.org/wiki/Problem_of_induction

[2] https://plato.stanford.edu/entries/popper/


That's a good point; I haven't actually spent a lot of time on Popper, though I was amused to find that under some interpretation of his argument, the most extreme varieties of obvious pseudoscience fall under science. In particular, I wasn't aware the article even addressed Popper, specifically what's in footnote ii:

>Strictly speaking any method can be made to falsify claims with the addition of a falsification rule. However, the rule must still be shown to be reliable.

At the very least, I think it's shaky to try and apply the same standard regardless of the actual method of the science, since different fields take different approaches. That's not to say that all approaches are valid, but one field's idea of science shouldn't necessarily need to hold up in another. Some sciences obviously benefit greatly (or did in the past) from even a simple statement of falsifiability, but for others it seems like a total mismatch for the problem domain.

I hadn't considered Bayes' rule to work that way (my only exposure to statistics is an intro course on information theory as part of an EE degree) but if I'm right, this opens up a discussion as to how risky is risky enough - is astrology still in the picture then?


Without having looked into it, my impression is that the popularity of astrology (and psychics, etc.) is down to the fact that they tend to make broad, vague predictions with a relatively high P(E) that people (emotionally) tend to underestimate. For instance "something bad will happen to you this month" -> "OMG, I stubbed my toe the other day, they were totally right". If we're accurate at assigning P(E), however, then we'll probably end up with a low P(H|E).

The prior P(H) is also arguably trouble. What's a good prior probability for the hypothesis that magic sky gods affect your fortune? I don't know, but I'd probably have to start somewhere pretty low.
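Here's a crude way to put numbers on that intuition (all figures invented): plug values into Bayes' theorem and see how much a confirmed prediction moves the posterior. The boost shrinks as P(E) grows, so vague astrology-style predictions buy almost nothing even when they "come true".

    def posterior(prior, p_e_given_h, p_e):
        """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
        return p_e_given_h * prior / p_e

    # A risky, specific prediction: unlikely to come true unless the theory holds.
    print(posterior(prior=0.01, p_e_given_h=0.95, p_e=0.02))   # ~0.48: big update

    # A vague "something bad will happen" prediction: likely either way.
    print(posterior(prior=0.01, p_e_given_h=0.99, p_e=0.95))   # ~0.010: barely moves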


> I was amused to find that under some interpretation of his argument, the most extreme varieties of obvious pseudoscience fall under science

You can't define science as 'the right answer', because that is unknowable (notwithstanding that there are a lot of things where, if we haven't figured out the truth, then the universe is set up to radically deceive us).

And if you accept that unknowability as a premise, then pseudoscience is a perfectly valid science. Something can be completely wrong and still be established in a highly scientific way.

Just because something is provably wrong doesn't make it unscientific. It just makes people who believe it provably wrong. They are likely to be stupid, I admit.


Short summary first, long rant follows. Summary: Mathematics and small parts of philosophy and many other disciplines rightly qualify as non-empirical science, but no lesson can or should be drawn from them about empirical science. Empirical questions are fundamentally different from non-empirical questions.

Long rant:

I don't disagree fundamentally, but what you say about philosophy irks me as a philosopher. If you looked at a random sample of publications in philosophy, you would find that only a very small percentage of them have the rigour and exactness that we associate with science. I have also met many philosophers, perhaps even the majority of all I've met so far, who would not describe themselves as scientists.

The remaining part of philosophy that adheres to strict standards and uses mathematical methods is akin to mathematics and computer science, and perhaps most similar to formal linguistics and economics in the way it works. I personally consider this part of philosophy a science; it is a kind of applied mathematics, although often speculative and conditional on axioms that are not as evident as in mathematics. Other disciplines have similar non-empirical parts; a typical example is Social Choice theory in sociology, which I would consider a scientific theory, although it is not empirical. In the end, it's applied mathematics.

I agree that mathematics and the relevantly rigorous and formal parts of other disciplines are non-empirical science, but to these areas the debate about hypothesis testing and the right use of statistics simply doesn't apply. It is a fallacy to presume that, because these fields exist, clearly empirical disciplines (or parts thereof) could do without statistics. If a question is empirical, then it has to be addressed with proper quantitative methodology.

This is extremely important to me personally, since I've been in deep disagreement with colleagues for many years about this issue. They work in related disciplines within our philosophy institute and habitually make "qualitative empirical analyses" of texts without treating them as mere precursors to quantitative studies. They see no problems with their methodology, even when I point out to them that the size of their samples would be too small to support the generalizations they make if they did conduct quantitative studies. To me, this is absolutely crazy: I just can't see how a qualitative analysis of 20 texts could allow you to take them as representative of hundreds of thousands of texts when a quantitative study of the same texts could not possibly reveal anything useful because the sample size is too small. What's worse, their whole discipline seems to be based on these kinds of extremely small-scale qualitative empirical studies, plus a very vague mix of fairly imprecise philosophy and common sense. I'm a nice person and get along with my colleagues well, so I won't ever tell them my opinion, but if I'm honest, I'd say that their discipline is pseudo-science or, at best, imprecise, non-scientific philosophy in disguise. (To make this clear, I have no problem with imprecise philosophy and have done it occasionally myself; I just don't think it can qualify as science, and not many philosophers consider it as such.)

Long story short, empirical questions have to be addressed with quantitative methodology, or you get what I'd call "elaborate opinions".


Sounds fair enough.

I do wish that physical scientists, including myself, had more philosophical training though, even if it's not science per se, such that we could reliably have original, educated opinions on empiricism, rationalism, etc.


Might I mention my book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP). https://twitter.com/learnfromerror/status/107197020637171302... I did not know of Hacker News, but was trying to trace why my blog errorstatistics.com got its maximal # of hits in over 7 years as a result of my posting their letter on the ASA Statement. Now I'll disappear.


Very interesting, I'll order your book tomorrow!


I'll have to check it out!


I feel (as a non-philosopher who only casually reads) as though the question of empirical methods in philosophy comes down in large part to critiques of positivism, operationalism and therefore also the Frankfurt School's aim to synthesise the social, economic, historical, philosophical (if you'll bear with me) and juristic sciences in a total approach rather than restricting to one area. It seems that one could avoid the onus to pursue quantitative methodology by rephrasing the question not as an empirical one. Your example of the twenty texts seems to call for an empirical analysis, but perhaps identifying "common themes in the genre" might not, depending on how the point is elaborated. It's another discussion as to whether such a question should be answered with empirical methods. To me it's a mistake to conflate rigor with empirical analysis, but you seem to agree when you talk about mathematics. Philosophical arguments can be rigorous, but a lot of what counts as some flavor of philosophy (the synthesis mentioned above) is critical and some really does make the claim to be rigorous. So it seems there's now a choice: either to say philosophy is not a science because it is not rigorous (due to some of its elements), or to say that philosophy can be scientific, just as information theory or theoretical physics, or theoretical computer science can be scientific.

Many philosophers don't consider themselves scientists but I think this is at least in part because the mainstream conception of science really is empirical rather than rigorous analysis. In fact, even non-rigorous empirical analysis can pass for science. Depending on the area, I think that qualitative methodology can capture what quantitative can't, or where quantitative would be unreliable or impossible to use without changing the result significantly.

In a nutshell, philosophy can be scientific if a method is pursued rigorously, but a lot of what is called philosophy is critical theory which is important if not scientific (despite using qualitative and quantitative analyses where appropriate). I have a feeling that critical theorists would question the separation we're erecting between science and non-science, especially those who view critical theory as in part an artistic endeavor too.


> "It appears all the swans I've seen are white, therefore I posit all swans are white. Oh, wait, there's a black one, nevermind."

That's such a tedious way to test a theory of swan color. You have to go find swans, which tend to live in hard-to-reach, dirty, obnoxious environments.

It's much easier to rephrase the hypothesis into a logically exactly equivalent hypothesis, "All non-white things are not swans", and test that instead. I can test that without even leaving my house. Just looking around the room I'm in there are hundreds of non-white things, and none of them are swans.

This situation is called Hempel's paradox, also known as the raven paradox or the paradox of indoor ornithology [1].

[1] https://en.wikipedia.org/wiki/Raven_paradox


The "black swan" story always struck me as ridiculous in confusing a phrase informally indicating rarity with a scientific belief that something doesn't exist or can't happen.


It also confuses mathematical formalism with scientific method; the former is built on rigor, the latter trades exclusively in approximation.


Yeah the article is trying to associate "using p-values" with "The (Only) True Way Of Doing Science™" which to me seems like a fallacy in itself.


Obligatory xkcd: https://xkcd.com/2078/


The authors of the letter use “induction” to mean what I call a probabilism. Here probability is used to quantify degrees of belief, confirmation, plausibility support–whether absolute or comparative. This is rooted in the old idea of confirmation by enumerative induction. Conclusions of statistical tests go strictly beyond their premises, but it does not follow that we cannot assess the warrant of the inference without using a probabilism. They are qualified by the error-probing capacities of the tools. A claim is severely tested when it is subjected to, and passes, a test that probably would have found it flawed if it is flawed. The notion isn’t limited to formal testing, but holds as well for estimation, prediction, exploration and problem solving. You don’t have evidence for a claim if little if anything has been done to probe and rule out how it may be flawed. That is the minimal principle for evidence.

Popper spoke similarly of corroboration, only he was unable to cash it out. He wasn’t sufficiently well versed in statistics, and anyway, he wanted to distinguish corroboration from induction, as the latter was being used at the time. The same impetus led Neyman to speak of inference as an act (of inferring). I explain all this in my recent book, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo 2018, CUP). As I say there,

“In my view an account of what is warranted and unwarranted to infer – a normative epistemology – is not a matter of using probability to assign rational beliefs, but to control and assess how well probed claims are.” (p. 54)
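A rough numerical sketch of how such a severity assessment can look for a simple one-sided test of a normal mean (toy numbers only; this is a simplified illustration, not the full treatment): after the test rejects H0: mu <= 0, ask how probable a result as large as the observed one would be if mu were only mu1, for the stronger claim "mu > mu1".

    from scipy.stats import norm

    def severity(xbar, mu1, sigma, n):
        """P(X-bar <= observed x-bar; mu = mu1): how capable the test was of
        revealing it, were the claim 'mu > mu1' false at the boundary."""
        return norm.cdf((xbar - mu1) / (sigma / n ** 0.5))

    # Toy numbers: sigma = 10, n = 100, observed mean 2.0 (so H0: mu <= 0 is rejected).
    for mu1 in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(f"SEV(mu > {mu1}): {severity(2.0, mu1, 10.0, 100):.2f}")

The stronger the claim, the lower its severity: the inference is qualified by the error-probing capacity of the tool.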


Okay, so we agree with some of these points. But after reading the ASA statement, it seems like this response was written against something other than what I read. The ASA statement does not seem to be fundamentally dismissing p-values. On the contrary, it seems to be providing guidelines for researchers and reviewers on how to use them properly.


I think it is the way they blithely mention that some statisticians prefer to use other methods, with a list of examples, that suggests they are blessing them. Surely these other methods ought to be scrutinized; we don't know that they would detect irreplication as significance tests do. The big issue is really all about using frequentist methods with biasing selection effects, multiple testing, cherry-picking, data-dredging, post hoc subgroups, etc. The only problem is that many who espouse the "other methods" declare that these data-dependent moves do not alter their inferences. Some are against adjustments for multiplicity, and some even deny that error control matters (this stems from the Likelihood Principle). If you consider the ASA guide as allowing that (in tension with principle 4, against data dredging, when it comes to frequentist methods), then the danger the authors mention is real. What was, and is, really needed is a discussion about whether error control matters to inference.


Hume would have a fit. All scientific knowledge is inductive!


Yeah, or Francis Bacon, or any of the original Empiricists.

I find a lot more to like in the ASA's statement [1] than in any of these responses, which seem to act as if Karl Popper were the only one to ever have a worthwhile philosophy of science.

This paragraph of the response was particularly telling to me:

> A judgment against the validity of inductive reasoning for generating scientific knowledge does not rule out its utility for other purposes. For example, the demonstrated utility of standard inductive Bayesian reasoning for some engineering applications is outside the scope of our current discussion.

Translation: Ok, so maybe induction works fine when you're going to build a bridge where someone's life is on the line, but it still has no place in science. Falsification or bust!

[1] https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1...


It's ridiculous. Even the idea that frequentist use of p-values is deductive is garbage. Where do they think the models being tested are coming from? Induction!


Induction is impossible!


Well, it's worked so far, so it should keep working, no?


I'm guessing here that ASA is 'American Statistical Association', rather than say, the UK's 'Advertising Standards Authority'?


One of the joys of programming is that, unlike maths, when discovering an exciting new paradigm there is an opportunity to rationalise all the notation and use a new language.

By contrast, I struggle to deal with probability and statistics without developing a strong suspicion that the names the objects are called by are completely different from what those names mean in common English.

It is nice to see ongoing authoritative commentary that the large majority do not understand what a p-value actually implies. The thread of discourse seems to be that, even assuming all the academics are completely honest (i.e., no academic fraud, no hand-waving), the number of false results awarded statistical significance is much higher than it should be. The standard p-value threshold of 5% does not imply that 95% of statistically significant studies are genuine rather than chance findings, particularly amongst the subset that make it into the public eye.
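A quick simulation of that point (parameters invented purely for illustration): if only a small fraction of the hypotheses being tested are real effects, then far more than 5% of the "significant" findings are false positives, even with every academic acting honestly.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n_per_study = 10_000, 50
    true_effect_rate, effect_size = 0.1, 0.5    # assumed numbers

    false_pos = true_pos = 0
    for real in rng.random(n_studies) < true_effect_rate:
        data = rng.normal(effect_size if real else 0.0, 1.0, n_per_study)
        if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
            if real:
                true_pos += 1
            else:
                false_pos += 1

    print("share of significant results that are false:",
          false_pos / (false_pos + true_pos))

With these made-up inputs roughly a third of the "discoveries" are false, and the publicised subset is selected from exactly that pool.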


The language of statistics is indeed dishonest. In normal language "significant" has a meaning close to "large". What it means in statistics is only that there is a detectable signal: data this extreme would have had less than a 5% chance of arising from pure noise.

They should call it "detectable"; some people have suggested "discernible".

However, there is another fatal flaw in how p-values are used. They are usually used to reject an infinitesimally small hypothesis. The null hypothesis is stated as "the effect exactly equals 0.00000000000...". In practice, there are no experiments that have exactly zero effect. There is always at least a very small systematic bias due to imperfectly calibrated instruments or small methodological variations between researchers.

Even if you do everything else right, even if you pre-register your study, with enough data a null hypothesis test will always pick up on these small biases and make the results significant.

If you are looking to reject a null hypothesis, I can tell you in advance: all experiments have a non-zero bias, so all results are statistically significant (p < 0.000000001) with enough data. There, I just saved the scientific world a ton of money; they don't have to do all these experiments. Just reference this comment in your paper to show significance.

At least here, the language seems honest. Rejecting a null hypothesis correctly conveys the idea of having rejected nothing.

Using a Popperian approach is great, but you should reject a portion of the hypothesis space that is bigger than zero.
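This is easy to demonstrate in a simulation (toy numbers): give the data a negligible systematic bias and keep increasing n, and the point-null p-value eventually drops below any threshold you care to name.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    tiny_bias = 0.01    # e.g. a slightly miscalibrated instrument (made-up size)

    for n in (1_000, 100_000, 1_000_000):
        data = rng.normal(tiny_bias, 1.0, n)
        p = stats.ttest_1samp(data, 0.0).pvalue
        print(f"n = {n:>9,}   p = {p:.1e}")

Nothing about the world changes between the rows; only the sample size does.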


"They" is a pretty broad term.

Physicists' convention is to call a 99.7 per cent (3 sigma) effect a "hint" and only a 5 sigma effect a detection.
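For reference, those thresholds translate into (two-sided) tail probabilities roughly as follows; a quick sketch using the normal distribution (particle physics conventions sometimes quote the one-sided tail instead):

    from scipy.stats import norm

    for sigma in (2, 3, 5):
        p = 2 * norm.sf(sigma)          # two-sided tail probability
        print(f"{sigma} sigma: p ~ {p:.1e}  ({100 * (1 - p):.5f}% within)")

So the 5% convention corresponds to roughly 2 sigma, the 99.7% "hint" to 3 sigma, and a "detection" requires data that pure noise would produce less than about once in a million tries.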


This isn't the case. I can make two hypotheses that are contradictory. Both aren't going to be statistically significant from the same data, no matter how biased my equipment or how large the sample.


Yes, the problem is that with null hypothesis testing, the two contradictory hypotheses are usually "the effect is exactly 0.00000000..." and "there is an effect different than 0.000000000...". The problem stems from the fact that people pit an infinitesimally small hypothesis against an infinitely large one. Rejecting an infinitesimally small hypothesis is very, very easy. The tiniest of experimental biases will allow rejection if you have enough data.

It would be impressive if "There is an effect different than zero" was rejected (but no one would ever be able to do this). Science should try to reject finitely large hypotheses, at a minimum something like: "effect is larger than some reasonable margin to account for experimental bias and imperfect tools". At least a small chunk of the hypothesis space should be rejected for your experiment to be worth something. You sorta get that with confidence intervals since you can see how far the lower bound is from zero.
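One simple way to act on that (a sketch with invented numbers): instead of testing "effect = 0 exactly", pre-specify a margin that covers plausible bias and only claim an effect if the whole confidence interval clears it, i.e. reject "effect <= margin" rather than the point null.

    import numpy as np

    rng = np.random.default_rng(3)
    data = rng.normal(0.3, 1.0, 5_000)   # made-up measurements
    margin = 0.1                         # pre-specified allowance for bias and calibration error

    mean = data.mean()
    lower = mean - 1.96 * data.std(ddof=1) / np.sqrt(len(data))

    print(f"estimate {mean:.3f}, 95% CI lower bound {lower:.3f}, margin {margin}")
    print("effect exceeds the margin" if lower > margin
          else "cannot rule out a bias-sized effect")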


I mean, you can still run into issues here:

The effect size is <1% and the effect size is >=1%.

Or 100 different hypotheses for 1-100% effect sizes. They're mutually exclusive, so only one will be true.

So again, while in general I agree with you that Bayesian methods have significant advantages, this objection isn't well founded.


Not sure I follow you there. Are you suggesting that better terminology would help? Even if there were better names, it's pretty hard to change words that are so entrenched.

At any rate, "p-value" is a made up, artificial word, certainly better than using a common existing word (such as "likelihood" or "significance") which would be even easier to misunderstand.

The fundamental problem is that hypothesis testing happens within a whole theoretical framework, and the jargon refers to things well defined and understood within that framework. I think there is just no way of breaking it down further (though the implications and typical limitations of the research could maybe be communicated better).


> One of the joys of programming is that, unlike maths,

This does happen in maths all the time. Pretty much all new results and new techniques also introduce new notation and language, which is then rationalized in further works by the community.

What tends not to change all that much currently is the notation used in undergrad textbooks. However, going back more than a few decades, books on, for instance, calculus look very different and use very different language (e.g. infinitesimals, topology, focus on power series, etc.), and similarly for linear algebra or complex analysis.



