I thought the back and forth between Johann and others deep in that thread added...

Bartweiss · on May 24, 2017

A line from Andrew Gelman I really appreciate:

"Again, they’re placing the original study in a privileged position. There’s nothing special about the original study, relative to the replication. The original study came first, that’s all. What we should really care about is what is happening in the general population."

There are two very different questions about replication.

One is whether the study got its results by chance, including forced-chance techniques like forking paths and salami slicing. This can be handled with either preregistration or exact replication. (And at p < .05, replication is a must, because even with no forcing at all, about 5% of tests of a nonexistent effect will still come up significant!)
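
To put rough numbers on the chance question, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from the studies under discussion: two groups drawn from the same normal distribution (so there is no real effect), a two-sided z-test with known variance, and hypothetical names like `p_value_null` and `false_positive_rates`. It estimates how often a single honest test crosses p < .05 by luck, and how often an independent exact replication crosses it too.

```python
import random
from statistics import NormalDist, mean


def p_value_null(n, rng):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1).

    Assumption for illustration: variance is known to be 1, so the
    difference in means has standard error sqrt(2 / n).
    """
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(trials=4000, n=30, alpha=0.05, seed=0):
    """Return (rate of one significant result, rate of two in a row)."""
    rng = random.Random(seed)
    single = 0
    both = 0
    for _ in range(trials):
        original = p_value_null(n, rng) < alpha
        replication = p_value_null(n, rng) < alpha
        single += original
        both += original and replication
    return single / trials, both / trials


if __name__ == "__main__":
    single, both = false_positive_rates()
    print(f"original alone significant: {single:.3f}")
    print(f"original and replication significant: {both:.3f}")
```

Under these assumptions the first rate should land near alpha and the second near alpha squared, which is the sense in which exact replication handles pure chance. It says nothing about forced chance or about methodology.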

But the other is whether the study got its non-chance results by methodological flaws or an actual insight about the world. Exact replications are no good for this - doing the wrong thing twice is no better than doing it once. The power poses study, for instance, used testosterone sampling procedures that introduced known confounders. What would help is a study equivalent of N-version programming: settle, preferably preregistered, on multiple tests for the same effect. If they all work, you win. If some work (repeatably) and others don't, you've either made a design error or found a different effect than the one you were looking for.

This also explains how to work your confidence levels (a topic discussed in that SSC thread). You can't replicate a study endlessly and gain confidence every time. Given a prior for P(effect), exact replications only boost P(effect ∪ bad study), so your P(effect) belief is capped by the odds of a methodology error: if a flawed design replicates as reliably as a real effect would, no number of exact replications can tell the two apart. It's a point I'd never considered until that SSC post, and one a lot of actual researchers still seem to miss.
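
The ceiling on confidence can be sketched as a small Bayesian update. This is a toy model under loud assumptions, none of which come from the thread: three mutually exclusive hypotheses (real effect, flawed method producing an artifact, plain null), a flawed method that replicates exactly as reliably as a real effect, and made-up priors and rates. The function name `posterior_effect` and all parameter values are hypothetical.

```python
def posterior_effect(k, p_effect=0.3, p_flaw=0.2, power=0.8, alpha=0.05):
    """P(real effect | k successful exact replications) in a toy model.

    Assumptions (illustrative only):
      - a real effect replicates with probability `power`
      - a methodological artifact replicates with the same probability
      - a pure null "replicates" only by chance, with probability `alpha`
    """
    p_null = 1.0 - p_effect - p_flaw
    like_effect = power ** k
    like_flaw = power ** k  # the flaw reproduces just as reliably
    like_null = alpha ** k
    evidence = (p_effect * like_effect
                + p_flaw * like_flaw
                + p_null * like_null)
    return p_effect * like_effect / evidence


if __name__ == "__main__":
    # With these priors the posterior can never exceed this ratio.
    ceiling = 0.3 / (0.3 + 0.2)
    for k in (0, 1, 2, 5, 10):
        print(k, round(posterior_effect(k), 4), "ceiling:", ceiling)
```

In this model each replication rules out the plain null quickly, but the posterior for a real effect levels off at p_effect / (p_effect + p_flaw). Only a differently designed test, one the flaw would not also pass, can move it past that ceiling.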