Hacker News
Interpreting A/B test results: false positives and statistical significance (netflixtechblog.com)
76 points by ciprian_craciun on Oct 29, 2021 | 21 comments



It's probably a good idea to remind (or inform) people that at least in scientific research, null hypothesis statistical testing and "statistical significance" in particular have come under fire [1,2]. From the American Statistical Association (ASA) in 2019 [2]:

"We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.

Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless."

[1] The ASA Statement on p-Values: Context, Process, and Purpose - https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1...

[2] Moving to a World Beyond “p < 0.05” - https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1...


The ASA recently published a new statement which is more optimistic about the use of p-values [1]. I myself also think that correctly used p-values are in many situations a good tool for making sense out of data. Of course, a decision should never be based on a p-value alone, but the same could be said about confidence/credible intervals, Bayes factors, relative belief ratios, and any other inferential tool available (and I'm saying this as someone who does research in Bayesian hypothesis testing methodology). Data analysts always need to use common sense and put the data at hand into broader context.

[1] https://projecteuclid.org/journals/annals-of-applied-statist...


Is the nuance here that the ASA is OK with p-values but not OK with the rhetorical phrasings around statistical significance? My take is that it is easy to casually misinterpret or misrepresent statistical results because of how fuzzy the language around it all is. Phrases like "statistically significant" imply a certain kind of causality to the reader, when the actual rigorous claims are very specific and nuanced. Moving away from such soft phrasings might mean people have to stick to precise and narrow claims, whereas the normalization of soft phrasings makes room for bad claims or bad interpretations.


I read through the first link you posted and couldn't find any ideas about what we could use instead of p-values.

When applied correctly, statistical tests are a very useful and objective method of determining whether the outcomes of one thing/activity are more desirable than another.

Some solutions could be to set a higher bar for statistical analysis education. Or perhaps a more thorough statistically focussed vetting and peer review process for published material?

... and after reading the second link I note that it includes commentary around the insurmountably difficult challenge of replacing the current paradigm. While not offering a solution, it goes on to provide some great advice for how to improve the quality of statistical research.

Following all of the advice to the letter would make it almost impossible to conduct valid research.

Maybe this is where economists have the upper hand? Just develop some highly abstract mathematical models representing the topic you're interested in, and abstain from rigorously testing them.


Bayesian hypothesis testing, including Bayes factors, might be more useful.


The second link addresses what should be done instead. It's not entirely satisfying as there isn't yet consensus for a replacement, if such a thing is even possible.


It's worth pulling the principles from the ASA's 2016 statement [1] as well:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  
  4. Proper inference requires full reporting and transparency.
  
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The basic criticism is one of brittleness: unless very carefully planned, executed, and interpreted, p-values from hypothesis tests do not support the claims some would like to make about their results, and meeting that first condition is so difficult that the technique should not be recommended. One _should_ look for 'significant' results, but using measures that align better with colloquial understandings of significance, i.e. with how users are misinterpreting p-values now.
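
To make the false-positive framing concrete, here's a minimal simulation sketch (made-up normal data and SciPy's two-sample t-test, not anything from the article): when the null is actually true, roughly 5% of experiments still cross p < 0.05, and that 5% is a property of the threshold, not evidence for the hypothesis.

  # Minimal A/A simulation: both groups come from the same distribution,
  # so every "significant" result is a false positive.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n_experiments, n_users = 10_000, 1_000
  false_positives = 0

  for _ in range(n_experiments):
      a = rng.normal(size=n_users)   # control
      b = rng.normal(size=n_users)   # "treatment" with no real effect
      _, p = stats.ttest_ind(a, b)
      if p < 0.05:
          false_positives += 1

  # Prints roughly 0.05: the threshold controls the false positive rate;
  # it does not tell you the hypothesis is true (principles 2 and 3).
  print(false_positives / n_experiments)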


> P-values do not measure the probability that the studied hypothesis is true

So, what’s the best way to measure the probability that the studied hypothesis is true?


The argument against p-values is part of the argument against any bright-line single-number rule for identifying truth. The job of the researcher is to demonstrate (at least) that

1. there is (isn't) an observable difference between groups of interest

2. the difference is (not) attributable to the hypothesized causal mechanism, i.e. the (absence of a) difference isn't due to random variation in the observed sample, i.e. the difference would be observed by an independent replication of the same analysis/study

3a. the difference is not explainable by other factors that vary between the groups, observed or unobserved

3b. the difference is not artificially inflated (suppressed) by the statistical choices

4. the difference is large enough to be practically relevant.

and so on

If the degree of certainty of statements about the difference can be characterized by a single number at the end of the process, great! But the goal should be a convincing, holistic story, not the single number.


I share your concern, and I worry that we'll find this battle continues 20 years from now.

There are many possible things that can go wrong with a p-value, and I'm not a statistician, but the things I look for in data are the structure/distribution of the "noise" and any correlations within the "noise" and between the "noise" and the "signal". That helps you build a signal and noise model. Assuming that all your noise is inherently uncorrelated Gaussian is a pretty strong model assumption.
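
For illustration, a rough sketch of that kind of check on simulated data (a hypothetical linear trend plus smoothed, i.e. correlated, noise; not anyone's actual pipeline):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  t = np.arange(500)
  signal = 0.01 * t                                            # slow trend
  noise = np.convolve(rng.normal(size=519), np.ones(20) / 20,  # smoothed, i.e.
                      mode="valid")                            # correlated noise
  y = signal + noise

  # Remove a fitted linear trend and inspect what is left over.
  residuals = y - np.polyval(np.polyfit(t, y, 1), t)

  # Distribution check: is the "noise" plausibly Gaussian?
  print("normality p-value:", stats.normaltest(residuals).pvalue)

  # Correlation check: lag-1 autocorrelation should be near 0 if uncorrelated.
  print("lag-1 autocorrelation:",
        np.corrcoef(residuals[:-1], residuals[1:])[0, 1])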


Use Bayesian statistics [1] :-)

[1] https://en.wikipedia.org/wiki/Bayes_factor
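
As a hedged sketch of what that can look like for an A/B conversion test (the counts below are made up, and Beta(1,1) priors are assumed), the Bayes factor comparing "one shared rate" against "two separate rates" has a closed form via Beta-Binomial marginal likelihoods:

  from scipy.special import betaln
  import numpy as np

  def log_marginal(successes, trials, a=1.0, b=1.0):
      # Log Beta-Binomial marginal likelihood with a Beta(a, b) prior;
      # the binomial coefficients cancel in the ratio, so they are omitted.
      return betaln(a + successes, b + trials - successes) - betaln(a, b)

  s_a, n_a = 120, 1000   # hypothetical conversions / visitors, variant A
  s_b, n_b = 150, 1000   # hypothetical conversions / visitors, variant B

  log_m1 = log_marginal(s_a, n_a) + log_marginal(s_b, n_b)  # H1: separate rates
  log_m0 = log_marginal(s_a + s_b, n_a + n_b)               # H0: one shared rate

  # Bayes factor > 1 favours H1 (the variants really differ), < 1 favours H0.
  print("BF10 =", np.exp(log_m1 - log_m0))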


Before interpreting A/B results, the main question that needs to be asked: "what is it that you're A/B testing?"

For too many companies, it's testing "engagement" which leads to hiding functionality (more clicks is more engagement), reducing info density (more time spent is more engagement) etc.

And coming from Netflix... I don't think there's a single person who likes that when you browse Netflix it autoplays random videos (not even trailers) with audio at full volume. But yeah, A/B tests something something. So I wish Netflix learned from their own teachings.


I've A/B tested hiding functionality and reducing info density, and it increased the number of people spending money on the site.

I was completely shocked when I first saw those results and dove deep to look for any other negative effects from these changes, but could not find any. I've repeated similar tests and the results are often similar.

From that experience, I've learned that most people are not like me or the HN crowd. The things that you complain about could actually make things easier for the majority.


I’ve A/B tested this at scale with millions of users across different verticals globally. In many situations, focusing on core functions and clear information presentation helps guide users toward using the site as intended, based on their primary intent. It’s tough for most product teams to realize that most users really don’t want all that additional functionality they built.


Yes, you're training users to deal with unnecessary complexity, and then the competitors with less complexity swoop in and take your user base.

Thankfully for businesses with built-in complexity, and unfortunately for everyone else, other businesses have the same mindset.


People may not like that feature (I sure don't), but I would bet a decent sum that the feature didn't drive increases in negative metrics like churn, and increased positive metrics like hours watched, perhaps by causing people to scroll through more of the library faster, or perhaps by drawing people in with the previews. Or maybe they just saw an improvement in a non-core metric like "distance scrolled" with no other negative effects and said, "meh, ship it". Both seem likely.

Of course this is the danger of any sort of behavioral metric driven optimization strategy - you may trade negative customer sentiment for positive business outcomes. That's where the real decision making comes about, i.e. are you willing to make that trade? It seems that in this case, Netflix was.


(disclaimer: I work for Netflix. Edit: I should clarify that I wasn't involved with this article in any way)

You can disable the behavior you mentioned. Go to your profile settings, and under "Playback Settings" you can uncheck "Autoplay previews while browsing on all devices".


> You can disable the behavior you mentioned.

I know I can. But then I can't play trailers for movies/shows where I want to see the trailer.


I am interested to see what they will be testing in some of the upcoming posts in this series. It would be fun to be scrolling Netflix and have the transparency to know that I'm seeing the 'B' test.


It depends on the area of Netflix, but for most recommendations that fit on a list, Netflix seems to use interleaved testing:

* they create two recommendation lists based on distinct principles or parameters;

* they pick which list goes first (with an A/B test);

* they mix the two with ‘schoolyard rules’: each side picks its next candidate that wasn't picked already, creating a new, alternating list;

* they show that to the user and track whether you picked an even or odd item on that list, indicating whether A or B is preferable.

So most likely, you are seeing both A and B.
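
For illustration only, here's a rough sketch of that ‘schoolyard rules’ merge (essentially team-draft interleaving; the ranker lists and titles are made up, not Netflix's actual implementation):

  def interleave(list_a, list_b, a_picks_first):
      """Rankers alternate turns; each adds its highest-ranked title that
      hasn't been placed yet. credit records which ranker supplied each slot."""
      merged, credit, seen = [], [], set()
      turn_a, ia, ib = a_picks_first, 0, 0
      while ia < len(list_a) or ib < len(list_b):
          source, idx = (list_a, ia) if turn_a else (list_b, ib)
          while idx < len(source) and source[idx] in seen:
              idx += 1                  # skip titles the other side already placed
          if idx < len(source):
              merged.append(source[idx])
              credit.append("A" if turn_a else "B")
              seen.add(source[idx])
              idx += 1
          if turn_a:
              ia = idx
          else:
              ib = idx
          turn_a = not turn_a
      return merged, credit

  ranker_a = ["Stranger Things", "Dark", "Ozark", "Narcos"]         # made-up lists
  ranker_b = ["Dark", "The Crown", "Stranger Things", "Mindhunter"]
  merged, credit = interleave(ranker_a, ranker_b, a_picks_first=True)
  print(list(zip(merged, credit)))
  # A play/click on a title credited to "A" counts as a win for ranker A,
  # which roughly corresponds to the odd/even position tracking described above.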


As with all controlled experiments, though, the experimenter wants to hide that information from the subject (the user in this instance) to measure how they respond to the change itself, rather than to the change plus being told about it.



