I have mixed feelings about using multi-armed bandits for product testing like this. Regret minimization makes complete sense as a framework if you are testing a large inventory of things - i.e. the classic examples of showing ads or recommendations - since there might be some real opportunity cost in not showing some of the things in inventory (particularly if the inventory has a shelf life). (I'm also quite surprised they don't use Thompson sampling...)
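For reference, this is roughly what Thompson sampling on a two-arm Bernoulli bandit looks like - a minimal sketch with made-up counts, obviously not their actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conversion counts for two arms (control, treatment)
successes = np.array([42, 51])
failures = np.array([958, 949])

def choose_arm():
    # Thompson sampling: draw one conversion-rate sample per arm from its
    # Beta posterior and play the arm with the highest draw.
    samples = rng.beta(1 + successes, 1 + failures)
    return int(np.argmax(samples))

def update(arm, converted):
    # Posterior update after observing the outcome of the chosen arm.
    if converted:
        successes[arm] += 1
    else:
        failures[arm] += 1
```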
For testing product features, though, I feel like there is often a high long-term cost to the dev team, and the regret from showing users a non-optimal treatment during the experiment is pretty minimal (the regret is usually, to first order, only the cost of experimental bandwidth).
The team cost comes in several subtle forms:
- in practice, bandits encourage lots of small experiments, which leave behind a graveyard of code with a large surface area - you can mitigate this by having strict stopping points for bandit experiments
- bandits have higher statistical power, but also a higher false-positive rate; false positives can be quite costly since they cause thrash and require investigation time when a feature that tested well does poorly in production
- you are introducing novelty effects at different points in time as the dynamic allocation adds new sample groups; probably nbd for most experiments, but it's complicated to correct for if your feature does have novelty effects
- there are often cyclical time-dependent changes in the composition of users being exposed (daytime vs nighttime, weekday vs weekend, geography bc of timezone differences); also probably nbd for most experiments, but it requires complex stratification to correct for if it is an issue
I would also say that the majority of product changes have small but measurable effects on metrics, so I'm not sure that bandits help all that much in those cases. If there are runaway successes or failures, early stopping techniques seem like a better way to free up resources - early stopping policies can be tuned to address the experiment design problems above fairly simply.
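To be concrete about the kind of early stopping I mean: keep the fixed 50/50 allocation, but periodically check whether one arm is overwhelmingly likely to be better and kill the experiment if so. A minimal sketch, assuming Beta(1, 1) priors and a made-up 0.99 threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    # Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return float((b > a).mean())

def should_stop_early(conv_a, n_a, conv_b, n_b, threshold=0.99):
    # Stop only on a runaway winner or loser; otherwise run to the planned
    # fixed horizon as usual.
    p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
    return p > threshold or p < 1 - threshold
```

(Repeated peeking inflates the false-positive rate, so the threshold has to be tuned for how often you check - that's the 'policies can be tuned' part.)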
Again, this is all for product testing. I think for recommendations and personalization, contextual bandits make lots of sense.
> bandits have higher statistical power, but also a higher false-positive rate; false positives can be quite costly since they cause thrash and require investigation time when a feature that tested well does poorly in production
Not sure what you mean by this. Higher false-positive rate compared to what? And given that bandits do not run for a predefined amount of time but converge at a rate proportional to the evidence (as opposed to your typical A/B test), a higher rate measured at which point in time?
Perhaps you mean that, because bandits typically run longer, there's a higher chance that they'll select an alternative that offers only a marginal improvement on the status quo whereas short experiments would just say "nah, no evidence that one is better than the other" and thereby get rid of a lot of noise?
Thank you - you're right, and I conflated two things that are conceptually very different.
For a given number of experiments and block of time (i.e. available samples over time), it's not useful to say that bandits have higher power / a worse FPR, bc those values are adjustable. F1 or AUC would probably be the right way to compare, and it seems likely to me that bandits have better-performing precision-recall curves. Basically, this is irrelevant to my point, and if anything it favors bandits.
I was totally thinking about the scenario you mentioned, where the number of experiments is unconstrained and old experiments run long. Bandits will spend a lot of their bandwidth on very marginal improvements that are below the effect-size cutoff that a shorter fixed-horizon RCT would set. I think you can fix this with early stopping (or just stopping), so maybe it's not really an issue after all.
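The kind of stopping rule I have in mind here is a simple futility check: once the upper confidence bound on the lift falls below the smallest effect you'd bother shipping, stop and reclaim the traffic. A rough sketch with made-up cutoffs, using a normal approximation for the difference of two proportions:

```python
import math

def stop_for_futility(conv_a, n_a, conv_b, n_b, min_lift=0.002, z=1.64):
    # Stop a long-running experiment once the one-sided upper confidence
    # bound on the absolute lift falls below the smallest effect worth
    # shipping (min_lift and z are made-up values to tune per team).
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_b - p_a) + z * se < min_lift
```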
Setting a good objective function is pretty hard. In this context of consumer goods, it is at the intersection of three difficult problems:
- equivalent to incentivising salespeople, which is known to be very difficult, as short-term incentives are often in opposition to long-term ones
- distinguishing and dealing with spammers, robots and crawlers
- and setting up stable reinforcement-learning behaviour even in the short term, which is tough even without the first two problems
For these reasons, business partners, designers, and others will naturally be very curious about how the bandit affects the customer experience.
Many years ago, to solve this, I built a system that would emit a list of (suboptimal) rules to exploit the opportunities learnt from small A/B test groups (like an epsilon-greedy contextual bandit). These rules were reviewed by the relevant stakeholders and then explicitly deployed to production as a configuration change, which allowed for manual consideration of issues in the three areas above that are hard to automate.
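Roughly the shape of it, with hypothetical segment and variant names since the real ones don't matter: learn per-segment win rates from the small test groups, mostly exploit with a bit of exploration, and emit the greedy per-segment choice as human-readable rules rather than acting on it automatically.

```python
import random
from collections import defaultdict

EPSILON = 0.1  # fraction of traffic held out for exploration

# (segment, variant) -> [conversions, impressions], filled in from the
# small A/B test groups
stats = defaultdict(lambda: [0, 0])

def rate(segment, variant):
    conv, n = stats[(segment, variant)]
    return conv / n if n else 0.0

def choose_variant(segment, variants):
    # Epsilon-greedy: mostly exploit the best-known variant for this
    # segment, occasionally explore a random one.
    if random.random() < EPSILON:
        return random.choice(variants)
    return max(variants, key=lambda v: rate(segment, v))

def emit_rules(segments, variants):
    # Instead of deploying the learnt policy directly, dump the greedy
    # per-segment choice as human-readable rules for stakeholders to review
    # and ship as a config change.
    rules = []
    for s in segments:
        best = max(variants, key=lambda v: rate(s, v))
        rules.append(f"if segment == '{s}': show '{best}'")
    return rules
```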
Producing a set of impactful decision boundaries as functions and then manually curating those functions reminds me of how much work maintaining rule-based systems can be. Moreover, so much time is spent on figuring out which rules might be helpful in the first place - this is partly what makes rule-based systems traditionally brittle (it takes far longer to evolve the rule-based system than to work around the rules).
I really like the idea of models producing functions over values, thanks for sharing that insight.
Stitchfix blog posts are always very smart, with a lot of equations, but last time there was an article about the company on HN the comments were saying things like 'I have explicitly told them not to send me white shirts and they keep doing it'.