The pitfalls of A/B testing in social networks (okcupid.com)
124 points by kiyanwang on Oct 16, 2017 | hide | past | favorite | 18 comments


Many moons ago, OKC had a sort of clever & irreverent blog for stuff like this. It was a great read. There were some classics like "never pay for online dating[1]" and "The case for older women."[2] The first made a convincing case that the cheesy business model (see below for irony) of most dating sites doomed them to suckness. The second made a hilariously quantified (and also convincing) appeal for changing your age preference settings, while actually talking through real social norms and culture.

They managed to deal with touchy dating stereotypes, race, and sex. Reading it, you could tell that they were genuinely smart (as opposed to just sophisticated) in their data/analytical mindset. You also got a feel for how the author thinks and how that mindset translates into making OKC.

Anyway, this reads like a LinkedIn article. Safe, vanilla company blog post. Big contrast. I wonder if the OKC guys writing that old blog 10 years ago knew how unique their position was.

[1] Got taken down after they got bought by the target of the post. Accessible here: http://static.izs.me/why-you-should-never-pay-for-online-dat...

[2]https://theblog.okcupid.com/the-case-for-an-older-woman-99d8...


The cofounder of OKC and author of these posts for a long time wrote a book in the vein of the blog: https://www.amazon.com/Dataclysm-Identity-What-Online-Offlin...


Nice. Ordered


I read somewhere that they had just one guy doing all the data analysis, and after they got acquired he or she was just too busy with other work. Thus those types of posts just disappeared.


This is actually a hard thing to manage at some companies. Writers often don't have access to or knowledge of the data, and engineers or data scientists often have bigger fish to fry beyond some content marketing explorations.

If anyone has found a good way to strike that balance and create really solid in-depth data-driven content for their org, I'd love to learn more.


Makes sense. A founder is more likely to feel comfortable enough to take that tone, pick fights and take some risks.

Still, I imagine there's an underlying shift in company character that's being expressed. The old blog felt like it was interesting to the writer.

This is in no way a condemnation of the post. I hope I haven't insulted the author.


Discontinuing future work is one thing.

Censoring past published works is quite another.


Semi-related aside: several of the original OKC founders are now at Keybase, and I feel like the blog culture you recall fondly is part of why Keybase articles get such high regard here on HN, beyond just the coolness of the tech they are working on.


For [2], the analysis is a bit off. For the min/max age settings, there's not all that much difference between men and women for max age - for example, a 30yo has a max partner age of 36 regardless of gender, and a 45yo has a max partner age of 50 regardless of gender. The difference in settings is in the min setting - and neither men nor women have the same range below as above (men quite a lot more below, women a moderate amount less). Men never have an equal settings range above and below, and women don't hit an equal settings range until their 40s.

When it comes to actual messaging activity, both graphs show a 'belly' of preferring to message younger partners (except near the origin); it's just that the men's one is considerably more pronounced. Both genders are sending a lot of messages to people younger than their own minimum settings, too.

In general I liked the OKC data articles, but this particular one always seemed to me like they were forcing a story onto the data.


I'm currently in the process of getting my PhD on this exact topic, and I'm always happy to hear about the problem of network interference in A/B tests being tackled at more and more tech companies. Here's a small (incomplete) list of (mainly industrial) references to solutions to this problem.

- Facebook data scientists have an extensive list of publications on the topic: [1][2]. (Dean Eckles is now at MIT).

- LinkedIn's experimentation team has also published papers on the topic: [3][4]

- Data scientists at Google have also worked on this topic. One external publication/collaboration I'm aware of is: [5]

[1] http://www.pnas.org/content/113/27/7316

[2] https://arxiv.org/abs/1404.7530

[3] http://www.kdd.org/kdd2017/papers/view/detecting-network-eff...

[4] https://dl.acm.org/citation.cfm?doid=2783258.2788602

[5] http://proceedings.mlr.press/v51/basse16b.pdf

Disclaimer: this list is heavily biased by experiences I've had through collaborations/internships. I am an author on [3]. Edit: formatting.
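For anyone who hasn't read the papers above: a common mitigation for network interference is graph cluster randomization, where whole clusters of connected users are assigned to the same arm so that most of a user's neighbors share their treatment. Here is a toy, stdlib-only sketch of the idea (using connected components as "clusters"; the papers use more sophisticated partitioning):

```python
import random
from collections import deque

def connected_clusters(graph):
    """Toy clustering: each connected component is one cluster.
    graph: dict mapping user -> set of friends (undirected)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                      # BFS over the friend graph
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(graph[u] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

def cluster_randomize(graph, p=0.5, seed=0):
    """Assign whole clusters to treatment (True) or control (False),
    so friends usually land in the same arm, reducing interference."""
    rng = random.Random(seed)
    assignment = {}
    for cluster in connected_clusters(graph):
        arm = rng.random() < p
        for user in cluster:
            assignment[user] = arm
    return assignment

# Two separate friend groups; each group is randomized as a unit.
friends = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "x": {"y"}, "y": {"x"},
}
arms = cluster_randomize(friends)
assert arms["a"] == arms["b"] == arms["c"]  # same cluster, same arm
assert arms["x"] == arms["y"]
```

With per-user randomization, a treated user's friends can land in control and contaminate it; randomizing at the cluster level trades some statistical power for a less biased estimate of the treatment effect.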


> In conclusion: A/B testing is conceptually simple, but can be difficult to execute if your product involves anything you'd consider "social interaction".

A/B testing is really hard, even when there is no social interaction internal to your application (like a game). The easiest thing to test is marketing for new users. But even that can be tricky, where one group of people might click fewer ads but are higher quality. This is especially troublesome when you can't tell how many friends they invite through word of mouth. And then if you want to implement other features or fix bugs over the duration of the test, that also poisons the results.

And then the number of incoming users you need is massive.
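To put a rough number on "massive": the textbook two-proportion power calculation (a standard normal-approximation formula, not anything from the article) shows how many users per arm you need to detect a small lift:

```python
from math import sqrt, ceil

def users_per_arm(p_base, lift, ):
    """Approximate users needed in EACH arm to detect an absolute
    `lift` over baseline conversion rate `p_base`, at a two-sided
    5% significance level with 80% power."""
    z_alpha = 1.96  # two-sided alpha = 0.05
    z_beta = 0.84   # power = 0.80
    p1, p2 = p_base, p_base + lift
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / lift ** 2
    return ceil(n)

# Detecting a 5.0% -> 5.5% conversion lift already takes tens of
# thousands of users per arm.
n = users_per_arm(0.05, 0.005)
```

A 10% relative lift on a 5% baseline needs roughly 30k users per arm here, which is why small products rarely reach significance on anything but huge effects.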

Doing a good A/B test that doesn't mislead you is extremely difficult, almost to the point that, for most companies, I don't think it is worth doing.

> So... I guess my final recommendation is that you should hire some data scientists that like doing experiments.

Like this guy says, you basically need a data science team to do it effectively. If you've got a small team (like I work with), don't waste your time with it. You're better off just adding new features and fixing bugs.


> But even that can be tricky, where one group of people might click fewer ads but are higher quality

This is why I always pull performance metrics in addition to impressions/clicks when running A/B tests. Not being in the media/creative department means I often don't see the creatives I'm asked to analyze, so occasionally there are creatives that resonate more strongly with people who have a higher propensity to convert. This is itself a learning, and could lead to using that sort of creative more for acquisition-focused campaigns, rotating it out of the upper-funnel ad rotation, and coming up with a new upper-funnel creative to test for the purpose of building large audience pools.
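The "fewer clicks but higher quality" pattern is easy to see once you compute per-variant rates beyond CTR. A minimal sketch with purely synthetic numbers (not real campaign data):

```python
# Synthetic per-creative totals: variant B gets fewer clicks but
# converts better, so click-through rate alone would mislead.
stats = {
    "A": {"impressions": 100_000, "clicks": 2_000, "conversions": 40},
    "B": {"impressions": 100_000, "clicks": 1_200, "conversions": 60},
}

for name, s in stats.items():
    ctr = s["clicks"] / s["impressions"]           # click-through rate
    cvr = s["conversions"] / s["clicks"]           # conversions per click
    cpi = s["conversions"] / s["impressions"]      # conversions per impression
    print(f"{name}: CTR={ctr:.2%}  CVR={cvr:.2%}  conv/imp={cpi:.3%}")
```

Here A "wins" on CTR while B delivers more conversions per impression, which is the metric that actually pays the bills.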

> This is especially troublesome when you can't tell how many friends they invite through word of mouth.

Could you provide an incentive (gold/credit/whatever) for each new active user referred, which could feed into the evaluation criteria of the test as part of the "value" a particular conversion brought with it? Then I suppose it turns more into an LTV study.


IMO it comes down to two truths for me:

- it's easy to push the wrong people down the funnel

- use quantitative observation to inform your qualitative decisions


Do the test, but don't get married to the results. Every test, even the best-run one, is evidence in the decision-making process, not proof.
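"Evidence, not proof" can be made concrete: a significance test only yields a p-value, a graded strength of evidence, never a certainty. A textbook two-proportion z-test sketch (standard formula, synthetic numbers):

```python
from math import sqrt, erfc

def two_prop_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for equality of two conversion rates.
    Returns (z, p_value); a small p-value is strong evidence of a
    real difference, not proof of one."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return z, p_value

# 4.0% vs 4.6% conversion on 10k users per arm: significant at 5%,
# but a 1-in-30 fluke rate is still very much a fluke rate.
z, p = two_prop_z_test(400, 10_000, 460, 10_000)
```

Even at p < 0.05, one test in twenty with no real effect will "win"; repeated tests and consistent direction are what turn evidence into confidence.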


I think OkCupid must have fallen victim to getting bought out and watered down. They've been removing features and making bizarre UI decisions for months now and people are not happy.

Many threads like this one:

https://www.reddit.com/r/OkCupid/comments/75r2ay/seriously_o...


> I think OkCupid must have fallen victim to getting bought out and watered down

I'm betting that they run tests for everything and change in whatever direction the results tell them. If people like swipe-left/right and a simple UI (no walls of text, no quizzes, etc.), so be it.


The A/B testing example of the 500-word limit is a great example of what Soros calls reflexivity in his books on markets.


Here's an idea: don't trust your users, and use your intuition.



