Recipes for A/B Testing

Overview

Define priors

  • Priors describe what you believe will happen before you see any data
  • When expressed as a $\mathrm{Beta}(a,b)$ distribution, the prior describes our uncertainty about the probability of an event occurring
    • One option for $a, b$ is to use historical counts of positive ($a$) and negative ($b$) events
    • You can also solve for $a, b$ (written as $\alpha, \beta$ below) to match a desired mean and variance of the distribution:
$$ \begin{align} \alpha &= - \frac{\mu (\sigma^2 + \mu^2 - \mu)}{\sigma^2} \\ \beta &= \frac{(\sigma^2 + \mu^2 - \mu) (\mu - 1)}{\sigma^2}. \end{align} $$
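As a quick check, here is a small helper (the `mean_var_to_beta` name is just for illustration) that inverts those formulas to recover $(\alpha, \beta)$ from a target mean and variance:
from scipy.stats import beta

def mean_var_to_beta(mu, var):
    """Solve for (alpha, beta) given a desired mean and variance (requires var < mu * (1 - mu))."""
    k = mu * (1 - mu) / var - 1           # common factor in the closed-form solution above
    return mu * k, (1 - mu) * k

a, b = mean_var_to_beta(0.2, 0.001)       # roughly "likely around 20%"
beta(a, b).mean(), beta(a, b).var()       # recovers (0.2, 0.001)
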
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import beta

# A few example priors, from completely uninformative to very confident
pairs = {'Anything possible': (1, 1),
         'Maybe 50%': (10, 10),
         'Likely 50%': (100, 100),
         'Almost certainly 50%': (1000, 1000),
         'Maybe 20%': (2, 8),
         'Likely 20%': (20, 80),
         'Almost certainly 20%': (200, 800)}

x = np.linspace(0, 1, 1000)
df = pd.DataFrame({f'{name}: ({a=}, {b=})': beta(a, b).pdf(x) for name, (a, b) in pairs.items()}, index=x)
fig = px.line(df, x=df.index, y=df.columns)

Here is an example where we believe our success-to-failure ratio is 4:16

success_prior = 4
failure_prior = 16

prior = beta(success_prior, failure_prior)

x = np.linspace(0, 1, 1000)
df = pd.DataFrame({'prior': prior.pdf(x)}, index=x)
fig = px.line(df, x=df.index, y=df.columns)

Get experimental data

  • Now, go and collect data.
  • A simulation is provided below as an example
n = 200
# Simulate three variants, each with a true conversion rate drawn uniformly from [0.15, 0.25]
experiments = {name: np.random.rand(n) < np.random.uniform(0.15, 0.25) for name in ['A', 'B', 'C']}
metrics = {name: {'success': results.sum(),
                  'failure': (~results).sum()} for name, results in experiments.items()}
{'A': {'success': 32, 'failure': 168},
 'B': {'success': 53, 'failure': 147},
 'C': {'success': 49, 'failure': 151}}

Calculate posterior

  • The posterior of the beta distribution describes an "update" of your beliefs given new data.
  • Because the beta prior is conjugate to the binomial likelihood, the posterior is simply $\mathrm{Beta}(a + \text{successes},\ b + \text{failures})$
  • One way to think of this is as a new snapshot of your beliefs given the data
posteriors = {name: beta(success_prior + metrics[name]['success'],
                         failure_prior + metrics[name]['failure']) for name in experiments}

x = np.linspace(0, 1, 1000)
df = df.assign(**{f'{name}_posterior': posterior.pdf(x) for name, posterior in posteriors.items()})
fig = px.line(df, x=df.index, y=df.columns)

Validate with simulation (MC)

  • Since the posteriors are beta distributed, we can't easily take their difference analytically: the difference of two beta random variables is not itself beta distributed
  • We can instead sample from both posteriors and get a Monte Carlo estimate of how often one variant beats the other (a pseudo p-value)
n = 100_000

simulations = {name: posterior.rvs(n) for name, posterior in posteriors.items()}
  • We can also visualize how often variant $B$ was better than variant $A$
result = (simulations['B'] / simulations['A'])
fig = px.histogram(result)
fig.add_vline(x=1, line_color='red', line_width=3, line_dash='dash');
  • And compute a pseudo p-value for how often one variant beats another
pseudo_pvalue = {f'{name1}_gt_{name2}': (s1 > s2).sum() / len(s1)
                        for name1, s1 in simulations.items()
                        for name2, s2 in simulations.items() if name1 != name2}

pseudo_pvalue['B_gt_A']
0.9932
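  • The same samples also give a credible interval for the relative lift of $B$ over $A$, for example:
# 95% credible interval for the relative lift B/A, using the ratio samples from above
np.percentile(result, [2.5, 97.5])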

Multiple Comparisons (A/B/n testing)

  • Standard statistical guarantees do not hold under multiple comparisons
  • Repeatedly testing new variants against an existing control amounts to p-hacking (as the number of comparisons increases, the false discovery rate increases)
  • Multi-armed bandits are one alternative that shifts traffic toward winning variants adaptively (see the sketch at the end of this section)
df = pd.read_csv('https://raw.githubusercontent.com/alenyeh1014/DataAnalytics-AB_Testing/master/DataFiles/ab_data.csv')
df.head(3)
df2 = df.pivot_table(index='group', columns='converted', aggfunc='count', values='user_id')
df2.sum(axis=1)
group
control      147202
treatment    147276
dtype: int64
(df2.T / df2.sum(axis=1)).T
converted         0         1
group
control    0.879601  0.120399
treatment  0.881080  0.118920
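
As a rough sketch of the multi-armed bandit idea mentioned above, Thompson sampling keeps a beta posterior per variant and routes each user to the variant with the highest sampled rate (the true conversion rates below are made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
true_rates = {'A': 0.12, 'B': 0.14, 'C': 0.11}            # hidden from the algorithm
state = {name: {'success': 0, 'failure': 0} for name in true_rates}

for _ in range(10_000):
    # Sample a rate from each variant's Beta(1 + successes, 1 + failures) posterior
    draws = {name: rng.beta(1 + s['success'], 1 + s['failure']) for name, s in state.items()}
    pick = max(draws, key=draws.get)                       # serve the variant with the highest draw
    converted = rng.random() < true_rates[pick]            # observe a simulated outcome
    state[pick]['success' if converted else 'failure'] += 1

state  # traffic concentrates on the best variant as evidence accumulates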

Synthetic Control Groups

  • In some cases, it's not reasonable to run a direct A/B test. For example, you might not want to show two different customers different prices for the same item.
  • Instead, cluster groups of correlated items (e.g. substitutes) together, and change the price for one item in that group.
  • Check for a significant effect by comparing against the counterfactual (the other items in the group); a sketch follows below.
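A minimal sketch of the idea, using a made-up `sales` table with one column per item and a simulated price change to `item_0` at week 52:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
weeks = 104
demand = rng.normal(100, 5, weeks).cumsum() / 50                    # shared demand trend
sales = pd.DataFrame({f'item_{i}': demand + rng.normal(0, 1, weeks) for i in range(4)})
sales.loc[52:, 'item_0'] += 3                                       # simulated effect of the price change

pre, post = sales.iloc[:52], sales.iloc[52:]
controls = [c for c in sales.columns if c != 'item_0']

# Fit weights so the control items reproduce item_0's sales in the pre-period
w, *_ = np.linalg.lstsq(pre[controls].values, pre['item_0'].values, rcond=None)

# Counterfactual: what item_0 would have sold without the price change
counterfactual = post[controls].values @ w
(post['item_0'].values - counterfactual).mean()                     # close to the simulated +3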

How long do I run my A/B test for?

  • A/B testing is an $\epsilon$-first strategy in reinforcement learning. See that section.
  • If taking a Bayesian approach, you can stop whenever you want as long as you have a strong prior.
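For example, one could monitor the posterior probability that $B$ beats $A$ as data arrives and stop once it clears a threshold (the daily traffic and true rates below are made up):
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
success_prior, failure_prior = 4, 16            # the strong prior from earlier
counts = {'A': [0, 0], 'B': [0, 0]}             # running [successes, failures]

for day in range(1, 61):
    for name, rate in [('A', 0.18), ('B', 0.21)]:
        outcomes = rng.random(500) < rate       # 500 made-up visitors per variant per day
        counts[name][0] += outcomes.sum()
        counts[name][1] += (~outcomes).sum()
    samples = {name: beta(success_prior + s, failure_prior + f).rvs(50_000)
               for name, (s, f) in counts.items()}
    if (samples['B'] > samples['A']).mean() > 0.95:
        print(f"Stop on day {day}: P(B > A) > 0.95")
        break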

Multiple Testing Problem

  • As the number of tests increases, the false discovery rate increases.
  • Therefore, you shouldn't use the same significance level $\alpha$ when testing multiple times; a correction sketch follows below.
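For example, a Bonferroni correction compares each p-value against $\alpha$ divided by the number of tests, while Benjamini-Hochberg controls the false discovery rate directly (the p-values below are made up):
import numpy as np

alpha = 0.05
p_values = np.array([0.04, 0.01, 0.03, 0.20])     # made-up p-values from 4 variant tests

# Bonferroni: controls the family-wise error rate
bonferroni_reject = p_values < alpha / len(p_values)

# Benjamini-Hochberg: reject the k smallest p-values, where k is the largest rank i
# such that p_(i) <= alpha * i / m
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= alpha * np.arange(1, len(ranked) + 1) / len(ranked)
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(len(p_values), dtype=bool)
bh_reject[order[:k]] = True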

Primacy and Novelty Effects

  • Some users are reluctant to adopt a change (primacy effect), while others try it simply because it's new (novelty effect)
  • These effects don't last long. People will either adopt the change, or move on to the next thing.
  • To resolve:
    • Test on users who can't experience those effects (e.g. new users)
    • Time lag users in the same experiment to see if the primacy or novelty effects wear off (see the sketch below)
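A sketch of the time-lag check, using a made-up table of conversions by days since a user first saw the change:
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 20_000
days = rng.integers(0, 28, n)                               # days since first exposure
novelty_boost = 0.04 * np.exp(-days / 5)                    # a boost that decays over ~a week
exposure = pd.DataFrame({'days_since_exposure': days,
                         'converted': rng.random(n) < 0.12 + novelty_boost})

# If the conversion rate drifts back down as days_since_exposure grows,
# the initial lift was (at least partly) a novelty effect
exposure.groupby('days_since_exposure')['converted'].mean()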

Interference between variants

  • Some products have network effects (i.e. as network size increases, value increases; Metcalfe's law)
    • If your test showed effect size $a$, you would expect the true effect size $b > a$ because the network value has increased.
  • Some products have limited resources (e.g. a fixed supply of Uber drivers: as the number of drivers receiving the treatment increases, the value per driver decreases)

    • If your test showed effect size $a$, you would expect the true effect size $b < a$ at scale because each user gets fractionally fewer resources
  • To control for interference, segment users (see the sketch after this list):

    • Location-based (e.g. Toronto vs. New York). May suffer from issues because of unique markets
    • Time-based (e.g. 7-8, 8-9). May suffer from issues because of unique properties at different times of the day. Best to select close-by times, or randomize times.
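A sketch of location-based segmentation with a made-up list of cities; every user in a market gets the same variant, which limits interference within that market:
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
cities = ['Toronto', 'New York', 'Chicago', 'Austin', 'Seattle', 'Denver']

# Randomize at the market level rather than the user level
assignment = pd.Series(rng.choice(['control', 'treatment'], size=len(cities)), index=cities)

users = pd.DataFrame({'user_id': np.arange(1000), 'city': rng.choice(cities, size=1000)})
users['variant'] = users['city'].map(assignment)
Note that the effective sample size is now the number of markets rather than the number of users, so the comparison should be made at the market level.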