Recipes for A/B Testing
Define priors
- Priors describe what you believe will happen before you observe any data
- When expressed as a $\beta(a,b)$ distribution, the prior is a probability distribution that describes our uncertainty about the probability of an event occurring
- One option is to set $a$ and $b$ to the historical counts of positive ($a$) and negative ($b$) events
- You can also select $a,b$ to match a desired mean and variance of the distribution, as seen below:
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import beta

# Each (a, b) pair encodes a belief: the mean is a / (a + b), and larger
# counts express more certainty (a tighter distribution)
pairs = {'Anything possible': (1, 1),
         'Maybe 50%': (10, 10),
         'Likely 50%': (100, 100),
         'Almost certainly 50%': (1000, 1000),
         'Maybe 20%': (2, 8),
         'Likely 20%': (20, 80),
         'Almost certainly 20%': (200, 800)}
x = np.linspace(0, 1, 1000)
df = pd.DataFrame({f'{name}: ({a=}, {b=})': beta(a, b).pdf(x) for name, (a, b) in pairs.items()}, index=x)
fig = px.line(df, x=df.index, y=df.columns)
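To go the other way, from a target mean and variance to $(a, b)$, you can invert the $\beta$ moment formulas. A minimal method-of-moments sketch (the helper name is made up for illustration):
# Solve mean = a / (a + b) and var = a*b / ((a + b)^2 * (a + b + 1)) for a, b
def beta_from_moments(mean, var):
    nu = mean * (1 - mean) / var - 1   # nu = a + b
    return mean * nu, (1 - mean) * nu

beta_from_moments(0.2, 0.01)  # ~(3.0, 12.0)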
Here is an example where we believe our success-to-failure ratio is 4:16 (a prior mean of 20%):
success_prior = 4
failure_prior = 16
prior = beta(success_prior, failure_prior)
x = np.linspace(0, 1, 1000)
df = pd.DataFrame({'prior': prior.pdf(x)}, index=x)
fig = px.line(df, x=df.index, y=df.columns)
Simulate three experiments, each with an unknown true conversion rate between 15% and 25%, and update the prior with the observed successes and failures:
n = 200
experiments = {name: np.random.rand(n) < np.random.uniform(0.15, 0.25) for name in ['A', 'B', 'C']}
metrics = {name: {'success': results.sum(),
                  'failure': (~results).sum()} for name, results in experiments.items()}
posteriors = {name: beta(success_prior + metrics[name]['success'],
                         failure_prior + metrics[name]['failure']) for name in experiments}
x = np.linspace(0, 1, 1000)
df = df.assign(**{f'{name}_posterior': posterior.pdf(x) for name, posterior in posteriors.items()})
fig = px.line(df, x=df.index, y=df.columns)
Draw a large number of samples from each posterior to compare the variants by simulation:
n = 100_000
simulations = {name: posterior.rvs(n) for name, posterior in posteriors.items()}
- We can also visualize how often variant $B$ was better than variant $A$
result = (simulations['B'] / simulations['A'])
fig = px.histogram(result)
fig.add_vline(x=1, line_color='red', line_width=3, line_dash='dash');
- And get a pseudo-p-value for how often one variant beats another
pseudo_pvalue = {f'{name1}_gt_{name2}': (s1 > s2).sum() / len(s1)
for name1, s1 in simulations.items()
for name2, s2 in simulations.items() if name1 != name2}
pseudo_pvalue['B_gt_A']
The same recipe works on real data. Load a public A/B test dataset and compute the conversion counts and rates per group:
df = pd.read_csv('https://raw.githubusercontent.com/alenyeh1014/DataAnalytics-AB_Testing/master/DataFiles/ab_data.csv')
df.head(3)
df2 = df.pivot_table(index='group', columns='converted', aggfunc='count', values='user_id')
df2.sum(axis=1)
(df2.T / df2.sum(axis=1)).T  # conversion rate per group
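The same posterior update applies to these counts. A minimal sketch, assuming a flat $\beta(1,1)$ prior (the prior choice is an assumption, not part of the dataset):
# Column 1 of df2 holds converted counts, column 0 non-converted counts
posteriors = {g: beta(1 + df2.loc[g, 1], 1 + df2.loc[g, 0]) for g in df2.index}
samples = {g: p.rvs(100_000) for g, p in posteriors.items()}
(samples['treatment'] > samples['control']).mean()  # P(treatment beats control)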
Synthetic Control Groups
- In some cases, it's not reasonable to run a direct A/B test. For example, you might not want to show two different customers different prices for the same item.
- Instead, cluster correlated items (e.g. substitutes) together, and change the price for one item in the cluster.
- Check for a significant effect by comparing the treated item to its counterfactual (the other items in the cluster), as sketched below.
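A minimal sketch of the idea on simulated data; the horizon, effect size, and fitting method (least squares on the pre-change period) are all assumptions:
rng = np.random.default_rng(0)
T_pre, T_post = 60, 30                      # days before/after the price change
substitutes = rng.normal(100, 5, size=(T_pre + T_post, 3))  # untreated items
treated = substitutes.mean(axis=1) + rng.normal(0, 2, T_pre + T_post)
treated[T_pre:] += 8                        # simulated effect of the price change

# Fit weights on the pre-change period only, then predict the counterfactual
w, *_ = np.linalg.lstsq(substitutes[:T_pre], treated[:T_pre], rcond=None)
counterfactual = substitutes @ w
effect = (treated[T_pre:] - counterfactual[T_pre:]).mean()  # ~8 if recovered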
Primacy and Novelty Effects
- Some people are reluctant to change (primacy) or inclined to change (novelty)
- These effects don't last long. People will either adopt the change, or move on to the next thing.
- To resolve:
  - Test on users who can't experience those effects (e.g. new users)
  - Time-lag users in the same experiment to see if the primacy or novelty effects wear off (see the sketch below)
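A hypothetical illustration of the time-lag check (the decay curve and rates are made up, not from any experiment): plot the weekly lift; a novelty effect shows up as lift that decays toward zero.
weeks = np.arange(1, 9)
control = np.full(weeks.size, 0.20)            # steady baseline conversion
treatment = 0.20 + 0.05 * np.exp(-weeks / 2)   # novelty lift fading week by week
df_lag = pd.DataFrame({'week': weeks, 'lift': treatment - control})
fig = px.line(df_lag, x='week', y='lift')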
Interference between variants
- Some products have network effects (i.e. as network size increases, value increases; Metcalfe's law)
- If your test showed effect size $a$, you would expect the true effect size $b > a$ at scale, because a full rollout grows the network (and therefore its value) further
- Some products have limited resources (e.g. a fixed supply of Uber drivers: as the number of drivers in the treatment group increases, the value per driver decreases)
- If your test showed effect size $a$, you would expect the true effect size $b < a$ at scale, because each user gets fractionally fewer resources
- To control for interference, segment users:
  - Location-based (e.g. Toronto vs. New York). May be confounded by differences between markets
  - Time-based (e.g. 7-8 vs. 8-9). May be confounded by time-of-day effects; best to select close-by times, or randomize times (see the sketch below)
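A minimal sketch of randomizing times (the hourly granularity is an assumption): assign each time slot to a variant so nearby times see both treatments.
rng = np.random.default_rng(0)
slots = [f'{h:02d}:00-{h + 1:02d}:00' for h in range(24)]
assignment = {slot: rng.choice(['control', 'treatment']) for slot in slots}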