Recipes for A/B Testing

Overview

Define priors

  • Priors describe what you believe will happen before you see any data
  • When expressed as a $\mathrm{Beta}(a,b)$ distribution, the prior describes our uncertainty about the probability of an event occurring
    • One option for $a, b$ is to use historical counts of positive ($a$) and negative ($b$) events
    • You can also solve for $a, b$ (written as $\alpha, \beta$ below) to match a desired mean and variance of the distribution:
$$ \begin{align} \alpha &= - \frac{\mu (\sigma^2 + \mu^2 - \mu)}{\sigma^2} \\ \beta &= \frac{(\sigma^2 + \mu^2 - \mu) (\mu - 1)}{\sigma^2}. \end{align} $$
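As a quick check, here is a small helper (the `mean_var_to_beta` name is just for illustration) that inverts those formulas to recover $(\alpha, \beta)$ from a target mean and variance:
from scipy.stats import beta

def mean_var_to_beta(mu, var):
    """Solve for (alpha, beta) given a desired mean and variance (requires var < mu * (1 - mu))."""
    k = mu * (1 - mu) / var - 1           # common factor in the closed-form solution above
    return mu * k, (1 - mu) * k

a, b = mean_var_to_beta(0.2, 0.001)       # roughly "likely around 20%"
beta(a, b).mean(), beta(a, b).var()       # recovers (0.2, 0.001)
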
import numpy as np
import pandas as pd
import plotly.express as px
from scipy.stats import beta

# A few example priors, from completely uninformative to very confident
pairs = {'Anything possible': (1, 1),
         'Maybe 50%': (10, 10),
         'Likely 50%': (100, 100),
         'Almost certainly 50%': (1000, 1000),
         'Maybe 20%': (2, 8),
         'Likely 20%': (20, 80),
         'Almost certainly 20%': (200, 800)}

x = np.linspace(0, 1, 1000)
df = pd.DataFrame({f'{name}: ({a=}, {b=})': beta(a, b).pdf(x) for name, (a, b) in pairs.items()}, index=x)
fig = px.line(df, x=df.index, y=df.columns)

Here is an example where we believe our success-to-failure ratio is 4:16

success_prior = 4
failure_prior = 16

prior = beta(success_prior, failure_prior)

x = np.linspace(0, 1, 1000)
df = pd.DataFrame({'prior': prior.pdf(x)}, index=x)
fig = px.line(df, x=df.index, y=df.columns)

Get experimental data

  • Now, go and collect data.
  • A simulation is provided below as an example
n = 200
# Simulate three variants, each with a true conversion rate drawn uniformly from [0.15, 0.25]
experiments = {name: np.random.rand(n) < np.random.uniform(0.15, 0.25) for name in ['A', 'B', 'C']}
metrics = {name: {'success': results.sum(),
                  'failure': (~results).sum()} for name, results in experiments.items()}
{'A': {'success': 32, 'failure': 168},
 'B': {'success': 53, 'failure': 147},
 'C': {'success': 49, 'failure': 151}}

Calculate posterior

  • The posterior of the beta distribution describes an "update" of your beliefs given new data.
  • Because the beta prior is conjugate to the binomial likelihood, the posterior is simply $\mathrm{Beta}(a + \text{successes},\ b + \text{failures})$
  • One way to think of this is as a new snapshot of your beliefs given the data
posteriors = {name: beta(success_prior + metrics[name]['success'],
                         failure_prior + metrics[name]['failure']) for name in experiments}

x = np.linspace(0, 1, 1000)
df = df.assign(**{f'{name}_posterior': posterior.pdf(x) for name, posterior in posteriors.items()})
fig = px.line(df, x=df.index, y=df.columns)

Validate with simulation (MC)

  • Since the posteriors are beta distributed, we can't easily take their difference analytically: the difference of two beta random variables is not itself beta distributed
  • We can instead sample from both posteriors and get a Monte Carlo estimate of how often one variant beats the other (a pseudo p-value)
n = 100_000

simulations = {name: posterior.rvs(n) for name, posterior in posteriors.items()}
  • We can also visualize how often variant $B$ was better than variant $A$
result = (simulations['B'] / simulations['A'])
fig = px.histogram(result)
fig.add_vline(x=1, line_color='red', line_width=3, line_dash='dash');
  • And compute a pseudo p-value for how often one variant beats another
pseudo_pvalue = {f'{name1}_gt_{name2}': (s1 > s2).sum() / len(s1)
                        for name1, s1 in simulations.items()
                        for name2, s2 in simulations.items() if name1 != name2}

pseudo_pvalue['B_gt_A']
0.9932
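  • The same samples also give a credible interval for the relative lift of $B$ over $A$, for example:
# 95% credible interval for the relative lift B/A, using the ratio samples from above
np.percentile(result, [2.5, 97.5])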

Multiple Comparisons (A/B/n testing)

  • Standard statistical guarantees do not hold under multiple comparisons
  • Repeatedly testing new variants against an existing control amounts to p-hacking (as the number of comparisons increases, the false discovery rate increases)
  • Multi-armed bandits are one alternative that shifts traffic toward winning variants adaptively (see the sketch at the end of this section)
df = pd.read_csv('https://raw.githubusercontent.com/alenyeh1014/DataAnalytics-AB_Testing/master/DataFiles/ab_data.csv')
df.head(3)
df2 = df.pivot_table(index='group', columns='converted', aggfunc='count', values='user_id')
df2.sum(axis=1)
group
control      147202
treatment    147276
dtype: int64
(df2.T / df2.sum(axis=1)).T
converted         0         1
group
control    0.879601  0.120399
treatment  0.881080  0.118920
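
As a rough sketch of the multi-armed bandit idea mentioned above, Thompson sampling keeps a beta posterior per variant and routes each user to the variant with the highest sampled rate (the true conversion rates below are made up for illustration):
import numpy as np

rng = np.random.default_rng(0)
true_rates = {'A': 0.12, 'B': 0.14, 'C': 0.11}            # hidden from the algorithm
state = {name: {'success': 0, 'failure': 0} for name in true_rates}

for _ in range(10_000):
    # Sample a rate from each variant's Beta(1 + successes, 1 + failures) posterior
    draws = {name: rng.beta(1 + s['success'], 1 + s['failure']) for name, s in state.items()}
    pick = max(draws, key=draws.get)                       # serve the variant with the highest draw
    converted = rng.random() < true_rates[pick]            # observe a simulated outcome
    state[pick]['success' if converted else 'failure'] += 1

state  # traffic concentrates on the best variant as evidence accumulates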

Synthetic Control Groups

  • In some cases, it's not reasonable to run a direct A/B test. For example, you might not want to show two different customers different prices for the same item.
  • Instead, cluster groups of correlated items (e.g. substitutes) together, and change the price for one item in that group.
  • Check for a significant effect by comparing against the counterfactual (the other items in the group); a sketch follows below.
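A minimal sketch of the idea, using a made-up `sales` table with one column per item and a simulated price change to `item_0` at week 52:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
weeks = 104
demand = rng.normal(100, 5, weeks).cumsum() / 50                    # shared demand trend
sales = pd.DataFrame({f'item_{i}': demand + rng.normal(0, 1, weeks) for i in range(4)})
sales.loc[52:, 'item_0'] += 3                                       # simulated effect of the price change

pre, post = sales.iloc[:52], sales.iloc[52:]
controls = [c for c in sales.columns if c != 'item_0']

# Fit weights so the control items reproduce item_0's sales in the pre-period
w, *_ = np.linalg.lstsq(pre[controls].values, pre['item_0'].values, rcond=None)

# Counterfactual: what item_0 would have sold without the price change
counterfactual = post[controls].values @ w
(post['item_0'].values - counterfactual).mean()                     # close to the simulated +3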

How long do I run my A/B test for?

  • A/B testing is an $\epsilon$-first strategy in reinforcement learning. See that section.
  • If taking a Bayesian approach, you can stop whenever you want as long as you have a strong prior.
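For example, one could monitor the posterior probability that $B$ beats $A$ as data arrives and stop once it clears a threshold (the daily traffic and true rates below are made up):
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
success_prior, failure_prior = 4, 16            # the strong prior from earlier
counts = {'A': [0, 0], 'B': [0, 0]}             # running [successes, failures]

for day in range(1, 61):
    for name, rate in [('A', 0.18), ('B', 0.21)]:
        outcomes = rng.random(500) < rate       # 500 made-up visitors per variant per day
        counts[name][0] += outcomes.sum()
        counts[name][1] += (~outcomes).sum()
    samples = {name: beta(success_prior + s, failure_prior + f).rvs(50_000)
               for name, (s, f) in counts.items()}
    if (samples['B'] > samples['A']).mean() > 0.95:
        print(f"Stop on day {day}: P(B > A) > 0.95")
        break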

Multiple Testing Problem

  • As the number of tests increases, the false discovery rate increases.
  • Therefore, you shouldn't use the same significance level $\alpha$ when testing multiple times; a correction sketch follows below.
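For example, a Bonferroni correction compares each p-value against $\alpha$ divided by the number of tests, while Benjamini-Hochberg controls the false discovery rate directly (the p-values below are made up):
import numpy as np

alpha = 0.05
p_values = np.array([0.04, 0.01, 0.03, 0.20])     # made-up p-values from 4 variant tests

# Bonferroni: controls the family-wise error rate
bonferroni_reject = p_values < alpha / len(p_values)

# Benjamini-Hochberg: reject the k smallest p-values, where k is the largest rank i
# such that p_(i) <= alpha * i / m
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= alpha * np.arange(1, len(ranked) + 1) / len(ranked)
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(len(p_values), dtype=bool)
bh_reject[order[:k]] = True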

Primacy and Novelty Effects

  • Some users are reluctant to adopt a change (primacy effect), while others try it simply because it's new (novelty effect)
  • These effects don't last long. People will either adopt the change, or move on to the next thing.
  • To resolve:
    • Test on users who can't experience those effects (e.g. new users)
    • Time lag users in the same experiment to see if the primacy or novelty effects wear off (see the sketch below)
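A sketch of the time-lag check, using a made-up table of conversions by days since a user first saw the change:
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 20_000
days = rng.integers(0, 28, n)                               # days since first exposure
novelty_boost = 0.04 * np.exp(-days / 5)                    # a boost that decays over ~a week
exposure = pd.DataFrame({'days_since_exposure': days,
                         'converted': rng.random(n) < 0.12 + novelty_boost})

# If the conversion rate drifts back down as days_since_exposure grows,
# the initial lift was (at least partly) a novelty effect
exposure.groupby('days_since_exposure')['converted'].mean()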

Interference between variants

  • Some products have network effects (i.e. as network size increases, value increases; Metcalfe's law)
    • If your test showed effect size $a$, you would expect the true effect size $b > a$ because the network value has increased.
  • Some products have limited resources (e.g. a fixed supply of Uber drivers: as the number of drivers receiving the treatment increases, the value per driver decreases)

    • If your test showed effect size $a$, you would expect the true effect size $b < a$ at scale because each user gets fractionally fewer resources
  • To control for interference, segment users (see the sketch after this list):

    • Location-based (e.g. Toronto vs. New York). May suffer from issues because of unique markets
    • Time-based (e.g. 7-8, 8-9). May suffer from issues because of unique properties at different times of the day. Best to select close-by times, or randomize times.
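A sketch of location-based segmentation with a made-up list of cities; every user in a market gets the same variant, which limits interference within that market:
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
cities = ['Toronto', 'New York', 'Chicago', 'Austin', 'Seattle', 'Denver']

# Randomize at the market level rather than the user level
assignment = pd.Series(rng.choice(['control', 'treatment'], size=len(cities)), index=cities)

users = pd.DataFrame({'user_id': np.arange(1000), 'city': rng.choice(cities, size=1000)})
users['variant'] = users['city'].map(assignment)
Note that the effective sample size is now the number of markets rather than the number of users, so the comparison should be made at the market level.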