Statistics behind A/B Testing

A/B testing is one of the primary tools in any data-driven environment. It lets you compare several alternate versions of the same reality simultaneously and see which produces the best outcome.
It is essentially a test of the effectiveness of different designs, run to find the optimal solution. These tests are usually performed on live systems with real users who are completely unaware of the test. A/B testing is a way of conducting an experiment in which you compare the performance of a control group against one or more test groups by randomly assigning each group a specific single-variable treatment.
Why not test A for a while then B?
Just look at any graph of your outcome over time. Some months can be 30% better than the previous month, then it gets worse again, then better, and so on. Outcomes are affected by the season of the year, sources of users, news events, the state of the economy, competitor activity… You'll see big differences and big swings even with no changes at all. So if you tried A in a good month and then tried B in a bad month, you could make an incorrect decision. With A/B testing you test the two versions at the same time, in the same season, with similar users. The split does not have to be 50/50; it could be 90/10 or any other ratio.
Building Treatments
Once you know what you want to test, you have to create treatments to test it. One of the treatments will be the control; the other treatments will be variations on it. For example, here are some things worth testing on a website:
•    Layout - Move the registration forms around. Add fields, remove fields.
•    Headings - Add headings. Make them different colors. Change the copy.
•    Copy - Change the size, color, placement, and content of any text you have on the page.
You can have as many treatments as you want, but you get better data more quickly with fewer treatments.
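As a purely illustrative sketch, the treatments could be written down as a small configuration with one entry marked as the control; the names, fields, and traffic weights below are hypothetical:

```python
# Hypothetical treatment configuration: one control plus two variations.
# The weights define the traffic split (they do not have to be equal).
TREATMENTS = [
    {"name": "control",      "headline": "Sign up today",   "weight": 0.8},
    {"name": "blue_heading", "headline": "Sign up today",   "weight": 0.1},
    {"name": "short_form",   "headline": "Join in 30 secs", "weight": 0.1},
]

# Sanity check: the traffic split should cover the whole audience.
assert abs(sum(t["weight"] for t in TREATMENTS) - 1.0) < 1e-9
```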
Randomization
You can't just throw up one test on Friday and another test on Saturday and compare: there's no reason to believe that the outcome for users on a Friday is the same as for users on a Saturday. To be valid, trials also need to be sufficiently large. By tossing a coin 100 or 1,000 times you reduce the influence of chance, but even then you get slightly different results with each trial. Similarly, a test may show a 30% outcome on Monday, 35% on Tuesday and 28% on Wednesday. This random variation should always be the first cause considered for any change in observed results.
A/B testing solves this by running the experiment in parallel and randomly assigning a treatment to each person who visits. This controls for any time-sensitive variables and distributes the population proportionally across the treatments.
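As a minimal sketch of this idea (not a production assignment system), each visitor could be given a treatment at random according to the chosen split; the 90/10 ratio and the user IDs here are made up:

```python
import random

# Illustrative traffic split: 90% to the control (A), 10% to the variation (B).
TREATMENTS = ["A", "B"]
WEIGHTS = [0.9, 0.1]

def assign_treatment(user_id: str) -> str:
    """Randomly assign a treatment to a visitor, honouring the traffic split.

    Seeding with the user ID keeps the assignment stable if the same
    visitor returns (an optional but common refinement).
    """
    rng = random.Random(user_id)
    return rng.choices(TREATMENTS, weights=WEIGHTS, k=1)[0]

# Example: assign a handful of hypothetical visitors.
for uid in ["user-101", "user-102", "user-103"]:
    print(uid, "->", assign_treatment(uid))
```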
Maximizing your outcome is not simply a matter of making changes; it's about making the right changes, at the right time, in the right sequence, and then evaluating the outcomes before continuing the process.
You divide your audience into two groups. You expose one group to the original version of whatever you are testing. You expose the other group to an alternative version, in which only one element has been changed. Then you track the outcomes.
Here are some possibilities to get you started:
Emails: bonus gifts, coupons, messages, guarantees, opening sentence image, closing sentence image, from-field, calls to action, opening greetings, type styles, layout elements, graphic images, etc.
Web Sites: landing pages, language of copy (headings, body, calls to action, assurances), colors, location of elements, look/feel, hyperlinks, etc.
Statistical hypothesis testing
Hypothesis testing can tell you whether your A/B tests actually affect user behavior, or whether the variations you see are due to random chance. Hypothesis testing is all about quantifying our confidence, so let's get to it.
Statisticians use something called a null hypothesis to account for this possibility. The null hypothesis for the A/B test might be something like this:
- The difference in conversion between Version A and Version B is caused by random variation.
It's then the job of the trial to disprove the null hypothesis. If it does, we can adopt the alternative explanation:
- The difference in conversion between Version A and Version B is caused by the design differences between the two.
To determine whether we can reject the null hypothesis, we use statistical tests, such as Student's t-test, the χ² test and ANOVA, to calculate the likelihood that the observed variation could be caused by chance.
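For instance, a χ² test on a 2×2 table of conversions takes only a few lines with scipy.stats; the conversion counts below are invented for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical results: [converted, did not convert] for each version.
observed = [
    [200, 1800],  # Version A (control): 200 conversions out of 2,000 visitors
    [250, 1750],  # Version B (treatment): 250 conversions out of 2,000 visitors
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")

# Reject the null hypothesis at the usual 5% significance level.
if p_value < 0.05:
    print("Statistically significant: unlikely to be random variation.")
else:
    print("Not significant: the difference could easily be chance.")
```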
Statistical significance
If the arithmetic shows that the likelihood of the result being random is very small (usually below 5%), we reject the null hypothesis. In effect we’re saying “it’s very unlikely that this result is down to chance. Instead, it’s probably caused by the change we introduced” – in which case we say the results are statistically significant.
The Statistics
We need to start with a null hypothesis. In our case, the null hypothesis will be that the outcome of the control treatment is no less than the outcome of our experimental treatment. Mathematically,

$$H_0: p \le p_c$$

where p_c is the conversion rate of the control and p is the conversion rate of one of our experimental treatments. The alternative hypothesis is therefore that the experimental treatment has a higher outcome. This is what we want to see and quantify. Each visitor either "converts" or "doesn't convert," so a sampled conversion rate is a binomial proportion, and for large samples it is approximately a normally distributed random variable. Instead of checking whether it deviates too far from a fixed percentage, we want to measure whether it deviates too far from the control treatment.
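Spelling out the approximation being relied on here (a standard result, not stated explicitly in the text): for a large sample of size N, the sampled rate is approximately normal,

$$\hat{p} \sim \mathcal{N}\!\left(p,\; \frac{p(1-p)}{N}\right)$$

and likewise for the control rate with sample size N_c. The text writes the sampled rates simply as p and p_c, without hats.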
Here's an example representation of the distribution of the control outcome and the treatment outcome.
[Figure: two overlapping normal curves, one centered on the control outcome and one on the treatment outcome.]
The peak of each curve is the outcome we measure, but there's some chance the true value actually lies somewhere else on the curve. Moreover, what we're really interested in is the difference between the two outcomes. If the difference is large enough, we conclude that the treatment really did alter user behavior. So, let's define a new random variable

$$X = p - p_c$$
then our null hypothesis becomes

$$H_0: X \le 0$$
To test this hypothesis using the random variable X, we need to know the probability distribution of X.
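Since the two sampled rates are approximately normal and independent, their difference X is also approximately normal, with the mean and variance quoted in the next section:

$$X \sim \mathcal{N}\!\left(p - p_c,\; \frac{p(1-p)}{N} + \frac{p_c(1-p_c)}{N_c}\right)$$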

Z-scores and One-tailed Tests
Mathematically, the z-score for X is

$$z = \frac{p - p_c}{\sqrt{\frac{p(1-p)}{N} + \frac{p_c(1-p_c)}{N_c}}}$$

where N is the sample size of the experimental treatment and N_c is the sample size of the control treatment. This follows because the mean of X is p − p_c and its variance is the sum of the variances of p and p_c.
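As a small numerical sketch, the z-score can be computed directly from the two samples; the conversion counts here are made up:

```python
from math import sqrt

def z_score(conv_exp: int, n_exp: int, conv_ctrl: int, n_ctrl: int) -> float:
    """z-score for the difference between experimental and control conversion rates."""
    p = conv_exp / n_exp        # experimental conversion rate
    p_c = conv_ctrl / n_ctrl    # control conversion rate
    std_err = sqrt(p * (1 - p) / n_exp + p_c * (1 - p_c) / n_ctrl)
    return (p - p_c) / std_err

# Hypothetical data: 250/2000 conversions for the treatment, 200/2000 for the control.
print(round(z_score(250, 2000, 200, 2000), 3))
```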
In this case our null hypothesis is

$$H_0: X = p - p_c \le 0$$
In other words, we only care about the positive tail of the normal distribution. In this example we only reject the null hypothesis if the experimental outcome is significantly higher than the control outcome, so we have

$$\text{reject } H_0 \quad \text{if} \quad z > z_{1-\alpha} \approx 1.645 \quad (\alpha = 0.05)$$

That is, we can reject the null hypothesis with 95% confidence if the z-score is higher than the tabulated critical value.
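Tying the pieces together, a one-tailed check might look like the following sketch (again with invented counts), using scipy.stats.norm for the critical value and p-value:

```python
from math import sqrt
from scipy.stats import norm

def one_tailed_ab_test(conv_exp, n_exp, conv_ctrl, n_ctrl, alpha=0.05):
    """One-tailed z-test: is the experimental conversion rate higher than the control's?"""
    p = conv_exp / n_exp
    p_c = conv_ctrl / n_ctrl
    z = (p - p_c) / sqrt(p * (1 - p) / n_exp + p_c * (1 - p_c) / n_ctrl)
    p_value = norm.sf(z)              # probability of a z-score this large under H0
    critical = norm.ppf(1 - alpha)    # about 1.645 for alpha = 0.05
    return z, p_value, z > critical

# Hypothetical data: treatment 250/2000, control 200/2000.
z, p_value, significant = one_tailed_ab_test(250, 2000, 200, 2000)
print(f"z = {z:.3f}, one-tailed p = {p_value:.4f}, reject H0: {significant}")
```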