Probability Applied to Landing Page Testing

So how does probability apply to landing page optimization?
The random variables are the visits to your site from the traffic sources that you have selected for the test. The audience itself may be subject to sampling bias. You are counting whether or not the conversion happened as a result of the visit. You are assuming that there is some underlying and fixed probability of the conversion happening, and that the only other possible outcome is that the conversion does not happen (that is, a visit is a Bernoulli random variable that can result in conversion, or not).
As an example, let's assume that the actual conversion rate for a landing page is 2%. Hence there is a much larger chance (98%) that any particular visitor will not convert. As you can see, the sum of the two possible outcome probabilities equals exactly 1 (2% + 98% = 100%), as required.
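To make the Bernoulli model concrete, here is a minimal Python sketch of this setup; the 2% rate and the visit count are illustrative assumptions, not measurements. Each visit is simulated as a weighted coin flip that either converts or does not:

```python
import random

TRUE_CONVERSION_RATE = 0.02  # assumed true probability of conversion per visit

def simulate_visits(n_visits, p=TRUE_CONVERSION_RATE, seed=42):
    """Model each visit as a Bernoulli trial: 1 = converted, 0 = did not convert."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n_visits)]

visits = simulate_visits(10_000)
measured_rate = sum(visits) / len(visits)
print(f"Measured conversion rate over {len(visits):,} visits: {measured_rate:.2%}")
# The two outcome probabilities always sum to 1: p + (1 - p) == 1
```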
The stochastic process is the flow of visitors from the traffic sources used for the test. Key assumptions about the process are that the behavior of the visitors does not change over time, and that the population from which visitors are drawn remains the same. Unfortunately, both of these are routinely violated to a greater or lesser extent in the real world. The behavior of visitors changes due to seasonal factors, or with changing sophistication and knowledge levels about your products or industry. The population itself changes based on your current marketing mix. Most businesses are constantly adjusting and tweaking their traffic sources (e.g., by changing PPC bid prices and the resulting keyword mix that their audience arrives from). The result is that your time series, which is supposed to return a steady stream of yes or no answers (based on a fixed probability of a conversion), actually has a changing probability of conversion. In mathematical terms, your time series is nonstationary and changes its behavior over time.
The independence of the random variables in the stochastic process is also a critical theoretical requirement. However, the behavior on each visit is not necessarily independent. A person may come back to your landing page a number of times, and their current behavior would obviously be influenced by their previous visits. You might also have a bug or an overload condition where the actions of some users influence the actions that other users can take. For this reason it is best to use a fresh stream of visitors (with a minimal percentage of repeat visitors if possible) for your landing page test audience. Repeat visitors are by definition biased because they have voluntarily chosen to return to your site, and are not seeing it for the first time at random. This is also a reason to avoid using landing page testing with an audience consisting of your in-house e-mail list. The people on the list are biased because they have self-selected to receive ongoing messages from you, and because they have already been exposed to previous communications.
The event itself can also be more complicated than the simple did-the-visitor-convert determination. In an e-commerce catalog, it is important to know not only whether a sale happened, but also its value. If you were to tune only for higher conversion rate, you could achieve that by pushing low-margin and low-cost products that people are more likely to buy. But this would not necessarily result in the highest profits.
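A hypothetical illustration of this point; the traffic, order counts, and order values below are invented for the example. It compares two catalog variants on conversion rate and on revenue per visitor:

```python
# Invented numbers: variant X pushes low-cost items, variant Y keeps the original mix.
variants = {
    "X (low-price push)": {"visits": 10_000, "orders": 300, "avg_order_value": 20.0},
    "Y (original mix)":   {"visits": 10_000, "orders": 220, "avg_order_value": 45.0},
}

for name, v in variants.items():
    conversion_rate = v["orders"] / v["visits"]
    revenue_per_visitor = conversion_rate * v["avg_order_value"]
    print(f"{name}: conversion rate {conversion_rate:.1%}, "
          f"revenue per visitor ${revenue_per_visitor:.2f}")

# X wins on conversion rate (3.0% vs. 2.2%) but loses on revenue
# per visitor ($0.60 vs. $0.99), so it is not the more profitable page.
```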
Statistical Methods
Landing page testing is a form of experimental study. The environment that you are changing is the design of your landing page. The outcome that you are measuring is typically the conversion rate. Landing page testing and tuning is usually done in parallel, and not sequentially. This means that you should split your available traffic and randomly alternate the version of your landing page shown to each new visitor. A portion of your test traffic should always see the original version of the page. This will eliminate many of the problems with sequential testing.
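One common way to implement such a parallel split is to hash a visitor identifier into a variant bucket, so that new visitors are spread evenly and a returning visitor keeps seeing the same version. The sketch below is one assumed implementation, not a prescribed method:

```python
import hashlib

VARIANTS = ["original", "challenger"]  # a share of traffic always sees the original

def assign_variant(visitor_id: str) -> str:
    """Deterministically but evenly split visitors across the page versions."""
    bucket = int(hashlib.sha256(visitor_id.encode()).hexdigest(), 16) % len(VARIANTS)
    return VARIANTS[bucket]

for vid in ("visitor-001", "visitor-002", "visitor-003", "visitor-004"):
    print(vid, "->", assign_variant(vid))
```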
Observational studies, by contrast, do not involve any manipulation or changes to the environment in question. You simply gather the data and then analyze it for any interesting correlations between your independent and dependent variables.
For example, you may be running PPC marketing programs on two different search engines. You collect data for a month on the total number of clicks from each campaign and the resulting number of conversions. You can then see if the conversion rate between the two traffic sources is truly different or possibly due to chance.
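With the observed counts in hand, a two-proportion z-test is one standard way to answer that question. The click and conversion numbers below are made up for illustration:

```python
from math import sqrt, erf

# Hypothetical month of data from two PPC campaigns.
clicks_1, conversions_1 = 4_000, 120   # engine 1: 3.0% conversion rate
clicks_2, conversions_2 = 3_500, 84    # engine 2: 2.4% conversion rate

p1 = conversions_1 / clicks_1
p2 = conversions_2 / clicks_2
p_pooled = (conversions_1 + conversions_2) / (clicks_1 + clicks_2)
std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / clicks_1 + 1 / clicks_2))
z = (p1 - p2) / std_err

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# A small p-value (e.g., below 0.05) suggests the difference is unlikely
# to be due to chance alone; with these numbers it is not conclusive.
```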
Descriptive statistics only summarize or describe the data that you have observed. They do not tell you anything about the meaning or implications of your observations. Proper hypothesis testing must be done to see if differences in your data are likely to be due to random chance or are truly significant.

Have I Found Something Better?
Landing page optimization is based on statistics, and statistics is based in turn on probability theory. And probability theory is concerned with the study of random events. But a lot of people might object that the behavior of your landing page visitors is not "random." Your visitors are not as simple as the roll of a die. They visit your landing page for a reason, and act (or fail to act) based on their own internal motivations.
So what does probability mean in this context? Let's conduct a little thought experiment.
Imagine that I flip a fair coin and cover up the result after catching it in my hand. Neither of us has seen the outcome, so we would both put the probability of heads at 50%. Now imagine that I peek at the coin without letting you see it. What would you estimate the probability of it coming up heads to be? Still 50%, right? How about me? I would no longer agree with you. Having seen the outcome of the flip event, I would declare that the probability of it coming up heads is either zero or 100% (depending on what I have seen).
How can we experience the same event and come to two different conclusions? Who is correct? The answer is—both of us. We are basing our answers on different available information. Let's look at this in the context of the simplest type of landing page optimization. Let's assume that you have a constant flow of visitors to your landing page from a steady and unchanging traffic source. You decide to test two versions of your page design, and split your traffic evenly and randomly between them.
In statistical terminology, you have two stochastic processes (experiences with your landing pages), with their own random variables (visitors drawn from the same population), and their own measurable binary events (either visitors convert or they do not). The true probability of conversion for each page is not known, but must be between zero and one. This true probability of conversion is what we call the conversion rate and we assume that it is fixed.
From the law of large numbers you know that as you sample a very large number of visitors, the measured conversion rate will approach the true probability of conversion. From the Central Limit Theorem you also know that the chances of the true value falling within three standard deviations of your observed mean are very high (99.7%), and that the width of the normal distribution will continue to narrow (depending only on the amount of data that you have collected). Basically, measured conversion rates will wander within ever narrower ranges as they get closer and closer to their true respective conversion rates. By seeing the amount of overlap between the two bell curves representing the normal distributions of the conversion rates, you can determine the likelihood of one version of the page being better than the other.
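The narrowing can be seen directly in a small simulation; the 2% true rate is an assumption used only to generate the data:

```python
import random
from math import sqrt

TRUE_RATE = 0.02  # assumed true conversion rate, for illustration only
rng = random.Random(7)

for n in (1_000, 10_000, 100_000, 1_000_000):
    conversions = sum(1 for _ in range(n) if rng.random() < TRUE_RATE)
    measured = conversions / n
    # Standard error of a proportion: sqrt(p * (1 - p) / n); it shrinks as n grows.
    std_err = sqrt(measured * (1 - measured) / n)
    print(f"n={n:>9,}  measured rate={measured:.4f}  +/-3 std dev = {3 * std_err:.4f}")
```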
One of the most common questions in inferential statistics is to see if two samples are really different or if they could have been drawn from the same underlying population as a result of random chance alone. You can compare the average performance between two groups by using a t-test computation. In landing page testing, this kind of analysis would allow you to compare the difference in conversion rate between two versions of your site design. Let's suppose that your new version had a higher conversion rate than the original. The t-test would tell you if this difference was likely due to random chance or if the two were actually different.
There is a whole family of related t-test formulas based on the circumstances. The appropriate one for head-to-head landing page optimization tests is the unpaired one-tailed equal-variance t-test. The test produces a single number as its output. The higher this number is, the higher the statistical certainty that the two outcomes being measured are truly different.
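Assuming SciPy is available, the test can be run directly on per-visit outcomes (1 for a conversion, 0 otherwise). The visit and conversion counts below are invented; SciPy reports a two-tailed p-value, so it is halved here for the one-tailed question:

```python
import numpy as np
from scipy import stats

# Hypothetical data: each element is one visit (1 = converted, 0 = did not).
visits_a, conv_a = 5_000, 100   # original page
visits_b, conv_b = 5_000, 130   # new page

a = np.array([1] * conv_a + [0] * (visits_a - conv_a))
b = np.array([1] * conv_b + [0] * (visits_b - conv_b))

# Unpaired, equal-variance t-test; halve the two-tailed p-value for the
# one-tailed question "is B better than A?" (valid when the observed
# difference is in the hypothesized direction).
t_stat, p_two_tailed = stats.ttest_ind(b, a, equal_var=True)
p_one_tailed = p_two_tailed / 2
print(f"t = {t_stat:.2f}, one-tailed p = {p_one_tailed:.4f}")
```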

Collecting Insufficient Data
Early in an experiment when you have only collected a relatively small amount of data, the measured conversion rates may fluctuate wildly. If the first visitor for one of the page designs happens to convert, for instance, your measured conversion rate is 100%. It is tempting to draw conclusions during this early period, but doing so commonly leads to error. Just as you would not conclude a coin could never come up tails after seeing it come up heads just three times, you should not pick a page design before collecting enough data.
The laws of probability only guarantee the accuracy and stability of results for very large sample sizes. For smaller sample sizes, a lot of slop and uncertainty remain.
The way to deal with this is to decide on your desired confidence level ahead of time. How sure do you want to be in your answer—90%, 95%, 99%, even higher? This completely depends on your business goals and the consequences of being wrong. If a lot of money is involved, you should probably insist on higher confidence levels.
Let's consider the simplest example. You are trying to decide whether version A or B is best. You have split your traffic equally to test both options and have gotten 90 conversions on A, and 100 conversions on B. Is B really better than A? Many people would answer yes since 100 is obviously higher than 90. But the statistical reality is not so clear-cut.
Confidence in your answer can be expressed by means of a Z-score, which is easy to calculate in cases like this. The Z-score tells you how many standard deviations away from the observed mean your data is. Z=1 means that you are about 68% sure of your answer, Z=2 means about 95% sure, and Z=3 means 99.7% sure.
Pick an appropriate confidence level, and then wait to collect enough data to reach it.
Let's pick a 95% confidence level for our earlier example. This means that you want to be right 19 out of 20 times. So you will need to collect enough data to get a Z-score of 2 or more.
The calculation of the Z-score depends on the standard deviation (σ). For conversion rates that are less than 30%, this formula is fairly accurate: the standard deviation of a conversion count is approximately the square root of the number of conversions observed (σ ≈ √N).
In our example for B, the standard deviation would be √100 = 10.
So we are roughly 68% sure (Z=1) that the real value of B is between 90 and 110 (100 plus or minus 10). In other words, there is about a one-in-three chance that B's true value lies outside this range, and we may just be seeing a lucky streak for B.
Similarly, at our current data amounts, we are 95% sure (Z=2) that the real value of B is between 80 and 120 (100 plus or minus 20). So there is a good chance that the 90 conversions on A are actually better than the bottom-end estimate of 80 for B.
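The same arithmetic can be written out in a few lines of Python using the σ ≈ √N approximation described above; the 90 and 100 conversion counts are the ones from this example:

```python
from math import sqrt

conversions_a = 90    # observed conversions for version A
conversions_b = 100   # observed conversions for version B

sigma_b = sqrt(conversions_b)            # standard deviation of B's count, about 10
for z in (1, 2):                         # roughly 68% and 95% confidence
    low, high = conversions_b - z * sigma_b, conversions_b + z * sigma_b
    print(f"Z={z}: B's true value is estimated to lie between {low:.0f} and {high:.0f}")

# A's observed 90 is above the bottom of B's 95% range (80), so with this
# little data we cannot be 95% confident that B really is better than A.
```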
Confidence levels are often illustrated with a graph. The error bars on the quantity being measured represent the range of possible values (the confidence interval) at the selected confidence level. Figure 1 shows 95% confidence error bars (represented by the dashed lines) for our example. As you can see, the bottom of B's error bars falls below the top of A's error bars, so the two ranges overlap. This implies that A might actually be higher than B, despite B's apparent streak of good luck in the current sample.

 Figure 1: Confidence error bars (little data)

If we wanted to be 95% sure that B is better than A, we would need to collect much more data. In our example, this level of confidence would be reached when A had 1,350 conversions and B had 1,500 conversions. Note that even though the ratio between A and B remains the same, the standard deviations have gotten much smaller relative to the conversion counts, thus raising the Z-score. As you can see from Figure 2, the confidence error bars have now "uncrossed," so you can be 95% confident that B actually is better than A.

Figure 2: Confidence error bars (more data)