When can I stop my split test? How much traffic do I need for my A/B test? Can I trust my test data? Get the answer to these common questions and a basic introduction to test validation and determining the statistical significance of your A/B split tests.
The only thing worse than not testing is relying on bad data. In order to conduct experiments that provide real value, you have to be familiar with three basic factors: Statistical Confidence, Conversion Range, and Sample Size.
In this 10-minute video, I’ll give you a basic introduction to finding out how reliable your test data is.
Hello, I’m Michael Aagaard – thank you for watching this short video on how to determine the statistical significance of an A/B split test.
Today I’m going to go over 3 basic factors that are essential to establishing the reliability of your test results. These 3 factors are:
1. Confidence Level
2. Conversion range
3. Sample size
Test validation and statistics are some of the less sexy aspects of testing – nevertheless, they are extremely important because there really is no point in testing if you can’t rely on your test results.
The big problem for most marketers is that they either pay no attention to these 3 factors, or focus only on 1 of these factors.
But you really need to be aware of all three factors in order to perform valid experiments that provide true and lasting value to your online business.
The point of performing an A/B split test is to get answers so you can base your decisions on data rather than gut feeling and guesswork. So if you can’t rely on your data, it really defeats the purpose of performing the test in the first place.
What test validation is all about is finding out whether the tendencies you are seeing are a reliable representation of how the variant will perform – or whether the tendencies are simply random. That’s where the three basic factors I mentioned before come into the picture.
They help you determine the likelihood that e.g. A is in fact better than B.
A statistically significant test result is one that in all likelihood indicates that we have an actual winner.
Ok – so let’s look at the 3 factors one by one. We’ll start with looking at confidence level.
Statistical confidence measures how many times out of 100 the test results can be expected to be within a specified range. A confidence level of 99% means that the results will probably meet expectations 99 times out of 100.
In other words – a 99% confidence level means that there is a 1% chance that the numbers are off. And a confidence level of, let’s say, 60% means that there is a 40% chance that the numbers are off. So, if you stop a test at e.g. 60%, you willingly accept a 40% risk that the numbers are off.
Confidence level is by far the most commonly used and known factor. It is an extremely important factor, but it is in no way enough to guarantee reliable results. You need to look at the two other factors, conversion range and sample size, as well.
Let’s move on to conversion range.
Conversion Range shows you the range within which the actual conversion rate may lie.
You’ll find the conversion rate for each variant here.
The small ± sign and the number represent the standard error.
In this case, the standard error is 1%, which means that the conversion range for the control variation is 7.95% plus or minus 1%. In other words, the actual conversion rate is somewhere between 6.95% and 8.95%.
For variation 1, the conversion range is 11.08% plus or minus 1%, so the actual conversion rate is somewhere between 10.08% and 12.08%.
So, the conversion range can be described as the margin of error you’re willing to accept. The smaller the conversion range – the more accurate your results will be. As a rule of thumb – if the 2 conversion ranges overlap, you’ll need to keep testing in order to get a valid result. In this case, if we add the standard error (1%) to the lowest conversion rate (that of the control, 7.95% + 1% = 8.95%) and subtract 1% from the highest conversion rate (that of variation 1, 11.08% - 1% = 10.08%), we’ll see that the two ranges don’t overlap. So this is a good sign that variation 1 will in fact perform better than the control.
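The overlap rule of thumb is simple enough to sketch in a few lines of Python. This is only an illustration using the figures quoted in the video; the function names are made up for the example and are not part of any testing tool.

```python
def conversion_range(rate_pct, error_pct):
    """Return the (low, high) bounds of a conversion range, in percent."""
    return (rate_pct - error_pct, rate_pct + error_pct)


def ranges_overlap(range_a, range_b):
    """Two ranges overlap if each one starts before the other one ends."""
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]


# Figures quoted in the video: 7.95% +/- 1% and 11.08% +/- 1%.
control = conversion_range(7.95, 1.0)
variation_1 = conversion_range(11.08, 1.0)

print(ranges_overlap(control, variation_1))  # False: the ranges do not overlap
```

A result of `False` corresponds to the "good sign" described above; `True` would mean the test should keep running.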
Ok let’s move on to sample size.
Sample size represents the number of visitors that have been part of your test and how many conversions they have performed.
The reliability of your data increases as you increase the number of data points. In other words – the larger the sample size, the more reliable your results will be. It is pretty much common sense that the more people you include in a test – the more representative the results will be. There’s a correlation between sample size and conversion range. And as your sample size increases, your conversion range will decrease.
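To see why the conversion range shrinks as the sample grows, here is a minimal Python sketch using the textbook standard error of a proportion, sqrt(p(1 - p) / n). The visitor and conversion counts are hypothetical (a steady 8% conversion rate), and the testing tool shown in the video may calculate its ± figure somewhat differently.

```python
from math import sqrt


def standard_error_pct(conversions, visitors):
    """Textbook standard error of a conversion rate, in percentage points."""
    rate = conversions / visitors
    return 100 * sqrt(rate * (1 - rate) / visitors)


# Hypothetical samples, all converting at 8%.
for conversions, visitors in [(8, 100), (80, 1_000), (800, 10_000)]:
    error = standard_error_pct(conversions, visitors)
    print(f"{visitors:>6} visitors: 8% +/- {error:.2f}%")
# Prints +/- 2.71%, then 0.86%, then 0.27%
```

Under this formula, each tenfold increase in visitors shrinks the range by a factor of about three, which is why narrowing the range takes progressively more traffic.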
Here’s an example of a test with a small sample size of 73 visits and 8 conversions. Here you’ll see that the conversion range for the control is 5.88% plus or minus 5%, and 15.38% plus or minus 7% for variation 1. This means that the actual conversion rate for the control is somewhere between 0.88% and 10.88% – for variation 1 it is somewhere between 8.38% and 22.38%.
It doesn’t take a rocket scientist to see that these ranges overlap quite a bit, so you would need a larger sample size in order to get reliable results. Concluding anything at this point involves quite a risk. But what often happens is that marketers get overexcited about results like these, jump to conclusions, and assume that they have a winner, when in fact all they have is a 91% chance that the conversion ranges for the individual variations are accurate.
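The 91% figure can be roughly reproduced under some assumptions. The rates 5.88% and 15.38% are consistent with 2 conversions out of 34 visits for the control and 6 out of 39 for variation 1, which adds up to the 73 visits and 8 conversions mentioned above, but that split is inferred and not stated in the video. The Python sketch below uses a normal approximation with unpooled standard errors; the testing tool may well compute its confidence differently, and the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_b_beats_a(visitors_a, conversions_a, visitors_b, conversions_b):
    """Approximate one-sided confidence that variation B converts better than A."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Standard error of the difference between the two rates (unpooled).
    se_diff = sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    z_score = (rate_b - rate_a) / se_diff
    return NormalDist().cdf(z_score)


# Assumed split: 2 of 34 for the control, 6 of 39 for variation 1.
confidence = confidence_b_beats_a(34, 2, 39, 6)
print(round(confidence, 2))  # roughly 0.91
```

With only 8 conversions in total, the normal approximation itself is shaky, which is one more reason not to read much into a number like this.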
So – how large a sample do you need in order to achieve significance? Well, in theory you can’t define that number. It depends completely on the individual test. But as a rule of thumb, you can say that the bigger the difference in performance is between the 2 variations – the smaller a sample size you will need in order to get a reliable result. And vice versa. So with a dramatic difference in performance, you’ll need a smaller sample, and with a minor difference in performance, you’ll need a larger sample.
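The relationship between the size of the difference and the sample you need can be illustrated with a standard sample-size approximation for comparing two conversion rates. This is a sketch, not something taken from the video: the 95% confidence and 80% power defaults are assumptions chosen for the example, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(rate_a, rate_b, confidence=0.95, power=0.80):
    """Rough visitors needed per variation to detect rate_a vs rate_b.

    Uses the normal-approximation formula for two proportions.
    The confidence and power defaults are assumptions, not from the video.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (rate_a - rate_b) ** 2)


# A dramatic difference (7.95% vs 11.08%, the rates from the earlier example)
# versus a hypothetical minor one (8% vs 9%).
print(visitors_per_variant(0.0795, 0.1108))
print(visitors_per_variant(0.08, 0.09))
```

The second call returns a much larger number than the first, which matches the rule of thumb: the smaller the difference, the larger the sample.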
In my experience, a lot can happen within the first 100 conversions. So my rule of thumb is to get at least 100 conversions – conversions not visits – before I conclude anything.
Also, a great tip when you’re trying to validate your test results is to look at a graph that depicts the development of the test. If you see a lot of fluctuations or diamond shapes where the variations cross each other – that’s a sign that you need a larger sample (or that there might not be a significant difference between the variants).
On the other hand, if you see a nice clear tendency that one variant is outperforming the other, that’s a great indication that your results are reliable and that you’ll find an actual winner.
Be aware that fluctuations are natural in the beginning of a test period. When the sample size is small – small changes will have a large impact.
Ok so let’s do a quick summary and get some guidelines here.
– Get as close to 99% statistical significance as possible
– Sample size of at least 100 conversions
– Conversion Range of <±1%
– Look for fluctuations (diamond shapes)
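The first three guidelines can be turned into a small checklist, sketched here in Python. The function name and the exact thresholds are a strict reading of the list above and should be treated as assumptions: "as close to 99% as possible" is interpreted as a 99% cutoff, and the 100 conversions are simply whatever count you pass in. The fourth guideline, looking for fluctuations in the graph, still needs a human eye.

```python
def guideline_warnings(confidence, conversions, range_pct, min_confidence=0.99):
    """Return a list of guideline violations; an empty list means all three pass.

    confidence  -- statistical confidence as a fraction, e.g. 0.91
    conversions -- number of conversions recorded in the test
    range_pct   -- the +/- conversion range in percentage points, e.g. 1.0
    """
    warnings = []
    if confidence < min_confidence:
        warnings.append("confidence is below the target level")
    if conversions < 100:
        warnings.append("fewer than 100 conversions")
    if range_pct >= 1.0:
        warnings.append("conversion range is not below +/-1%")
    return warnings


# The small-sample example from earlier: 91% confidence, 8 conversions, +/-5%.
print(guideline_warnings(0.91, 8, 5.0))
```

For the small-sample example, all three warnings come back, which is the "keep testing" signal.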
If you are aware of these factors and use these guidelines, you will with certainty get more reliable and valuable test results.
But the best tip I can give is: “Don’t jump the gun” and get excited over premature test results.
I hear marketers complain that their testing tools are off or don’t work, but in most cases it’s not the testing tool that’s the problem – it’s the person interpreting the test data. Like with so many other things – the tool is only as good as the person using it.
Ok cool – now that you are familiar with the 3 basic elements and how to determine the statistical significance of your test results, it’s time to get cracking on some more experiments.
Thanks for watching and see you next time!