Aug 2, 2018
6 mins read
Very few statistical concepts have infiltrated the business/marketing world as much as A/B testing. Mention a hypothesis test to an executive and you'll likely get a blank stare. But recommend doing an A/B test and the reaction is usually mild recognition, almost as if to say "that's a word I know, go!" Of course, an A/B test is a type of hypothesis test, but it's probably not worth your time educating your CEO about that fact.
A/B testing is a great tool and enjoys broad recognition. That said, despite being common, it isn't always used correctly, and it's sometimes used by folks who don't fully grasp the underlying statistics. Once I was talking with an email marketer about his A/B testing. His strategy was to send two versions of an email to an equal number of recipients. His A/B test was watching to see which version got to 100 opens first. They had it set up like a horse race: their email tool would output the number of opens for version A and version B, the team could watch the progress of both, and whichever one got to 100 opens first was the winner.
As we discussed his method, I asked what happened if A was at 99 opens when B got to 100, and he just kind of looked at me, shrugged his shoulders, and quoted Dale Earnhardt with “Second place is just the first place loser.”
Now, don't get me wrong, it always comes down to context. If his race to 100 was being used to figure out which email to send to the next 5000 addresses, it's probably not a huge deal. If they were planning to peg a quarter-million-dollar marketing campaign to that result, then getting a valid outcome is a little more important.
Back to the matter at hand…Let's face it, in general people don't 'get' statistical inference. Most have a hard time using sample data to formulate a reasoned and statistically accurate statement about a population. And that's exactly what we're doing with an A/B test: we're sampling a portion of our population to figure out which option will be better received by the total group.
I’m not claiming to be a statistician. I know just enough about stats to be dangerous. I do try hard to use data in a responsible way so that I don’t embarrass myself in front of my friends who do know stats.
Let's start with the example above, where one email reaches 100 opens while the other is at 99. I'm going to assume that both versions of the email were sent to 1000 addresses. Running a two-sample proportion test with base R's prop.test, not surprisingly, the p-value is 1.
What does the p-value mean? Here, we need to briefly review the concept of hypothesis testing. In a hypothesis test there are two competing hypotheses: the null and the alternative. In this case, the null hypothesis assumes that there is no difference between the two trials. The p-value is the probability of observing a difference at least as large as the one we saw, assuming the null hypothesis is true. (It is not, as is often claimed, the probability that the null hypothesis is true.)
To reject the null in favor of the alternative hypothesis, that the two email open rates are statistically different, we need to see a small p-value, conventionally below 0.05.
library(pwr)

prop.test(c(100, 99),    # Success outcomes
          c(1000, 1000)) # Number of trials
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  c(100, 99) out of c(1000, 1000)
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.0262371  0.0282371
## sample estimates:
## prop 1 prop 2
##  0.100  0.099
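For contrast, here's what a decisive result looks like. The counts below are made up for illustration: if version A had been at 150 opens when B hit 99, the same test would tell a very different story.

    # Hypothetical counts, not from the real campaign: a 150-vs-99 split
    # out of 1000 sends each yields a p-value well below 0.05
    prop.test(c(150, 99),    # Success outcomes
              c(1000, 1000)) # Number of trials

With a gap that large, we could reject the null hypothesis and declare a winner with some confidence.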
To correctly run a test, one should first calculate the required sample size by doing a power calculation. This is easily done in R using the pwr library, which requires a few parameters: the desired significance level (the false-positive rate), the desired statistical power (1 minus the false-negative rate), the minimum detectable effect, and the baseline conversion rate cr_a.
mde   <- 0.1   # minimum detectable effect
cr_a  <- 0.25  # the expected conversion rate for group A
alpha <- 0.05  # the false positive rate
power <- 0.80  # 1 - false negative rate

ptpt <- pwr.2p.test(h = ES.h(p1 = cr_a, p2 = (1 + mde) * cr_a),
                    sig.level = alpha,
                    power = power)
n_obs <- ceiling(ptpt$n)
ptpt
##
##      Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
##               h = 0.05683344
##               n = 4859.916
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
##
## NOTE: same sample sizes
This result tells us that we need to observe 4860 subjects in each of the A and B test groups if we want to detect a 10% relative difference in their conversion rates (0.25 vs. 0.275). Once we have observed that quantity, we can calculate whether there is a statistically significant difference between the two sets of observations via a two-sample proportion test, just as above.
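A sketch of what that final comparison might look like. The conversion counts here are made up for illustration:

    # Suppose that after showing each variant to n_obs (4860) subjects we observed:
    conv_a <- 1215  # hypothetical conversions for A (~25%)
    conv_b <- 1336  # hypothetical conversions for B (~27.5%)

    prop.test(c(conv_a, conv_b),  # Success outcomes
              c(n_obs, n_obs))    # Number of trials

If the resulting p-value comes in under our alpha of 0.05, we reject the null hypothesis and conclude the two conversion rates really do differ.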
Given the parameters we included in our power calculation, there are two things to be aware of:
- There is a 5% chance that our test will report a statistically significant difference when, in fact, there isn't one (a false positive). That is a result of our alpha parameter, which sets a false-positive rate of 5%.
- There is a 20% chance that the test will report no difference when there actually was a difference (a false negative). This is the false-negative rate (or 1 - power), and is commonly referred to as beta.
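To see how these two knobs trade off against sample size, we can rerun the power calculation with a stricter requirement. This reuses the same effect size as above; only the power changes:

    # Same minimum detectable effect, but demand 90% power (beta = 0.10)
    pwr.2p.test(h = ES.h(p1 = 0.25, p2 = 0.275),
                sig.level = 0.05,
                power = 0.90)

With these inputs, the required sample size jumps from about 4,860 to roughly 6,500 per group: fewer missed effects cost more observations.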