I'm trying to run an A/B test comparing revenue among variants on our websites.
Our standard approach (using t-tests on conversion rates) didn't seem like it would work, because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random

def resampler(original_array, number_of_samples):
    sample_array = np.zeros(number_of_samples)
    choice = random.choice
    for i in range(number_of_samples):
        # One bootstrap replicate: resample with replacement, then sum
        sample_array[i] = sum(choice(original_array) for _ in range(len(original_array)))
    statistic, p_value = stats.normaltest(sample_array)
    if p_value < 0.001:
        # The sums still fail the normality test, so double the number
        # of bootstrap samples and return that result instead
        print(statistic, p_value)
        return resampler(original_array, number_of_samples * 2)
    return sample_array
Basically: sample with replacement from the 'revenue vector' (a sparsely populated vector, with a zero for every non-converting visitor), sum each resample, and keep going until the distribution of sums looks normal.
I can perform this for both test groups, at which point I've got two (approximately) normally distributed quantities to t-test. Using scipy.stats.ttest_ind, I was able to get results that looked somewhat reasonable.
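For example (the revenue vectors below are invented purely for illustration; the real ones come from our logs):

import numpy as np
import scipy.stats as stats

# Invented per-visitor revenue vectors: zeros for non-converters,
# gamma-distributed order values for the converters.
revenue_a = np.concatenate([np.zeros(9800), np.random.gamma(2.0, 25.0, 200)])
revenue_b = np.concatenate([np.zeros(9750), np.random.gamma(2.0, 27.0, 250)])

# resampler() as defined above
sums_a = resampler(revenue_a, 1000)
sums_b = resampler(revenue_b, 1000)

print(stats.ttest_ind(sums_a, sums_b))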
However, I wondered what the effect of running this procedure on the cookie split would be (I'd expect each group to see 50% of the cookies). Here I saw something fairly unexpected. Given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print(stats.ttest_ind(x, y))
I get the output: (0.0021911476165975929, 0.99827342714956546)
That's a t-statistic of about 0.002 with a p-value of about 0.998, so not at all significant (I think I'm interpreting that correctly?).
However, when I run this code:
t_value_array = []
p_value_array = []

for i in range(1000, 100000, 5000):
    one_array = resampler(x, i)
    two_array = resampler(y, i)
    t_value, p_value = stats.ttest_ind(one_array, two_array)
    t_value_array.append(t_value)
    p_value_array.append(p_value)

print(np.mean(t_value_array))
print(np.mean(p_value_array))
I get: 0.642213492773 0.490587258892
I'm not really sure how to interpret these numbers. As far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the arrays represents a different site), and in each case I've run a t-test on the two distributions to get a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran the tests multiple times because single runs showed so much variation in the p-value and t-statistic.
Am I missing an obvious way to run this kind of test?
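For reference, my (possibly wrong) understanding of a more direct approach is to bootstrap the difference in means itself, rather than manufacturing normal distributions to feed into a t-test. Something like this sketch, where n_boot and the centring trick are my guesses at the standard recipe:

import numpy as np

def bootstrap_mean_diff(a, b, n_boot=10000, seed=0):
    # Bootstrap test of the null hypothesis mean(a) == mean(b)
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement; record the mean difference
        diffs[i] = (rng.choice(a, size=len(a), replace=True).mean()
                    - rng.choice(b, size=len(b), replace=True).mean())

    # Centre the bootstrap distribution on zero to approximate the null,
    # then count how often a difference at least as extreme as the
    # observed one appears (two-sided)
    centred = diffs - diffs.mean()
    p_value = np.mean(np.abs(centred) >= abs(observed))
    return observed, p_value

Is something along those lines what I should be doing instead?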
Cheers,
Matt
P.S. The data we have:

Website 1 : test group 1 : unique cookies : revenue
Website 1 : test group 2 : unique cookies : revenue
Website 2 : test group 1 : unique cookies : revenue
Website 2 : test group 2 : unique cookies : revenue
etc.
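In Python terms, something like this (the cookie counts are the real ones from the arrays above; the revenue figures are invented for illustration):

data = {
    "website_1": {
        "group_1": {"unique_cookies": 272898, "revenue": 13204.50},  # revenue invented
        "group_2": {"unique_cookies": 274253, "revenue": 12881.10},  # revenue invented
    },
    "website_2": {
        "group_1": {"unique_cookies": 389076, "revenue": 20117.75},  # revenue invented
        "group_2": {"unique_cookies": 387941, "revenue": 19804.30},  # revenue invented
    },
    # ... and so on for the remaining websites
}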
What we'd like:
Test group x is beating test group y with z% certainty
(the null hypothesis being test group 1 = test group 2)
Bonus:
The same as above, but on a per-site basis as well as overall