I'm trying to run an A/B test - comparing revenue between test variants across our websites.

Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:

import numpy as np
import scipy.stats as stats
import random

def resampler(original_array, number_of_samples):
    # Bootstrap the sum of original_array: draw len(original_array) values
    # with replacement and sum them, number_of_samples times.
    sample_array = np.zeros(number_of_samples)
    choice = random.choice
    for i in range(number_of_samples):
        sample_array[i] = sum(choice(original_array) for _ in range(len(original_array)))

    # normaltest's null hypothesis is that the sample is normal, so a small
    # p-value rejects normality; in that case draw more bootstrap sums.
    statistic, p_value = stats.normaltest(sample_array)
    if p_value < 0.001:
        print(statistic, p_value)
        return resampler(original_array, number_of_samples * 2)
    return sample_array

Basically: repeatedly sample with replacement from the 'revenue vector' (a sparsely populated vector, with a zero for every non-converting visitor), sum each resample, and keep drawing more sums until the resulting distribution passes a normality test.

I can run this for both test groups, at which point I've got two (approximately) normally distributed quantities to t-test. Using scipy.stats.ttest_ind I was able to get results that looked somewhat reasonable.
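
For reference, the comparison step then boils down to something like the following standalone sketch (it uses numpy's own resampling rather than the resampler above, and the two revenue vectors are made-up placeholders):

import numpy as np
import scipy.stats as stats

# Placeholder per-visitor revenue vectors - zeros for non-converting visitors.
group_a_revenue = [0, 0, 12.5, 0, 3.0, 0, 0, 8.0, 0, 0]
group_b_revenue = [0, 4.0, 0, 0, 0, 9.5, 0, 0, 0, 2.0]

rng = np.random.default_rng(0)
n_boot = 10000

# Bootstrap the per-group sums: each entry is the sum of one resample.
a_sums = [rng.choice(group_a_revenue, size=len(group_a_revenue)).sum() for _ in range(n_boot)]
b_sums = [rng.choice(group_b_revenue, size=len(group_b_revenue)).sum() for _ in range(n_boot)]

print(stats.ttest_ind(a_sums, b_sums))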

However, I wondered what the effect of running this procedure on the cookie split would be (we expect each group to see 50% of the cookies). Here I saw something fairly unexpected. Given the following code:

x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print(stats.ttest_ind(x, y))

I get the output: (0.0021911476165975929, 0.99827342714956546)

Not at all significant (I think I'm interpreting that correctly?)

However, when I run this code:

t_value_array = []
p_value_array = []
for i in range(1000, 100000, 5000):
    one_array = resampler(x, i)
    two_array = resampler(y, i)
    t_value, p_value = stats.ttest_ind(one_array, two_array)
    t_value_array.append(t_value)
    p_value_array.append(p_value)

print(np.mean(t_value_array))
print(np.mean(p_value_array))

I get: 0.642213492773 0.490587258892

I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.

Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.

Am I missing an obvious way to run this kind of test?

Cheers,

Matt

P.S.

The data we have:

Website 1 : test group 1 : unique cookies : revenue
Website 1 : test group 2 : unique cookies : revenue
Website 2 : test group 1 : unique cookies : revenue
Website 2 : test group 2 : unique cookies : revenue
etc.

What we'd like:

Test group x is beating test group y with z% certainty

(null hypothesis of test group 1 = test group 2)

Bonus:

The same as above, but on a per-site as well as an overall basis.

Kali_89

1 Answer

Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
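
For the conversion-rate part, a minimal sketch of that logistic regression looks like this (using statsmodels; the group and converted arrays are made-up illustrations, where converted marks whether a visitor generated any revenue):

import numpy as np
import statsmodels.api as sm

# Made-up data: group is 0 for method A, 1 for method B; converted is 1
# if the visitor generated any revenue, 0 otherwise.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
converted = np.array([0, 1, 0, 1, 0, 1, 1, 1])

X = sm.add_constant(group)            # intercept plus group indicator
model = sm.Logit(converted, X).fit()
print(model.summary())                # the p-value on the group coefficient
                                      # tests whether conversion rates differ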

On to your question. It's very hard to read that code and understand what you think you're testing: what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense), it looks pretty confused.

I'm going to have to guess what the data look like. You have a bunch of samples like this:

Website   Method     Revenue
-------   ------     -------
w1        A          12
w2        B          0
w3        A          6
w4        B          0

etc. Does this look correct? Do you have repeated measures, i.e. a revenue measurement for each website under each method, or did you randomly assign websites to methods? I'm guessing that what you're passing to your function is the array of all revenues for one method at a time, but do they pair up across methods in any way?

I can imagine testing various hypotheses with this data. For example: is method A more likely to generate non-zero revenue than method B (use logistic regression; the response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (a t-test on the non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality once you include the zeros)? A sketch of the latter two follows below.

I assume this troubling normality assumption is why you run the procedure of repeatedly resampling until your data look normal, but you can't do this and still test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only that part of it! In fact, I wouldn't be surprised if what this essentially does is exclude either most of the zero entries or most of the non-zero entries.
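
A minimal sketch of those last two tests, assuming the revenues pair up per website (the rev_a and rev_b arrays are made-up illustrative numbers):

import numpy as np
import scipy.stats as stats

# Made-up per-website revenues under the two methods.
rev_a = np.array([12.0, 0.0, 6.0, 30.0, 0.0, 8.0, 2.5])
rev_b = np.array([0.0, 5.0, 6.0, 22.0, 1.0, 0.0, 4.0])

# t-test restricted to the revenue-generating cases only:
print(stats.ttest_ind(rev_a[rev_a > 0], rev_b[rev_b > 0]))

# Sign test across all paired instances: under the null each method "wins"
# a pair with probability 0.5, so the win count is Binomial(n, 0.5); ties
# carry no information and are dropped. (binomtest needs scipy >= 1.7.)
diffs = rev_a - rev_b
wins = int(np.sum(diffs > 0))
n = int(np.sum(diffs != 0))
print(stats.binomtest(wins, n, 0.5))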

If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.

Ben Allison
  • Thanks @Ben-Allison for helping out - have edited my question to show the data we have and ultimately what we'd like to calculate. Would love to get your input! – Kali_89 Mar 11 '14 at 20:10
  • OK, so that looks better. Do you have a revenue entry for each visitor to each site, or just totals per site? That's the last piece of the puzzle, then I can help out with a solution – Ben Allison Mar 12 '14 at 21:26
  • I've got the revenue entry for each visitor. My current best bet is the Mann-Whitney test but I'm still very much looking for steer! – Kali_89 Mar 12 '14 at 21:54
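
(For reference, the Mann-Whitney test mentioned in that last comment is a one-liner in scipy; this minimal sketch uses made-up per-visitor revenue vectors:)

import scipy.stats as stats

# Made-up per-visitor revenue vectors - mostly zeros for non-converters.
visitors_a = [0, 0, 12.5, 0, 3.0, 0, 0, 8.0]
visitors_b = [0, 4.0, 0, 0, 0, 9.5, 0, 0]

# Mann-Whitney U tests whether one group's values tend to be larger,
# without assuming normality (ties among the zeros are handled by the test).
u_stat, p_val = stats.mannwhitneyu(visitors_a, visitors_b)
print(u_stat, p_val)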