I have a dataset that looks similar to this:
   USER_ID VARIANT_NAME  REVENUE
0      737      variant      1.5
1     2423      control      0.0
2     9411      control      0.0
3     7311      control      2.3
4     6174      variant      0.0
I have used the following to build a sampling distribution:
#Number of users in control group
n_con = df_control.shape[0]
#Number of users in variant group
n_var = df_variant.shape[0]
#Probability of generating revenue in control group
p_con = df_con_n0.shape[0] / df_control.shape[0]
#Probability of generating revenue in variant group
p_var = df_var_n0.shape[0] / df_variant.shape[0]
p_diffs = []
for i in range(10000):
    var_converted = np.random.choice([1, 0], size=n_var, p=(p_var, (1 - p_var)))
    var_unique, var_count = np.unique(var_converted, return_counts=True)
    var_ncon, var_con = np.split(var_count, 2)
    con_converted = np.random.choice([1, 0], size=n_con, p=(p_con, (1 - p_con)))
    con_unique, con_count = np.unique(con_converted, return_counts=True)
    con_ncon, con_con = np.split(con_count, 2)
    var_con_p = int(var_con) / int(var_con + var_ncon)
    con_con_p = int(con_con) / int(con_con + con_ncon)
    p_diffs.append(var_con_p - con_con_p)
p_diffs
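As an aside, each pass of the loop above effectively draws a count of converted users per group, which is a binomial draw, so the same simulation can probably be written without the loop. This is only a minimal sketch: the group sizes and rates below are made-up placeholders standing in for the real n_var, n_con, p_var and p_con.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, standing in for the real n_var, n_con, p_var, p_con
n_var, n_con = 5000, 5000
p_var, p_con = 0.02, 0.015

n_iter = 10000

# One binomial draw per iteration gives the number of converted users in each group
var_converted = rng.binomial(n_var, p_var, size=n_iter)
con_converted = rng.binomial(n_con, p_con, size=n_iter)

# Difference in simulated conversion rates, one value per iteration
p_diffs = var_converted / n_var - con_converted / n_con
```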
This works well for building a sampling distribution based on whether or not a user generated revenue. However, I would like to build one based on the average revenue instead, i.e. a sampling distribution of the difference in average revenue per user between the two groups, rather than of the difference in the share of users who generated any revenue.
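Roughly, what I have in mind might look like the sketch below. It rests on assumptions I have not verified for my data: it resamples each group's REVENUE values with replacement (a bootstrap), and the two small arrays are just the rows from the example table, standing in for something like df_control["REVENUE"].to_numpy() and df_variant["REVENUE"].to_numpy(). The function name bootstrap_mean_diffs is made up for illustration.

```python
import numpy as np

# Revenue per user in each group, taken from the example rows above
# (in practice something like df_control["REVENUE"].to_numpy())
rev_con = np.array([0.0, 0.0, 2.3])
rev_var = np.array([1.5, 0.0])


def bootstrap_mean_diffs(rev_var, rev_con, n_iter=10000, rng=None):
    """Resample each group with replacement and record the difference in mean revenue."""
    rng = np.random.default_rng() if rng is None else rng
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # Each resample has the same size as the original group
        var_sample = rng.choice(rev_var, size=rev_var.size, replace=True)
        con_sample = rng.choice(rev_con, size=rev_con.size, replace=True)
        diffs[i] = var_sample.mean() - con_sample.mean()
    return diffs


rng = np.random.default_rng(42)
mean_diffs = bootstrap_mean_diffs(rev_var, rev_con, rng=rng)

# Observed difference in average revenue per user (variant minus control)
observed_diff = rev_var.mean() - rev_con.mean()
```

Here mean_diffs would play the same role as p_diffs, except that each entry is a difference in average revenue rather than a difference in conversion rate.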
Cheers!