
Working in Python with pandas, I am trying to assign control and treatment groups to different groups of customers.

I have a large dataset. Instead of giving an example of the data, let me show you the pivot, since this summarizes the most important data.

pd.pivot_table(df,index=['Test Group'],values=["Customer_ID"],aggfunc=lambda x: len(x.unique()))

I get these counts:

Test Group        Customer_ID
Innovators               4634
Early Adopters           2622
Early Majority           8653
Late Majority            7645
Laggards                 7645
Lost                     4354
Prospective               653

I run the following code:

percentages = {'Innovators': [0.0, 1.0],
               'Early Adopters': [0.2, 0.8],
               'Early Majority': [0.1, 0.9],
               'Late Majority': [0.0, 1.0],
               'Laggards': [0.2, 0.8],
               'Lost': [0.1, 0.9],
               'Prospective': [0.1, 0.9]}

def assigner(gp):
    group = gp['Test Group'].iloc[0]
    cut = pd.qcut(
        np.arange(gp.shape[0]),
        q=np.cumsum([0] + percentages[group]),
        labels=range(len(percentages[group]))
    ).get_values()
    return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='flag')

df['flag'] = df.groupby('Test Group', group_keys=False).apply(assigner)

ValueError: Bin edges must be unique: array([   0,    0, 2621], dtype=int64).
You can drop duplicate edges by setting the 'duplicates' kwarg

... and I keep getting this error.
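The duplicate edge comes from the zero percentages: for a group like 'Innovators' with [0.0, 1.0], `np.cumsum([0] + [0.0, 1.0])` produces the quantile edges [0.0, 0.0, 1.0], and the repeated 0.0 is exactly what `qcut` rejects. A minimal reproduction:

```python
import numpy as np
import pandas as pd

# Quantile edges for a group whose first split is 0%:
edges = np.cumsum([0] + [0.0, 1.0])
print(edges)  # [0. 0. 1.] -- two identical edges at 0.0

# The same ValueError the question reports, on a small array:
try:
    pd.qcut(np.arange(10), q=edges, labels=[0, 1])
except ValueError as e:
    print(e)
```

Any group whose percentage list contains a 0.0 (here 'Innovators' and 'Late Majority') will hit this.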

I found this answer, which could be helpful: How to qcut with non unique bin edges? — but rank doesn't work on a NumPy array:

def assigner(gp):
    group = gp['Campaign Test Description'].iloc[0]
    cut = pd.qcut(
        np.arange(gp.shape[0]).rank(method='first'),
        q=np.cumsum([0] + percentages[group]),
        labels=range(len(percentages[group]))
    ).get_values()
    return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='flag')

AttributeError: 'numpy.ndarray' object has no attribute 'rank'

I tried dropping duplicates

def assigner(gp):
    group = gp['Campaign Test Description'].iloc[0]
    cut = pd.qcut(
        np.arange(gp.shape[0]),
        q=np.cumsum([0] + percentages[group]),
        labels=range(len(percentages[group])),
        duplicates='drop'
    ).get_values()
    return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='flag')

ValueError: Bin labels must be one fewer than the number of bin edges

Still getting an error.
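This second error follows from the first fix: `duplicates='drop'` removes the repeated edge, but the `labels` list keeps its original length, so `qcut` now sees more labels than bins. One workaround (a sketch, not the accepted answer below — the function signature here takes the `percentages` dict as a parameter to be self-contained) is to filter out the zero-width bins and their labels before building the edges:

```python
import numpy as np
import pandas as pd

def assigner(gp, percentages):
    group = gp['Test Group'].iloc[0]
    # Drop zero-percentage bins so the cumulative edges stay unique,
    # and keep only the labels of the bins that survive:
    pcts = [p for p in percentages[group] if p > 0]
    labels = [i for i, p in enumerate(percentages[group]) if p > 0]
    cut = np.asarray(pd.qcut(np.arange(gp.shape[0]),
                             q=np.cumsum([0] + pcts),
                             labels=labels))
    # Shuffle so assignment within the group is random:
    return pd.Series(cut[np.random.permutation(gp.shape[0])],
                     index=gp.index, name='flag')
```

For a group with [0.2, 0.8] and 10 rows this yields exactly 2 rows flagged 0 and 8 flagged 1; for [0.0, 1.0] all rows get flag 1.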

jeangelj
  • What does the output look like? Just different splits for each Customer_ID (100:0, 80:20, etc.)? Perhaps assign a random number between 0 and 1 for every row, and then use cut-offs based on Customer_ID to get the test flag. – Stev Feb 15 '18 at 21:10
  • yes, that's what the output looks like; but should qcut do that for me? – jeangelj Feb 15 '18 at 21:11
  • I'm not really familiar with that method but here is how I would tackle the problem: `df_pct = pd.DataFrame({ 'ID': ['Innovators','Early Adopters' ,'Early Majority','Late Majority','Laggards','Lost','Prospective'], 'test_cutoff':[1,0.8,0.9,0.1,0.8,0.9,0.9]})` `df=df.merge(df_pct)` `df['is_test'] = np.random.uniform(0, 1, len(df)) >= df['test_cutoff']` – Stev Feb 15 '18 at 21:35
  • do you mind posting this as an answer? If it works, I will upvote and accept – jeangelj Feb 15 '18 at 21:37

1 Answer

You are doing a train/test split, which is commonly used in machine learning. Here is a way to do it (double check that I have your percentages the right way around):

df_pct = pd.DataFrame({'ID': ['Innovators', 'Early Adopters', 'Early Majority',
                              'Late Majority', 'Laggards', 'Lost', 'Prospective'],
                       'test_cutoff': [1, 0.8, 0.9, 0.1, 0.8, 0.9, 0.9]})
df = df.merge(df_pct)
df['is_test'] = np.random.uniform(0, 1, len(df)) >= df['test_cutoff']

Also, your 'Late Majority' percentages don't add up to 100.
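The approach above can be sketched end to end on a toy frame (column names here are assumptions standing in for the real data; the seed makes the random assignment reproducible, and note that `np.random.uniform` draws from [0, 1), so a cutoff of 1.0 puts every row in the control group):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # fix the seed so the split is reproducible

# Toy stand-in for the real customer data:
df = pd.DataFrame({'ID': ['Innovators'] * 100 + ['Early Adopters'] * 100})

df_pct = pd.DataFrame({'ID': ['Innovators', 'Early Adopters'],
                       'test_cutoff': [1.0, 0.8]})
df = df.merge(df_pct)
df['is_test'] = np.random.uniform(0, 1, len(df)) >= df['test_cutoff']

# Check the per-group split:
print(df.groupby(['ID', 'is_test']).size())
```

With cutoff 1.0 no Innovators land in the test group, and roughly 20% of Early Adopters do, though not exactly 20% since the split is probabilistic rather than a fixed quota.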

Stev
  • typo, it was supposed to be 1.0 and not 0.1 – jeangelj Feb 15 '18 at 21:43
  • this actually worked, let me test it further if the cutoffs are correct and I will accept it, thank you – jeangelj Feb 15 '18 at 21:48
  • Don't sound so surprised! By the way, I found that simple train/test split from [here](https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/) a while ago. It seems to have been updated recently. You can use something like `df.groupby(['ID', 'is_test']).size()` to make sure the IDs are being grouped as you expect. – Stev Feb 15 '18 at 21:54
  • sorry! it's just that np generally doesn't split exactly – jeangelj Feb 15 '18 at 22:40
  • yeah, the cutoffs aren't exact, but it works and it's far cleaner – jeangelj Feb 15 '18 at 22:41
  • No problem, happy to help. Don't forget to set `random.seed()` if you want to repeat your group classification exactly, otherwise if you rerun the code, you will end up with a slightly different group classification due to the random numbers. – Stev Feb 15 '18 at 22:44