1

I have a dataframe that looks like thus:

    user_id reordered
0   128305  1
1   128305  1
2   128305  1
3   128305  0
4   128305  0
5   128305  0
6   128305  0
7   128305  0
8   144548  1
9   144548  1
10  144548  1
**etc...**

In this case, it's recording each product that a user ordered, and recording whether or not said ordering was in fact a re-order. What I need to do is figure out the ratio of new orders to re-orders per customer. I thought to do this by "extracting" the values where 'reordered' = 0, bringing them into a new dataframe, grouping both dataframes by the count of user_id, then bringing the two dataframes back together so that I could properly find the ratio of one to the other. However, since I had to sum them up before rejoining them, they no longer had user_id as an actual column, meaning that they could not be joined together simply. I got around this by cross joining the two on a temporary variable, which sort of worked...except that resulted in the data for one of the frames being repeated, uselessly. To visualize what's happening, this is what I want to get:

    Total Products (formed by performing a groupBy(user_id) on the original dataframe
user_id 
6550    23
27523   51
43048   40


New Orders (formed by performing a groupBy() on a dataframe containing only those rows where reordered = 0)
user_id 
6550    18
27523   25
43048   32

coming together to form:

user_id  TotalOrders  NewOrders
6550        23           18
27523       51           25
43048       40           32

and here's what I'm actually getting:

   Total Products   New Orders
0       23                 18
1       23                 25
2       23                 32

As you can see, only the first row (23 - 18) should be here - the other two are a result of the entire New Orders column being posted once for each value in the Total Products column. Is there a way around this? Is my initial groupBy()-ing to blame?

Thanks.

0 Answers0