I have a dataframe that looks like thus:
user_id reordered
0 128305 1
1 128305 1
2 128305 1
3 128305 0
4 128305 0
5 128305 0
6 128305 0
7 128305 0
8 144548 1
9 144548 1
10 144548 1
**etc...**
In this case, it's recording each product that a user ordered, and recording whether or not said ordering was in fact a re-order. What I need to do is figure out the ratio of new orders to re-orders per customer. I thought to do this by "extracting" the values where 'reordered' = 0, bringing them into a new dataframe, grouping both dataframes by the count of user_id, then bringing the two dataframes back together so that I could properly find the ratio of one to the other. However, since I had to sum them up before rejoining them, they no longer had user_id as an actual column, meaning that they could not be joined together simply. I got around this by cross joining the two on a temporary variable, which sort of worked...except that resulted in the data for one of the frames being repeated, uselessly. To visualize what's happening, this is what I want to get:
Total Products (formed by performing a groupBy(user_id) on the original dataframe
user_id
6550 23
27523 51
43048 40
New Orders (formed by performing a groupBy() on a dataframe containing only those rows where reordered = 0)
user_id
6550 18
27523 25
43048 32
coming together to form:
user_id TotalOrders NewOrders
6550 23 18
27523 51 25
43048 40 32
and here's what I'm actually getting:
Total Products New Orders
0 23 18
1 23 25
2 23 32
As you can see, only the first row (23 - 18) should be here - the other two are a result of the entire New Orders column being posted once for each value in the Total Products column. Is there a way around this? Is my initial groupBy()-ing to blame?
Thanks.