Let's say we have two tables, trans
and product
. Hypothetically the trans
table consists of over a billion rows of purchases bought by users.
I am trying to find paired products that are often purchased together(purchased on the same date) by the same user, such as wine and bottle openers, chips and beer, etc..
I am trying to find the top five paired products and their names.
trans and prod dataframe :-
trans = {'ID':[1,1,2,2,3,3,1,5,5,6,6,6],
'productID':[11,22,11,22,33,77,11,77,88,11,22,77],
'Year':['2022-01-01','2022-01-01','2020-01-05','2020-01-05','2019-01-01','2019-01-01','2020-01-07','2020-01-08',
'2020-01-08','2021-06-01','2021-06-01','2021-06-01']}
trans = pd.DataFrame(trans)
trans['Year'] = pd.to_datetime(trans['Year'])
trans
product = {'productID':[11,22,33,44,55,77,88],
'prodname':['phone','Charger','eaphones','headset','scratchgaurd','pin','cover']}
product = pd.DataFrame(product)
product
My code till now where was trying to Rank the items with same ID and Year and then try to get the product names.
transprod = pd.merge(trans,product,on='productID' , how='inner')
transprod
transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method = 'dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])
transprod
Desired Output:
Product 1 | Product 2 | Count
phone charger 3
Charger pin 1
eaphones pin 1
pin cover 1
Any help is really appreciated. Thanks in advance