Python pandas how to group and match

Question

Let's say we have two tables, trans and product. Hypothetically the trans table consists of over a billion rows of purchases bought by users.

I am trying to find paired products that are often purchased together(purchased on the same date) by the same user, such as wine and bottle openers, chips and beer, etc..

I am trying to find the top five paired products and their names.

trans and prod dataframe :-

trans = {'ID':[1,1,2,2,3,3,1,5,5,6,6,6],
        'productID':[11,22,11,22,33,77,11,77,88,11,22,77],
        'Year':['2022-01-01','2022-01-01','2020-01-05','2020-01-05','2019-01-01','2019-01-01','2020-01-07','2020-01-08',
                '2020-01-08','2021-06-01','2021-06-01','2021-06-01']}
trans = pd.DataFrame(trans)
trans['Year'] = pd.to_datetime(trans['Year'])
trans


product = {'productID':[11,22,33,44,55,77,88],
        'prodname':['phone','Charger','eaphones','headset','scratchgaurd','pin','cover']}
product = pd.DataFrame(product)
product

My code till now where was trying to Rank the items with same ID and Year and then try to get the product names.

transprod = pd.merge(trans,product,on='productID' , how='inner')
transprod


transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method = 'dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])
transprod

Desired Output:

Product 1 | Product 2 | Count
phone       charger      3
Charger      pin         1
eaphones     pin         1
pin         cover        1

Any help is really appreciated. Thanks in advance

Arne · Accepted Answer · 2022-01-13T01:42:47.770

You could group the transactions table by ID (and date) and list all product pairs for each order. itertools.combinations is useful here. By taking the set over an order first, you can ignore multiple equal items.

Since it does not matter in which order a pair appears, you could then construct a flat list of all the pairs and use a collections.Counter instance to count them. Sorting each pair first makes sure that you can disregard the order of items within a pair.

The product table can be transformed into a dictionary for easy lookup. This will provide a way to add the product names to the table of results.

from itertools import combinations
from collections import Counter

pairs_by_trans = trans.groupby(['ID', 'Year'])['productID'].agg(
                       lambda x: list(combinations(set(x), 2)))
pairs_flat = [tuple(sorted(pair)) for row in pairs_by_trans for pair in row]

counts = Counter(pairs_flat)
top_counts = pd.DataFrame(counts.most_common(5),
                          columns=['pair', 'count'])

prodname = {k: v for k, v in product.values}
top_counts['names'] = top_counts['pair'].apply(lambda x: (prodname[x[0]], 
                                                          prodname[x[1]]))

top_counts

    pair    count   names
0   (11, 22)    3   (phone, Charger)
1   (33, 77)    1   (eaphones, pin)
2   (77, 88)    1   (pin, cover)
3   (11, 77)    1   (phone, pin)
4   (22, 77)    1   (Charger, pin)

Hi @Arne by products purchased together, we mean bought on the same date , as this solution looks great can we also add a check to match the dates by any chance ? — Tushar, Jan 12 '22 at 22:45
Sure, you can pass a list of column names to the `groupby` method. I have edited the answer to show that. — Arne, Jan 13 '22 at 01:44

score 0 · Answer 2 · answered Jan 13 '22 at 00:13

The below solution works perfectly fine for me


transprod = pd.merge(trans,product,on='productID' , how='inner')



transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method = 'dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])


def checkprod(x):
    v1  = (x['Rank']==x['Rank'].shift(-1))
    
    return (x[v1 | v1.shift(1)])

out = transprod.groupby('ID').apply(checkprod).reset_index(drop=True)

pairs = out.groupby(['ID','Rank'])['prodname'].agg(
                       lambda x: list(combinations(set(x), 2)))


Counter(list(itertools.chain(*pairs)))

Python pandas how to group and match

2 Answers2