I have a table with customer purchasing product category. I am trying to build a cross-selling matrix counting the unique customer for each couple of product category, and have a total with unique count as well.
pandas.crosstab
is a good way to start, but fails on subtotals (i.e. margins=True
)
df = pd.DataFrame({
'cust': ['1', '1', '2', '3', '3', '4', '5'], # customer ID
'categ': ['a', 'b', 'a', 'a', 'b', 'b', 'b'] # category ID
})
# have 2 columns to make the crosstab
dd = pd.merge(df, df, on='cust')
Then pd.crosstab(dd.categ_x, dd.categ_y, margins=True)
gives:
| categ_x | a | b | All |
|---------|---|---|-----|
| a | 3 | 2 | 5 |
| b | 2 | 4 | 6 |
| All | 5 | 6 | 11 |
the pd.merge
helps to populate the cells with the right figures in the crosstable, but lead to bad counting for subtotals/margins
What I would expect is:
| categ_x | a | b | All |
|---------|---|---|-----|
| a | 3 | 2 | 3 | -- I have 3 unique clients with 'a'
| b | 2 | 4 | 4 | -- I have 4 unique clients with 'b'
| All | 3 | 4 | 5 | -- I have 5 unique clients in total
I have try some count, uniques... without much success so far.
EDIT
jezrael answer is fine, but I was wondering if their was a way to do it directly through crosstab
, using the right aggfunc
.