2

I have a table with customer purchasing product category. I am trying to build a cross-selling matrix counting the unique customer for each couple of product category, and have a total with unique count as well.

pandas.crosstab is a good way to start, but fails on subtotals (i.e. margins=True)

df = pd.DataFrame({
    'cust':  ['1', '1', '2', '3', '3', '4', '5'], # customer ID
    'categ': ['a', 'b', 'a', 'a', 'b', 'b', 'b']  # category ID
})
# have 2 columns to make the crosstab
dd = pd.merge(df, df, on='cust')

Then pd.crosstab(dd.categ_x, dd.categ_y, margins=True) gives:

| categ_x | a | b | All | 
|---------|---|---|-----| 
| a       | 3 | 2 | 5   | 
| b       | 2 | 4 | 6   | 
| All     | 5 | 6 | 11  | 

the pd.merge helps to populate the cells with the right figures in the crosstable, but lead to bad counting for subtotals/margins

What I would expect is:

| categ_x | a | b | All | 
|---------|---|---|-----| 
| a       | 3 | 2 | 3   | -- I have 3 unique clients with 'a'
| b       | 2 | 4 | 4   | -- I have 4 unique clients with 'b'
| All     | 3 | 4 | 5   | -- I have 5 unique clients in total

I have try some count, uniques... without much success so far.

EDIT

jezrael answer is fine, but I was wondering if their was a way to do it directly through crosstab, using the right aggfunc.

ant1j
  • 305
  • 2
  • 18

2 Answers2

5

I think this does what I need:

pd.crosstab(
    dd.categ_x, dd.categ_y, margins=True, 
    values=dd.cust, aggfunc=pd.Series.nunique
)

gives:

| categ_x | a | b | All |
|---------|---|---|-----|
| a       | 3 | 2 | 3   |
| b       | 2 | 4 | 4   |
| All     | 3 | 4 | 5   |

pd.Series.nunique gives the length/size of the unique values of values (here dd.cust).

ant1j
  • 305
  • 2
  • 18
1

You can count values by groupby.nunique and add values manually by join and append:

s = df2.groupby(['categ'])['cust'].nunique().rename('All')
s1 = s.append(pd.Series({'All': df2['cust'].nunique()}, name='All'))

df = pd.crosstab(dd.categ_x, dd.categ_y).join(s).append(s1)
print (df)
         a  b  All
categ_x           
a        3  2    3
b        2  4    4
All      3  4    5

Another similar solution:

s = df2.groupby(['categ'])['cust'].nunique().rename('All')
df = pd.crosstab(dd.categ_x, dd.categ_y).join(s).append(s)
df.loc['All','All'] = df2['cust'].nunique()
df = df.astype(int)
print (df)
         a  b  All
categ_x           
a        3  2    3
b        2  4    4
All      3  4    5
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks this is definitely useful. Iwas just wondering if it can be done in 1 call of `crosstab` – ant1j Sep 19 '18 at 01:14