2

I have a pandas df like the following:

User    Purchase_Count    Location_Count
1       2                 3
2       10                5
3       5                 1
4       20                4
5       2                 3
6       2                 3
7       10                5

How would I add a column that calculates the % of the coordinate pair (Purchse_Count[i], Location_Count[i]) of the total entries. so for example I would like the df to look like:

User    Purchase_Count    Location_Count    %
1       2                 3                 42.85
2       10                5                 28.57
3       5                 1                 14.28
4       20                4                 14.28
5       2                 3                 42.85
6       2                 3                 42.85
7       10                5                 28.57
piRSquared
  • 285,575
  • 57
  • 475
  • 624
Hashtag
  • 23
  • 3

2 Answers2

2

A pandas solution would be to use groupby then transform:

In [43]: df
Out[43]:
   User  Purchase_Count  Location_Count
0     1               2               3
1     2              10               5
2     3               5               1
3     4              20               4
4     5               2               3
5     6               2               3
6     7              10               5

In [44]: total = len(df)

In [45]: df['percentage'] = df.groupby(['Purchase_Count', 'Location_Count']).transform(lambda r: r.count()/total)

In [46]: df
Out[46]:
   User  Purchase_Count  Location_Count  percentage
0     1               2               3    0.428571
1     2              10               5    0.285714
2     3               5               1    0.142857
3     4              20               4    0.142857
4     5               2               3    0.428571
5     6               2               3    0.428571
6     7              10               5    0.285714

Edit to improve readability

In [53]: df['percentage'] = (df.groupby(['Purchase_Count', 'Location_Count'])
    ...:                     .transform(lambda r: r.count()/total))

In [54]: df
Out[54]:
   User  Purchase_Count  Location_Count  percentage
0     1               2               3    0.428571
1     2              10               5    0.285714
2     3               5               1    0.142857
3     4              20               4    0.142857
4     5               2               3    0.428571
5     6               2               3    0.428571
6     7              10               5    0.285714

Edit:

As suggested by @piRSquared, you could use:

df.groupby(['Purchase_Count', 'Location_Count']).transform('count') / total

Instead, and preliminary tests show it is significantly faster.

juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172
  • 2
    Since `total` is constant you should be able to get away with `df.groupby(['Purchase_Count', 'Location_Count']).transform('count') / total` Either way +1 – piRSquared Apr 06 '17 at 22:04
2

use groupby with size and join

cols = ['Purchase_Count', 'Location_Count']
df.join(df.groupby(cols).size().div(len(df)).rename('%'), on=cols)

   User  Purchase_Count  Location_Count         %
0     1               2               3  0.428571
1     2              10               5  0.285714
2     3               5               1  0.142857
3     4              20               4  0.142857
4     5               2               3  0.428571
5     6               2               3  0.428571
6     7              10               5  0.285714

old answer

Using pd.value_counts on tuples

tups = df[['Purchase_Count', 'Location_Count']].apply(tuple, 1)
df.assign(**{'%': tups.map(pd.value_counts(tups, normalize=True))})

   User  Purchase_Count  Location_Count         %
0     1               2               3  0.428571
1     2              10               5  0.285714
2     3               5               1  0.142857
3     4              20               4  0.142857
4     5               2               3  0.428571
5     6               2               3  0.428571
6     7              10               5  0.285714

timing

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624