
I am trying to count the number of times users look at pages in the same session.

I am starting with a data frame listing user_ids and the page slugs they have visited:

user_id page_view_page_slug
1       slug1
1       slug2
1       slug3
1       slug4
2       slug5
2       slug3
2       slug2
2       slug1

What I am looking to get is a pivot table counting user_ids for each cross-section of slugs:

. slug1 slug2 slug3 slug4 slug5
slug1 2 2 2 1 1
slug2 2 2 2 1 1
slug3 2 2 2 1 1
slug4 1 1 1 1 0
slug5 1 1 1 0 1

I realize the table will be symmetric (the count for slug1 and slug2 is the same as for slug2 and slug1), but I can't think of a better way. So far I have done a listagg:

def listagg(df, grouping_idx):
    return df.groupby(grouping_idx).agg(list)

new_df = listagg(df, 'user_id')

Returning:

          page_view_page_slug
user_id                                                   
1        [slug1, slug2, slug3, slug4]
2        [slug5, slug3, slug2, slug1]
7        [slug6, slug4, slug7]
9        [slug3, slug5, slug1]

But I am struggling to think of a loop to count when items appear in a list together (regardless of order) and how to store the counts. I also do not know how I would get the result into a pivotable format.
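One possible sketch of such a loop (illustrative only, not from the original post): count unordered slug pairs per user with `itertools.combinations_with_replacement` and a `Counter`, then unstack the pair counts into a matrix.

```python
from collections import Counter
from itertools import combinations_with_replacement

import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'page_view_page_slug': ['slug1', 'slug2', 'slug3', 'slug4',
                            'slug5', 'slug3', 'slug2', 'slug1'],
})

# Count unordered slug pairs (including each slug with itself) per user.
counts = Counter()
for _, slugs in df.groupby('user_id')['page_view_page_slug']:
    for a, b in combinations_with_replacement(sorted(slugs), 2):
        counts[(a, b)] += 1
        if a != b:  # mirror the off-diagonal entries so the matrix is symmetric
            counts[(b, a)] += 1

# Tuple keys become a MultiIndex; unstack turns them into a slug-by-slug matrix.
out = pd.Series(counts).unstack(fill_value=0).sort_index().sort_index(axis=1)
print(out)
```

On the sample data above this reproduces the desired pivot table, e.g. 2 for slug1/slug2 (both users) and 0 for slug4/slug5 (never seen together).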

young_matt
  • have you tried `df.pivot`? – ddg Feb 03 '21 at 22:23
  • Is there a possibility of repetitions in the data? For example, if `(user_id 1, slug1)` repeats, then what should be the result? For the two current answers, it either creates an extra row & column for `slug1` (repeats) or gets added twice and the top row & first column become `[5, 3, 3, 2, 1]`, which I think is more accurate. – aneroid Feb 14 '21 at 12:50

3 Answers


Here is another way, using numpy broadcasting: compare each value in `user_id` with every other value to build a boolean matrix, create a dataframe from this matrix with both index and columns set to `page_view_page_slug`, then sum on `level=0` along `axis=0` and `axis=1` to count the user_ids for each cross-section of slugs:

a = df['user_id'].values
i = list(df['page_view_page_slug'])

# Note: pandas >= 2.0 removed Series/DataFrame.sum(level=...);
# there, use .groupby(level=0).sum() (and transpose for the column axis).
pd.DataFrame(a[:, None] == a, index=i, columns=i)\
   .sum(level=0).sum(level=0, axis=1).astype(int)

       slug1  slug2  slug3  slug4  slug5
slug1      2      2      2      1      1
slug2      2      2      2      1      1
slug3      2      2      2      1      1
slug4      1      1      1      1      0
slug5      1      1      1      0      1
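To make the broadcasting step concrete, a small standalone sketch (made-up `user_id` values, just for illustration):

```python
import numpy as np

a = np.array([1, 1, 2, 2])

# a[:, None] has shape (4, 1); comparing it with a (shape (4,)) broadcasts
# to a (4, 4) boolean matrix that is True where two rows share a user_id.
m = a[:, None] == a
print(m.astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]
```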
Shubham Sharma
  • I think this one is better, because _if_ there is a repetition, such as `(user_id 1, slug1)`, the top row & first column become `[5, 3, 3, 2, 1]`, which I think is more accurate. In your other answer, it creates an extra row & column for `slug1` (repeats). _(Although, I think `[3, 3, 3, 2, 1]` would be the right result in that case.)_ Wait for OP to confirm [if there can be repetitions](https://stackoverflow.com/questions/66036510/creating-a-pandas-pivot-table-to-count-number-of-times-items-appear-in-a-list-to/66193929#comment117029555_66036510) and what the result should look like. – aneroid Feb 14 '21 at 12:55
  • Thanks @aneroid yeah it depends on OP although I haven't checked the results when there is repetition but I guess `[5, 3, 3, 2, 1]` seems more correct and really thanks for testing the code :) – Shubham Sharma Feb 14 '21 at 15:35

Let's try groupby and reduce:

from functools import reduce

dfs = [pd.DataFrame(1, index=list(s), columns=list(s))
       for _, s in df.groupby('user_id')['page_view_page_slug']]

df_out = reduce(lambda x, y: x.add(y, fill_value=0), dfs).fillna(0).astype(int)

Details:

Group the dataframe on `user_id`; then, for each user's group of `page_view_page_slug` values, create an adjacency dataframe whose index and columns are the slugs in that group.

>>> dfs

[       slug1  slug2  slug3  slug4
 slug1      1      1      1      1
 slug2      1      1      1      1
 slug3      1      1      1      1
 slug4      1      1      1      1,
        slug5  slug3  slug2  slug1
 slug5      1      1      1      1
 slug3      1      1      1      1
 slug2      1      1      1      1
 slug1      1      1      1      1]

Now reduce the above adjacency dataframes with `DataFrame.add`, passing `fill_value=0` so that labels present in one frame but missing from another are treated as 0; the reduction sums the overlaps, counting the user_ids for each cross-section of slugs.

>>> df_out

       slug1  slug2  slug3  slug4  slug5
slug1      2      2      2      1      1
slug2      2      2      2      1      1
slug3      2      2      2      1      1
slug4      1      1      1      1      0
slug5      1      1      1      0      1
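As a side note on the `fill_value=0` choice: `DataFrame.add` aligns on both index and columns, and a label missing from both operands still yields NaN (hence the trailing `fillna(0)`). A tiny sketch with made-up labels:

```python
import pandas as pd

x = pd.DataFrame(1, index=['a', 'b'], columns=['a', 'b'])
y = pd.DataFrame(1, index=['b', 'c'], columns=['b', 'c'])

# With fill_value=0, a label missing from only one frame contributes 0;
# cells missing from both (like ('a', 'c')) stay NaN until fillna(0).
out = x.add(y, fill_value=0).fillna(0).astype(int)
print(out)
#    a  b  c
# a  1  1  0
# b  1  2  1
# c  0  1  1
```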

Optionally you can wrap the above code in a function as follows:

def count(df):
    df_out = pd.DataFrame()
    for _, s in df.groupby('user_id')['page_view_page_slug']:
        df_out = df_out.add(
            pd.DataFrame(1, index=list(s), columns=list(s)), fill_value=0)
    return df_out.fillna(0).astype(int)

>>> count(df)

       slug1  slug2  slug3  slug4  slug5
slug1      2      2      2      1      1
slug2      2      2      2      1      1
slug3      2      2      2      1      1
slug4      1      1      1      1      0
slug5      1      1      1      0      1
Shubham Sharma
  • Nice one. And not a `pivot` in sight! +1 (Because it was obvious that a simple pivot doesn't do what's needed.) Is `s` the Series version of what the OP had at the top? – aneroid Feb 13 '21 at 16:46
  • 1
    @aneroid Thanks! `s` is `page_view_page_slug` per `user_id` so for example for `user_id=1` `s` would be series containing the slug values `['slug1', 'slug2', 'slug3', 'slug4'] ` similarly for `user_id=2` `s` would be a series containing the values `['slug5', 'slug3', 'slug2', 'slug1']` – Shubham Sharma Feb 13 '21 at 16:50

Let's use a self-join on user_id with merge, then pd.crosstab to count:

import pandas as pd
from io import StringIO

txt = StringIO("""user_id  page_view_page_slug
1       slug1
1       slug2
1       slug3
1       slug4
2       slug5
2       slug3
2       slug2
2       slug1""")

df = pd.read_csv(txt, sep=r'\s\s+', engine='python')

dfm = df.merge(df, on='user_id')
df_out = pd.crosstab(dfm['page_view_page_slug_x'], dfm['page_view_page_slug_y'])

df_out

Output:

page_view_page_slug_y  slug1  slug2  slug3  slug4  slug5
page_view_page_slug_x                                   
slug1                      2      2      2      1      1
slug2                      2      2      2      1      1
slug3                      2      2      2      1      1
slug4                      1      1      1      1      0
slug5                      1      1      1      0      1

For repeated data (the same user viewing the same slug more than once), let's try:

dfi = df.assign(v_count=df.groupby(['user_id', 'page_view_page_slug']).cumcount())

# Filter out unnecessary joins with query: keep pairs of different slugs, and
# keep a slug paired with itself only when the repetition counters match.
dfi = dfi.merge(dfi, on=['user_id'])\
         .query('page_view_page_slug_x != page_view_page_slug_y or '
                'page_view_page_slug_x == page_view_page_slug_y and v_count_x == v_count_y')

df_out = pd.crosstab(dfi['page_view_page_slug_x'], dfi['page_view_page_slug_y'])
df_out

Output:

page_view_page_slug_y  slug1  slug2  slug3  slug4  slug5
page_view_page_slug_x                                   
slug1                      3      3      3      2      1
slug2                      3      2      2      1      1
slug3                      3      2      2      1      1
slug4                      2      1      1      1      0
slug5                      1      1      1      0      1
Scott Boston
  • Just one merge and a crosstab. Nice! Didn't think to self-join +1. In case the OP responds about [repetition of data](https://stackoverflow.com/questions/66036510/creating-a-pandas-pivot-table-to-count-number-of-times-items-appear-in-a-list-to#comment117029555_66036510), what would be an ideal change? Currently, it gives `[5, 3, 3, 2, 1]` as the first row. – aneroid Feb 19 '21 at 02:24
  • Selecting this one as the bounty answer since it applies just `merge` and `crosstab` and less mind-bending to get the required result. – aneroid Feb 20 '21 at 12:48