-2

Relatively new to Py and this forum, thanks in advance for the help.

I am trying to get the last 5 actions customers did before leaving a page on the website.

if i have a data sample like this:

   index  session_uuid  timestamp   action
0      1             1          2  action1
1      2             1          4  action2
2      3             1          5  action3
3      4             1          7  action4
4      5             2          2  action1
5      6             2          4  action2
6      7             2         10  action3
7      8             2         15  action4

and the desired result will be:

session_uiid - action-1 - action-2 - action-3 - action-4 - action-5
1 action4 action3 action2 action1
2 action4 action3 action2 action1

Preferible py3, i've tried the df.tail() function but not sure how to group by each session and then transpose to different columns

cs95
  • 379,657
  • 97
  • 704
  • 746
TatoE
  • 1
  • 1
  • What if there are less than 5? Will those column values be empty? – cs95 May 17 '19 at 19:12
  • If there are less than 5 actions, there would not be a row for that action. It will be ok to display those rows as null or zero in the table. Thanks! @cs95 – TatoE May 17 '19 at 19:30

2 Answers2

0

Assuming session_uuid is the user and giving example of only last two actions. You can change 2 to 5 if you like.

import numpy as np
import pandas as pd

df = pd.DataFrame({'session_uuid': [1, 1, 1, 1, 2, 2, 2, 2],
          'timestamp': [2, 4, 5, 7, 2, 4, 10, 15],
          'action': ['action1', 'action2', 'action3', 'action4', 'action1', 'action2', 'action3', 'action4']})
print(df)
   session_uuid  timestamp   action
0             1          2  action1
1             1          4  action2
2             1          5  action3
3             1          7  action4
4             2          2  action1
5             2          4  action2
6             2         10  action3
7             2         15  action4

# first sort the values, then groupby users
df = df.sort_values(['session_uuid','timestamp'])
df1 = df.groupby('session_uuid')['action'].apply(lambda x: list(x)[-2:])
print(df1)
session_uuid
1    [action3, action4]
2    [action3, action4]

If you like to have dataframe instead of series:

df1 = df1.to_frame('action').reset_index()
print(df1)
   session_uuid              action
0             1  [action3, action4]
1             2  [action3, action4]

BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169
  • This was correct as well, the other solution prints each column separately. thanks for your help @I_am_Groot! – TatoE May 17 '19 at 23:30
0

df.tail() will return the end of the entire dataset. What you're looking for is a little more complex than that. Here's some example code that solves that problem and generalizes to K last rows:

import pandas as pd
import numpy as np

# create the dataset example
index = [1, 2, 3, 4, 5, 6, 7, 8]
session_uuid = [1, 1, 1, 1, 2, 2, 2, 2]
timestamp = [2, 4, 5, 7, 2, 4, 10, 15]
action = ["action1", "action2", "action3", "action4",
          "action1", "action2", "action3", "action4"]
df = pd.DataFrame(
    { 
        "index": index,
        "session_uuid": session_uuid,
        "timestamp": timestamp,
        "action": action
    }
)
# the number of `last` actions you want
k = 2
# the dataframe to return will have k columns that are numbered
final_df = pd.DataFrame(columns=np.arange(k))
# group by session_uuid and sort them by timestamp inside those groups.  Finally, get the last K rows in those sorted groups
last_k = df.groupby("session_uuid", as_index=False).apply(pd.DataFrame.sort_values, "timestamp").groupby(level=0).tail(k).groupby("session_uuid")
# this grabs the session_uuid in the same order as above so we can have that column in the new dataframe
uuids = df.groupby("session_uuid", as_index=False).groups.keys()

# go through each group (or each uuid)
for group in last_k:
    # grab the action values out of the tuple
    group = group[1]["action"]
    # add the last actions to the new DataFrame but reshape it to match the dimensions of the new DataFrame
    final_df = final_df.append(pd.Series(group.values.reshape(k)), ignore_index=True)

# add the UUID columns for reference and put it at the beginning
final_df.insert(loc=0, column="session_uuid", value=uuids)
print(final_df)

This code takes your example dataset and returns the last two (you can adjust k) actions for each group. If there are less than K values it fills the blank space with a NaN value.

Sample output looks like:

   session_uuid        0        1
0             1  action3  action4
1             2  action3  action4

Or if you have less than K actions:

   session_uuid        0        1
0             1  action1      NaN
1             2  action3  action4
OrionTheHunter
  • 276
  • 1
  • 7
  • 1
    This is amazing, thank you very much @OrionTheHunter. There was just a small change - added session_uuid in the sort_values. last_k = df.groupby("session_uuid", as_index=False).apply(pd.DataFrame.sort_values, ['session_uuid','action_ts']).groupby(level=0).tail(k).groupby("session_uuid") – TatoE May 17 '19 at 23:28
  • Glad I could help! – OrionTheHunter May 19 '19 at 18:13