How to get last 5 rows per column value on Pandas (5 actions before abandonment)

Question

Relatively new to Py and this forum, thanks in advance for the help.

I am trying to get the last 5 actions customers did before leaving a page on the website.

if i have a data sample like this:

   index  session_uuid  timestamp   action
0      1             1          2  action1
1      2             1          4  action2
2      3             1          5  action3
3      4             1          7  action4
4      5             2          2  action1
5      6             2          4  action2
6      7             2         10  action3
7      8             2         15  action4

and the desired result will be:

session_uiid - action-1 - action-2 - action-3 - action-4 - action-5
1 action4 action3 action2 action1
2 action4 action3 action2 action1

Preferible py3, i've tried the df.tail() function but not sure how to group by each session and then transpose to different columns

What if there are less than 5? Will those column values be empty? — cs95, May 17 '19 at 19:12
If there are less than 5 actions, there would not be a row for that action. It will be ok to display those rows as null or zero in the table. Thanks! @cs95 — TatoE, May 17 '19 at 19:30

BhishanPoudel · Answer 1 · 2019-05-17T20:29:33.367

Assuming session_uuid is the user and giving example of only last two actions. You can change 2 to 5 if you like.

import numpy as np
import pandas as pd

df = pd.DataFrame({'session_uuid': [1, 1, 1, 1, 2, 2, 2, 2],
          'timestamp': [2, 4, 5, 7, 2, 4, 10, 15],
          'action': ['action1', 'action2', 'action3', 'action4', 'action1', 'action2', 'action3', 'action4']})
print(df)
   session_uuid  timestamp   action
0             1          2  action1
1             1          4  action2
2             1          5  action3
3             1          7  action4
4             2          2  action1
5             2          4  action2
6             2         10  action3
7             2         15  action4

# first sort the values, then groupby users
df = df.sort_values(['session_uuid','timestamp'])
df1 = df.groupby('session_uuid')['action'].apply(lambda x: list(x)[-2:])
print(df1)
session_uuid
1    [action3, action4]
2    [action3, action4]

If you like to have dataframe instead of series:

df1 = df1.to_frame('action').reset_index()
print(df1)
   session_uuid              action
0             1  [action3, action4]
1             2  [action3, action4]

This was correct as well, the other solution prints each column separately. thanks for your help @I_am_Groot! — TatoE, May 17 '19 at 23:30

score 0 · Accepted Answer · answered May 17 '19 at 20:52

df.tail() will return the end of the entire dataset. What you're looking for is a little more complex than that. Here's some example code that solves that problem and generalizes to K last rows:

import pandas as pd
import numpy as np

# create the dataset example
index = [1, 2, 3, 4, 5, 6, 7, 8]
session_uuid = [1, 1, 1, 1, 2, 2, 2, 2]
timestamp = [2, 4, 5, 7, 2, 4, 10, 15]
action = ["action1", "action2", "action3", "action4",
          "action1", "action2", "action3", "action4"]
df = pd.DataFrame(
    { 
        "index": index,
        "session_uuid": session_uuid,
        "timestamp": timestamp,
        "action": action
    }
)
# the number of `last` actions you want
k = 2
# the dataframe to return will have k columns that are numbered
final_df = pd.DataFrame(columns=np.arange(k))
# group by session_uuid and sort them by timestamp inside those groups.  Finally, get the last K rows in those sorted groups
last_k = df.groupby("session_uuid", as_index=False).apply(pd.DataFrame.sort_values, "timestamp").groupby(level=0).tail(k).groupby("session_uuid")
# this grabs the session_uuid in the same order as above so we can have that column in the new dataframe
uuids = df.groupby("session_uuid", as_index=False).groups.keys()

# go through each group (or each uuid)
for group in last_k:
    # grab the action values out of the tuple
    group = group[1]["action"]
    # add the last actions to the new DataFrame but reshape it to match the dimensions of the new DataFrame
    final_df = final_df.append(pd.Series(group.values.reshape(k)), ignore_index=True)

# add the UUID columns for reference and put it at the beginning
final_df.insert(loc=0, column="session_uuid", value=uuids)
print(final_df)

This code takes your example dataset and returns the last two (you can adjust k) actions for each group. If there are less than K values it fills the blank space with a NaN value.

Sample output looks like:

   session_uuid        0        1
0             1  action3  action4
1             2  action3  action4

Or if you have less than K actions:

   session_uuid        0        1
0             1  action1      NaN
1             2  action3  action4

This is amazing, thank you very much @OrionTheHunter. There was just a small change - added session_uuid in the sort_values. last_k = df.groupby("session_uuid", as_index=False).apply(pd.DataFrame.sort_values, ['session_uuid','action_ts']).groupby(level=0).tail(k).groupby("session_uuid") — TatoE, May 17 '19 at 23:28

How to get last 5 rows per column value on Pandas (5 actions before abandonment)

if i have a data sample like this:

and the desired result will be:

2 Answers2