2

I asked a similar, yet simpler, question previously but realized later that this does not solve my problem. I feel that the required edits to the question are too severe for a simple edit of the question, especially with two valid answers already, so I will let it stay up and instead ask a new one:

I have the following DataFrame of interactions (products viewed and/or bought) by different customers, at different times:

import pandas as pd
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["c12987"]*5 + ["c89563"]*2 + ["c56733"]
articles = ["a8473", "a7631", "a1264", "a8473", "a5641", "a9813", "a7631", "a1132"]
action_type = ["viewed", "purchased", "viewed", "purchased", "viewed", "viewed", "purchased", "viewed"]

interaction_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles, "Interaction": action_type}) 
interaction_history

Output:

    Customer_no Date                Article_no  Interaction
0   c12987      2019-02-24 00:00:00 a8473       viewed
1   c12987      2019-02-24 00:01:00 a7631       purchased
2   c12987      2019-02-24 00:02:00 a1264       viewed
3   c12987      2019-02-24 00:03:00 a8473       purchased
4   c12987      2019-02-24 00:04:00 a5641       viewed
5   c89563      2019-03-13 00:00:00 a9813       viewed
6   c89563      2019-03-13 00:01:00 a7631       purchased
7   c56733      2019-02-27 00:00:00 a1132       viewed

I would like to, for each customer and row, get the previous articles viewed as well as the previous articles bought.

Expected output:

    Customer_no Date                Article_no  Interaction Prev_viewed     Prev_purchased
0   c12987      2019-02-24 00:00:00 a8473       viewed      []              []
1   c12987      2019-02-24 00:01:00 a7631       purchased   [a8473]         []
2   c12987      2019-02-24 00:02:00 a1264       viewed      [a8473]         [a7631]
3   c12987      2019-02-24 00:03:00 a8473       purchased   [a8473, a1264]  [a7631]
4   c12987      2019-02-24 00:04:00 a5641       viewed      [a8473, a1264]  [a7631, a8473]
5   c89563      2019-03-13 00:00:00 a9813       viewed      []              []
6   c89563      2019-03-13 00:01:00 a7631       purchased   [a9813]         []
7   c56733      2019-02-27 00:00:00 a1132       viewed      []              []

I realize that I could iterate over each row with a custom function like interaction_history.apply(lambda x: my_custom_function(x), axis=1) where my_custom_function(x) would, for each row, filter through the entire interaction_history to find the matching Customer_no, Interaction and appropriate dates. I also realize this solution would be highly inefficient and very complex, thus hoping that someone has any other ideas!

Marcus
  • 943
  • 5
  • 21
  • groupby allows for organizing the file by each customer_no. If customer_no is unique for each individual, you can easily perform actions on each customer to determine transaction history – itprorh66 Jan 11 '21 at 19:37

2 Answers2

2

You can create a function that creates a new column doing the required according to viewed and purchased. The key here is to make the -Article_no rows in list format, so that you can use cumsum to cumulatively add each item to lists depending on viewed or purchased:

def previous(df, string):
    df['Article_no'] = df['Article_no'].str.split()
    col = 'Prev_' + string
    df[col] = (df[df['Interaction'].eq(string)].groupby('Customer_no')
               ['Article_no'].apply(lambda x: x.cumsum()))
    df[col] = df.groupby('Customer_no')[col].shift()
    df[col] = df.groupby('Customer_no')[col].ffill()
    df[col] = df[col].mask(df[col].isnull(), df[col].apply(lambda x: []))
    df['Article_no'] = df['Article_no'].str.join('')


previous(interaction_history, 'viewed')
previous(interaction_history, 'purchased')
Out[1]: 
    Customer_no Date                Article_no  Interaction Prev_viewed     Prev_purchased
0   c12987      2019-02-24 00:00:00 a8473       viewed      []              []
1   c12987      2019-02-24 00:01:00 a7631       purchased   [a8473]         []
2   c12987      2019-02-24 00:02:00 a1264       viewed      [a8473]         [a7631]
3   c12987      2019-02-24 00:03:00 a8473       purchased   [a8473, a1264]  [a7631]
4   c12987      2019-02-24 00:04:00 a5641       viewed      [a8473, a1264]  [a7631, a8473]
5   c89563      2019-03-13 00:00:00 a9813       viewed      []              []
6   c89563      2019-03-13 00:01:00 a7631       purchased   [a9813]         []
7   c56733      2019-02-27 00:00:00 a1132       viewed      []              []
David Erickson
  • 16,433
  • 2
  • 19
  • 35
2
  • groupby() Customer_no to generate lists of each of the attributes
  • explode() Date to get back to same row count
  • cumcount() to get relative index within Customer_no
  • list comprehensions to build lists you want
import pandas as pd
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["c12987"]*5 + ["c89563"]*2 + ["c56733"]
articles = ["a8473", "a7631", "a1264", "a8473", "a5641", "a9813", "a7631", "a1132"]
action_type = ["viewed", "purchased", "viewed", "purchased", "viewed", "viewed", "purchased", "viewed"]

df = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles, "Interaction": action_type}) 
df = df.merge((df.groupby("Customer_no").agg(lambda x: list(x)).explode("Date")
        .reset_index(drop=False)
        .assign(i=lambda dfa: dfa.groupby("Customer_no").cumcount()
            ,prev_viewed=lambda dfa: dfa.apply(lambda r: [r["Article_no"][i] for i in range(r["i"]) if r["Interaction"][i]=="viewed"], axis=1)
            ,prev_purcahsed=lambda dfa: dfa.apply(lambda r: [r["Article_no"][i] for i in range(r["i"]) if r["Interaction"][i]=="purchased"], axis=1)
        )
        .drop(columns=["Article_no","Interaction","i"])
        )
    ,on=["Customer_no","Date"]
)

output

Customer_no                Date Article_no Interaction     prev_viewed  prev_purcahsed
     c12987 2019-02-24 00:00:00      a8473      viewed              []              []
     c12987 2019-02-24 00:01:00      a7631   purchased         [a8473]              []
     c12987 2019-02-24 00:02:00      a1264      viewed         [a8473]         [a7631]
     c12987 2019-02-24 00:03:00      a8473   purchased  [a8473, a1264]         [a7631]
     c12987 2019-02-24 00:04:00      a5641      viewed  [a8473, a1264]  [a7631, a8473]
     c89563 2019-03-13 00:00:00      a9813      viewed              []              []
     c89563 2019-03-13 00:01:00      a7631   purchased         [a9813]              []
     c56733 2019-02-27 00:00:00      a1132      viewed              []              []
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Thanks for your effort. It works as expected, although David Erickson's answer is significantly faster for large dataframes, thus I will accept that answer. – Marcus Jan 12 '21 at 14:35