
Please help to find an optimal solution for this task.

We have a pandas dataframe with two main date columns and many others (and more than 20 million rows).

Here is a toy example of the dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})

The input df

We need to group by and transform the dataframe at the client level, but with the condition that we only work within a certain window for each client: a row counts only if date1 + 12 months <= sample_date.

The outcome would be a new column count_products in the df dataframe. For the toy example above the values would be 1, 2, NaN, 2 (following the logic clarified in the update below).

Below is my code. It works, but it is very slow. Please help me optimize it using pandas methods that I'm still not aware of!

# initializing the outcome column
df['count_products'] = np.nan

for i in range(df.shape[0]):
    # rows whose date1 + 12 months is on or before this row's sample_date
    df_temp = df[(df['date1'] + pd.DateOffset(months=12)) <= df['sample_date'].iloc[i]]
    # keep only rows of the same client as the current row
    df_temp = df_temp[df_temp['clients'] == df['clients'].iloc[i]]
    if len(df_temp) > 0:
        df.loc[df.index[i], 'count_products'] = df_temp['products'].count()

I would appreciate any help!

Latest update 31.05.2023: Additional dataset:

df = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

The result should be this:

       date1 sample_date  clients  products  count_products
0 2018-06-08  2018-05-31  client1  product1             NaN
1 2019-07-30  2019-07-24  client1  product2             1.0
2 2021-07-07  2021-06-28  client1  product3             2.0
3 2020-01-11  2021-12-30  client2  product4             1.0

Latest update: the logic is, for each client & sample_date pair, to count the number of products whose date1 is at least 12 months before that sample_date (i.e. date1 + 12M <= sample_date).
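A minimal sketch of that logic, assuming a per-client broadcast comparison (the helper name count_products_within_window is illustrative, not part of the original post):

import numpy as np
import pandas as pd

def count_products_within_window(g):
    # all cutoffs for this client: date1 + 12 months
    cutoffs = (g['date1'] + pd.DateOffset(months=12)).to_numpy()
    samples = g['sample_date'].to_numpy()
    # compare each sample_date against ALL cutoffs of the same client
    counts = (cutoffs[None, :] <= samples[:, None]).sum(axis=1).astype(float)
    counts[counts == 0] = np.nan  # no qualifying products -> NaN
    return pd.Series(counts, index=g.index)

df['count_products'] = df.groupby('clients', group_keys=False).apply(count_products_within_window)

On the second dataset this yields NaN, 1, 2, 1. The broadcast is quadratic in the size of each client's group, so on more than 20 million rows it may need chunking or a sort-based approach.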

Ani

3 Answers


If you vectorize everything, it might be faster. You also need cumcount and first (this assumes the first date1 per client is the earliest).

import pandas as pd
import numpy as np

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                  'clients': ['client1', 'client2', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

df1 = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

# earliest (first) date1 per client, broadcast to every row
df['date0'] = df.groupby('clients')['date1'].transform('first')

# rows whose client's first date1 is at least 12 months before sample_date
vec = (df['date0'] + pd.DateOffset(months=12)) <= df['sample_date']

# running 1-based count of qualifying products within each client
df.loc[vec, 'count'] = df.loc[vec].groupby('clients')['products'].transform('cumcount') + 1

print(df)

df1['date0'] = df1.groupby('clients')['date1'].transform('first')

vec1 = (df1['date0'] + pd.DateOffset(months=12)) <= df1['sample_date']

df1.loc[vec1, 'count'] = df1.loc[vec1].groupby('clients')['products'].transform('cumcount') + 1

print(df1)

Output

       date1 sample_date  clients  products      date0  count
0 2021-04-15  2022-04-30  client1  product1 2021-04-15    1.0
1 2020-05-01  2022-04-30  client2  product2 2020-05-01    1.0
2 2022-12-31  2022-01-30  client1  product3 2021-04-15    NaN
3 2020-11-01  2021-12-30  client2  product4 2020-05-01    2.0
       date1 sample_date  clients  products      date0  count
0 2018-06-08  2018-05-31  client1  product1 2018-06-08    NaN
1 2019-07-30  2019-07-24  client1  product2 2018-06-08    1.0
2 2021-07-07  2021-06-28  client1  product3 2018-06-08    2.0
3 2020-01-11  2021-12-30  client2  product4 2020-01-11    1.0
Vons
  • Vons, thank you very much, but the code has the same problem as Emma's code: masking by vec is not correct. You need to compare each sample date with ALL (date1 + 12M) values of the same client. vec only compares values that are on the same row. – Ani May 30 '23 at 21:04
  • @Ani could you check if it is working correctly now? – Vons May 31 '23 at 02:40
  • Vons, thank you very much. I tested your code; it fails because you compare sample_date only with date0, the minimum date1 per client. But it needs to be compared with each date1 at the client level. – Ani May 31 '23 at 08:06

Your filter and groupby code looks good; you just need to apply it to the entire dataframe.

# rows whose own date1 + 12 months is on or before their sample_date (row-wise check)
mask = (df['date1'] + pd.DateOffset(months=12)) <= df['sample_date']
df.loc[mask, 'count'] = df.loc[mask].groupby('clients')['products'].transform('count')

Update

IIUC, for each clients & sample_date pair, you want to count products whose date1 is at least 12 months before the sample_date.

If that is the case, you can do a self-join and count with a condition.

# self-join: attach every date1 of the same client to each row
df = df.merge(df[['clients', 'date1']], on='clients', how='left', suffixes=('', '_y'))

# joined date1 values that are at least 12 months before the row's sample_date
mask = (df.date1_y + pd.DateOffset(months=12)) <= df.sample_date
df.loc[mask, 'count_products'] = (df.loc[mask].groupby(['clients', 'sample_date'])
                                  .products
                                  .transform('count'))
# spread the count to all rows of the pair, then collapse back to one row per pair
df['count_products'] = (df.groupby(['clients', 'sample_date'])
                        .count_products
                        .transform('max'))
df = (df.drop_duplicates(subset=['clients', 'sample_date'])
      .drop('date1_y', axis=1))

Result

       date1 sample_date  clients  products  count_products
0 2018-06-08  2018-05-31  client1  product1             NaN
3 2019-07-30  2019-07-24  client1  product2             1.0
6 2021-07-07  2021-06-28  client1  product3             2.0
9 2020-01-11  2021-12-30  client2  product4             1.0
Emma
  • Hi Emma, thank you very much! Your code gives the right result on this toy dataset, but it is logically incorrect; it fails on other data. The mask is incorrect, because each sample date of each client should be compared with ALL other values of (date1 + 12M) of the same client. – Ani May 30 '23 at 20:57
  • do you want to compare with max/min date1 per client? could you update your post to have a representative data to show case why this code doesn't work and elaborate on the logic? – Emma May 30 '23 at 21:29
  • sure, thanks for the reply, I will update it soon. – Ani May 30 '23 at 21:34
  • Emma, I just added another example of the dataset – Ani May 30 '23 at 21:51
  • could you add the expected output for the new data and explain the logic? I still don't understand which data you intend to compare against. – Emma May 30 '23 at 21:54
  • the output is also added! – Ani May 30 '23 at 21:55
  • also added some explanation. – Ani May 30 '23 at 22:19
  • Emma, thank you very much, you are absolutely right that for each clients & sample_date pair, we need to count products whose date1 is at least 12 months before it. I ran your code, but it gives incorrect output. It returns NaN in every row for client1. Can't figure out why! Could you please help again? – Ani May 31 '23 at 07:07
  • df.loc[mask] returns empty df for client1 – Ani May 31 '23 at 09:29
  • Try checking the `df[(df.clients == 'client1') & (df.sample_date == '2021-06-28')]` right after the join and see 1) whether you have all rows joined, and 2) how many rows match the date condition. – Emma May 31 '23 at 14:30

The answer is the following:

  1. you can use iterrows() — if you have to iterate in pandas at all, iterate only with iterrows()

  2. there is a better way:

def calc_func(partition):
    # for each row, count this client's products whose date1 + 12 months
    # is on or before the row's sample_date
    partition['count_products'] = partition.apply(
        lambda row: partition.loc[(partition['date1'] + pd.DateOffset(months=12)) <= row['sample_date'], 'products'].count(),
        axis=1)
    return partition

result_df = df.groupby('clients').apply(calc_func)
df = df.merge(result_df.groupby(['clients', 'sample_date'], as_index=False)['count_products'].first(),
              on=['clients', 'sample_date'])
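For reference, a quick sanity check of this approach on the second dataset (df1 as defined in the first answer). Note that count() on an empty selection yields 0 rather than NaN, so per the stated logic the counts should come out as 0, 1, 2, 1:

# apply the same per-client function to the second toy dataset
result_df1 = df1.groupby('clients').apply(calc_func)
print(result_df1[['clients', 'sample_date', 'count_products']])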
Ani