
Main_df

product_id viewed_date  viewed_storeA    viewed_storeB    viewed_storeA_first  time_delta   
323224     2019-04-01   2019-04-01 08:01 2019-04-01 08:20 True                 00:19:00
942234     2019-04-01   2019-04-01 08:13 2019-04-01 08:43 True                 00:30:00
424244     2019-04-01   2019-04-01 07:20 2019-04-01 08:20 True                 01:00:00
749249     2019-04-02   2019-04-02 06:00 2019-04-02 07:30 True                 01:30:00
224345     2019-04-02   2019-04-02 06:00 2019-04-02 08:00 True                 02:00:00

Desired output df

viewed_date  viewed_storeA_first_count time_delta_mean 
2019-04-01   3                         00:36:00
2019-04-02   2                         01:45:00

This is what I have tried so far, but I get the following error: `No numeric types to aggregate`

df_grouped = df.groupby('viewed_date') \
       .agg({'viewed_storeA_first':'count', 'time_delta':'mean'}) \
       .rename(columns={'viewed_storeA_first':'viewed_storeA_first_count','time_delta':'time_delta_mean'}) \
       .reset_index()
user12625679
  • what is the dtype of `time_delta`? also I guess this is similar to [Finding the mean and standard deviation of a timedelta object in pandas df](https://stackoverflow.com/questions/44616546/finding-the-mean-and-standard-deviation-of-a-timedelta-object-in-pandas-df) – anky Jul 28 '20 at 07:08
  • the dtype is timedelta64 – user12625679 Jul 28 '20 at 07:13
  • 1
    okay, in that case can you try the answer in the linked post – anky Jul 28 '20 at 07:14
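
For reference, a rough sketch of the conversion suggested in that linked post, assuming df is the Main_df above and time_delta is already a timedelta64 column: convert the deltas to seconds, average per date, then wrap the result back into a timedelta for display.

import pandas as pd

# assumes df already holds a timedelta64 column named time_delta
mean_seconds = df.groupby('viewed_date')['time_delta'].apply(
    lambda s: s.dt.total_seconds().mean())  # mean delta in seconds per date

# wrap the float seconds back into timedeltas for display
print(pd.to_timedelta(mean_seconds, unit='s'))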

1 Answer

Currently, your time_delta column has a timedelta64 dtype, but to aggregate it with mean here it first needs to be converted to a numeric type, e.g. total minutes:
import pandas as pd

data = {'product_id': [323224, 942234, 424244, 749249, 224345],
        'viewed_date': ['2019-04-01', '2019-04-01', '2019-04-01', '2019-04-02', '2019-04-02'],
        'viewed_storeA': ['2019-04-01 08:01', '2019-04-01 08:13', '2019-04-01 07:20', '2019-04-02 06:00', '2019-04-02 06:00'],
        'viewed_storeB': ['2019-04-01 08:20', '2019-04-01 08:43', '2019-04-01 08:20', '2019-04-02 07:30', '2019-04-02 08:00'],
        'viewed_storeA_first': [True, True, True, True, True]}

df = pd.DataFrame(data)

# convert columns to datetime
for col in df.columns[1:4]:
    df[col] = pd.to_datetime(df[col])

# calculate time_delta in whole minutes as a float
df['time_delta'] = (df.viewed_storeB - df.viewed_storeA).dt.total_seconds() / 60

# groupby and perform aggregations
dfg = (df.groupby(['viewed_date'])
         .agg({'viewed_storeA_first': 'count', 'time_delta': 'mean'})
         .rename(columns={'viewed_storeA_first': 'viewed_storeA_first_count',
                          'time_delta': 'time_delta_mean'})
         .reset_index())

# optional: round the minutes
dfg.time_delta_mean = dfg.time_delta_mean.apply(round)

# convert time_delta_mean back to a timedelta
dfg.time_delta_mean = pd.to_timedelta(dfg.time_delta_mean, unit='m')

# display(dfg)
  viewed_date  viewed_storeA_first_count time_delta_mean
0  2019-04-01                          3        00:36:00
1  2019-04-02                          2        01:45:00
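
A related sketch, assuming a reasonably recent pandas release: there, groupby mean is supported directly on a timedelta64 column, so the numeric round-trip can be skipped (the "No numeric types to aggregate" error typically comes from older versions).

# assumption: a newer pandas version that can average timedelta64 in a groupby
df['time_delta'] = df.viewed_storeB - df.viewed_storeA  # keep as timedelta64
dfg2 = (df.groupby('viewed_date')
          .agg({'viewed_storeA_first': 'count', 'time_delta': 'mean'})
          .rename(columns={'viewed_storeA_first': 'viewed_storeA_first_count',
                           'time_delta': 'time_delta_mean'})
          .reset_index())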
Trenton McKinney