0

I'm having trouble getting the mean value of a timedelta column.

My data looks like this:

user          date           Flag    Value        
0    ron  12/23/2016        'flag'    0 days 10:08:00     
1    ron  12/21/2016        'n/a'     0 days 08:00:00      
2    ron   12/23/2016       'flag'    0 days 10:08:00     
3    ron  12/21/2016        'n/a'     0 days 02:00:00      
4   andy   12/22/2016       'flag'    0 days 10:00:00     
5   andy   12/22/2016       'flag'    0 days 10:00:00     

I would like to generate an Avg column by taking the mean of Value per user based on Flag==flag. So the data would like this:

user          date           Flag    Value                   Avg
0    ron  12/23/2016        'flag'    0 days 10:08:00     0 days 10:08:00
1    ron  12/21/2016        'n/a'     0 days 08:00:00     0 days 10:08:00
2    ron   12/23/2016       'flag'    0 days 10:08:00     0 days 10:08:00
3    ron  12/21/2016        'n/a'     0 days 02:00:00     0 days 10:08:00
4   andy   12/22/2016       'flag'    0 days 10:00:00     0 days 10:00:00
5   andy   12/22/2016       'flag'    0 days 10:00:00     0 days 10:00:00

I have this code which produces a data error:

sample.loc[:,'Value'] = pd.to_timedelta(sample['Value'])
sample.loc[:,'Avg'] = sample['user'].map(sample[sample['Flag']=='flag'].groupby('user')['Value'].mean())

But this is the error i get:

DataError: No numeric types to aggregate

I'm not sure why it's saying this when I've converted Value to timedelta. Guidance is appreciated.

paranormaldist
  • 489
  • 5
  • 16
  • Does this answer your question? [Finding the mean and standard deviation of a timedelta object in pandas df](https://stackoverflow.com/questions/44616546/finding-the-mean-and-standard-deviation-of-a-timedelta-object-in-pandas-df) – Bruno Mello Mar 26 '20 at 15:41
  • Try this: https://stackoverflow.com/questions/44616546/finding-the-mean-and-standard-deviation-of-a-timedelta-object-in-pandas-df – Bruno Mello Mar 26 '20 at 15:41

1 Answers1

2

There are different problems here.

  1. You want to compute an average value per user from a subset. Ok: filter the relevant rows, use a groupby and mean
  2. You want that value to be applied to all the values for the user. The normal way is to reindex before the groupby with the original index and use a transform after the groupby
  3. You are processing a Timedelta column. You must convert it to a numerical column. The trick here is that you should use an integer type but want to be able to use NaN values, so we have to convert twice, first time to int64, then to float64

It finally gives:

df['mean'] = pd.to_timedelta(df.loc[df['Flag'] == "'flag'", 'Value']
                             .astype('int64').astype('float64')
                             .reindex(df.index).groupby(df['user'])
                             .transform('mean'))

It gives:

   user       date    Flag    Value     mean
0   ron 2016-12-23  'flag' 10:08:00 10:08:00
1   ron 2016-12-21   'n/a' 08:00:00 10:08:00
2   ron 2016-12-23  'flag' 10:08:00 10:08:00
3   ron 2016-12-21   'n/a' 02:00:00 10:08:00
4  andy 2016-12-22  'flag' 10:00:00 10:00:00
5  andy 2016-12-22  'flag' 10:00:00 10:00:00

Nota: above assumes that the datatype of Value is timedelta64[ns] (pd.Timedelta). If not, you must first convert it to Timedelta with:

df['Value'] = pd.to_timedelta(df['Value'])
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252