
I'm trying to get the mean of each column while grouped by id, BUT for the calculation only the middle 50% of values, between the 25% quantile and the 75% quantile, should be used. (So ignore the lowest 25% of values and the highest 25%.)

The data:

ID       Property1   Property2   Property3
1        10.2        ...         ...
1        20.1
1        51.9
1        15.8
1        12.5
...
1203     104.4
1203     11.5
1203     19.4
1203     23.1

What I tried:

data.groupby('id').quantile(0.75).mean()
# data.groupby('id').agg(lambda grp: grp.quantile(0.25, 0.75)).mean()  # something like that?
CW            67.089733
fd             0.265917
fd_maxna   -1929.522001
fd_maxv    -1542.468399
fd_sumna   -1928.239954
fd_sumv    -1488.165382
planc        -13.165445
slope         13.654163

Something like that, but GroupBy.quantile doesn't accept an in-between range to my knowledge, and I don't know how to remove the lower 25% as well. This also doesn't return a DataFrame.

What I want
Ideally, I would like to have a dataframe as follows:

ID       Property1   Property2   Property3
1        37.8        5.6         2.3
2        33.0        1.5         10.4
3        34.9        91.5        10.3
4        33.0        10.3        14.3

Where only the data between the 25% quantile and the 75% quantile are used for the mean calculation. So only the 50% in between.

Hans Vader

4 Answers


You can use the quantile function to return multiple quantiles. Then you can filter out values based on these bounds and compute the mean:

def filter_mean(df):
    # per-column 25% and 75% quantiles of the group
    bounds = df.quantile([.25, .75])
    # keep only values strictly inside the interquartile range
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = data.groupby("id").apply(filter_mean)
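For reference, a minimal usage sketch with filter_mean as defined above (toy values made up for illustration; depending on your pandas version the grouping column may come back as an all-NaN column that you can drop):

import pandas as pd

# toy frame with two groups
data = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Property1": [10.2, 20.1, 51.9, 15.8, 104.4, 11.5, 19.4, 23.1],
})

# one row per id with the trimmed column means
means = data.groupby("id").apply(filter_mean)
print(means)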
Horace

Please try this.

def mean_of_25_to_75_pct(s: pd.Series):
    # per-column quantile bounds
    low, high = s.quantile(.25), s.quantile(.75)
    # average only the values inside [25%, 75%)
    return s.loc[(s >= low) & (s < high)].mean()

data.groupby("id").apply(lambda x: x.apply(mean_of_25_to_75_pct))
Kosuke Sakai

You could use SciPy's ready-made function for the trimmed mean, trim_mean():

from scipy import stats

means = data.groupby("id").apply(stats.trim_mean, 0.25)

If you insist on getting a DataFrame, you could:

data.groupby("id").agg(lambda x: stats.trim_mean(x, 0.25)).reset_index()
Giora Simchoni

Using GroupBy.apply here can be slow, so suppose this is your data frame:

print(df)
     ID  Property3   Property2   Property1
0     1       10.2   58.337589   45.083237
1     1       20.1   70.844807   29.423138
2     1       51.9   67.126043   90.558225
3     1       15.8   17.478715   41.492485
4     1       12.5   18.247211   26.449900
5  1203      104.4  113.728439  130.698964
6  1203       11.5   29.659894   45.991533
7  1203       19.4   78.910591   40.049054
8  1203       23.1   78.395974   67.345487

So I would use GroupBy.cumcount + DataFrame.pivot_table to calculate quantiles without using apply:

df['aux'] = df.groupby('ID').cumcount()
new_df = df.pivot_table(columns='ID', index='aux',
                        values=['Property1', 'Property2', 'Property3'])
print(new_df)

     Property1              Property2             Property3       
ID        1           1203       1           1203      1      1203
aux                                                               
0    45.083237  130.698964  58.337589  113.728439      10.2  104.4
1    29.423138   45.991533  70.844807   29.659894      20.1   11.5
2    90.558225   40.049054  67.126043   78.910591      51.9   19.4
3    41.492485   67.345487  17.478715   78.395974      15.8   23.1
4    26.449900         NaN  18.247211         NaN      12.5    NaN
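(The NaNs appear because the two groups have different sizes; both quantile and mean in pandas ignore NaN, so they don't distort the per-group result.)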

# remove aux column
df = df.drop('aux', axis=1)

Now we calculate the mean using boolean indexing:

new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()

           ID  
Property1  1       59.963006
           1203    70.661294
Property2  1       49.863814
           1203    45.703292
Property3  1       15.800000
           1203    21.250000
dtype: float64

or create DataFrame with the mean:

mean_df = (new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))]
           .mean()
           .rename_axis(index=['Property', 'ID'])
           .unstack('Property'))
print(mean_df)

Property  Property1  Property2  Property3
ID                                       
1         41.492485  58.337589      15.80
1203      56.668510  78.653283      21.25
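Note that the strict < and > comparisons drop values exactly equal to the 25% or 75% quantile, which is why these numbers can differ slightly from answers that use >= at a bound, or from trim_mean, which trims a fixed count of observations per tail rather than filtering by value.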

Measure times:

%%timeit
df['aux'] = df.groupby('ID').cumcount()
new_df = df.pivot_table(columns='ID', index='aux',
                        values=['Property1', 'Property2', 'Property3'])
df = df.drop('aux', axis=1)
(new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))]
 .mean()
 .rename_axis(index=['Property', 'ID'])
 .unstack('Property'))

25.2 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()

df.groupby("ID").apply(lambda x: x.apply(mean_of_25_to_75_pct))

33 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = df.groupby("ID").apply(filter_mean)

23 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is nearly the fastest even on this small data frame; on larger data frames, such as your original one, it would be much faster than the other proposed methods. See: when to use apply

ansev