4

I have the following issue with python pandas (I am relatively new to it): I have a simple dataset with a column for date, and a corresponding column of values. I am able to sort this Dataframe by date and value by doing the following:

df = df.sort_values(['date', 'value'],ascending=False)

I obtain this:

date       value
2019-11    100
2019-11    89
2019-11    87
2019-11    86   
2019_11    45
2019_11    33
2019_11    24
2019_11    11
2019_11    8
2019_11    5
2019-10    100 
2019-10    98
2019-10    96
2019-10    94
2019_10    94
2019_10    78
2019_10    74
2019_10    12
2019_10    3
2019_10    1

Now, what I would like to do, is to get rid of the lowest fifth percentile for the value column for EACH month (each group). I know that I should use a groupby method, and perhaps also a function:

df = df.sort_values(['date', 'value'],ascending=False).groupby('date', group_keys=False).apply(<???>)

The ??? is where I am struggling. I know how to suppress the lowest 5th percentile on a sorted Dataframe as a WHOLE, for instance by doing:

df = df[df.value > df.value.quantile(.05)]

This was the object of another post on StackOverflow. I know that I can also use numpy to do this, and that it is much faster, but my issue is really how to apply that to EACH GROUP independently (each portion of the value column sorted by month) in the Dataframe, not just the whole Dataframe.

Any help would be greatly appreciated Thank you so very much, Kind regards, Berti

Berti1989
  • 185
  • 1
  • 14

2 Answers2

7

Use GroupBy.transform with lambda function for Series with same size like original DataFrame, so possible filter by boolean indexing:

df = df.sort_values(['date', 'value'],ascending=False)

q = df.groupby('date')['value'].transform(lambda x: x.quantile(.05))
df = df[df.value > q]
print (df)
       date  value
4   2019_11     45
5   2019_11     33
6   2019_11     24
7   2019_11     11
8   2019_11      8
14  2019_10     94
15  2019_10     78
16  2019_10     74
17  2019_10     12
18  2019_10      3
0   2019-11    100
1   2019-11     89
2   2019-11     87
10  2019-10    100
11  2019-10     98
12  2019-10     96
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could create your own function and apply it:

def remove_bottom_5_pct(arr):
    thresh = np.percentile(arr, 5)
    return arr[arr > thresh]

df.groupby('date', sort=False)['value'].apply(remove_bottom_5_pct)

[out]

date       
2019-11  0     100
         1      89
         2      87
         3      86
         4      45
         5      33
         6      24
         7      11
         8       8
2019-10  10    100
         11     98
         12     96
         13     94
         14     94
         15     78
         16     74
         17     12
         18      3
Name: value, dtype: int64
Chris Adams
  • 18,389
  • 4
  • 22
  • 39