1

I have a data frame with some numerical values and a date-timestamp.

What I would like to do is aggregate the data into monthly intervals outputting a max percentile value for each month.

What I have been doing so far is just using:

df = df.resample('M', on='ds').max()

Which gives me the max value for that month. However, from what I can see in my data there are usually one or two spikes in each month. The result is that by using max() I will get that spike value - which is not correct. So I way to filter out the few high value peaks I was wondering if I could use a percentile function instead of max(), .e.g:

np.percentile(df['y'], 99)

As far as I can see the resample function does not provide the option to use own functions. But I might be wrong? In any case, how can this be accomplished ?

Denver Dang
  • 2,433
  • 3
  • 38
  • 68

1 Answers1

2

Use custom lambda function in GroupBy.agg:

df = df.resample('M', on='ds')['y'].agg(lambda x: np.percentile(x, 99))
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252