I have a function in Python that I use to distribute values when upsampling. For example, to upsample the distances my car has driven from per-month to per-day values:
def distribute(df, freq: str):
    # if there's an easier way please do comment
    df_new = df.resample(freq).asfreq().fillna(0)
    return df_new.groupby(pd.Grouper(freq=df.index.freq)).transform(np.mean)
import pandas as pd
import numpy as np
distances = pd.Series([300, 300], pd.period_range('2020-02', freq='M', periods=2))
distribute(distances, 'D')
2020-02-01 10.344828
2020-02-02 10.344828
2020-02-03 10.344828
2020-02-04 10.344828
... ...
2020-03-28 9.677419
2020-03-29 9.677419
2020-03-30 9.677419
2020-03-31 9.677419
Freq: D, dtype: float64
The function divides each month's value evenly over the number of days in that month, so the 2020-02 value is divided by 29 and the 2020-03 value by 31, which is what I want in this case.
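As a quick check of that arithmetic, the per-day figures can be reproduced by dividing each monthly value by the number of days in its month. This is only a sketch to confirm the numbers; `days` and `per_day` are names I made up for it:

```python
import pandas as pd

distances = pd.Series([300, 300], pd.period_range('2020-02', freq='M', periods=2))

# Number of calendar days in each month of the index: 29 for 2020-02, 31 for 2020-03.
days = distances.index.days_in_month.to_numpy()

# Value that every day of the month receives when the month is split evenly.
per_day = distances / days
print(per_day)
```

This gives 300/29 for February and 300/31 for March, matching the daily values in the output above.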
However, when upsampling to a frequency in which the periods have a non-uniform duration, this gives me an unwanted result. Two situations with this property:
- Year-to-month:
distances2 = pd.Series([366], pd.PeriodIndex(['2020'], freq='Y'))
distribute(distances2, 'M')
2020-01 30.5
2020-02 30.5
... ...
2020-11 30.5
2020-12 30.5
Freq: M, dtype: float64
What I want is for the year's value to be divided over the months, with each month receiving a fraction in proportion to its duration. That is, I want the year value x to be split over the months as 31/366 * x, 29/366 * x, etc.:
2020-01 31
2020-02 29
...
2020-11 30
2020-12 31
Freq: M, dtype: float64
Is there a way to do that?
- DST
The second situation concerns DST transitions, and it is actually already visible in my initial example: 2020-03-29 is 1h shorter than the other March days in my timezone, so it should receive a smaller fraction of the March value than the other days.
Although it's the same kind of problem as situation 1, I suspect it's going to be a lot harder to address.
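To make the wanted behaviour concrete, here is a rough sketch of what "in proportion to its duration" could look like for both situations. It is not an existing pandas feature: `distribute_by_duration` and its `tz` parameter are hypothetical names, and `'Europe/Berlin'` is only an assumed example of a timezone whose DST change falls on 2020-03-29. The idea is to measure each sub-period between its start and the start of the next sub-period, optionally in local time, and to use those lengths as weights.

```python
import pandas as pd


def distribute_by_duration(s, freq, tz=None):
    """Split each value of a period-indexed Series over sub-periods of `freq`,
    weighted by the duration of each sub-period (hypothetical helper)."""
    pieces = []
    for period, value in s.items():
        # First and last sub-period that fall inside this source period.
        first = period.asfreq(freq, how="start")
        last = period.asfreq(freq, how="end")
        sub = pd.period_range(start=first, end=last)

        # Each sub-period runs from its own start to the start of the next one.
        starts = sub.to_timestamp(how="start")
        ends = (sub + 1).to_timestamp(how="start")
        if tz is not None:
            # Assumption: every boundary exists exactly once in `tz`
            # (true for midnight boundaries in most zones, not for all).
            starts = starts.tz_localize(tz)
            ends = ends.tz_localize(tz)

        # Durations in seconds; with a timezone, a DST day is 23 h or 25 h long.
        seconds = (ends - starts).total_seconds().to_numpy()
        pieces.append(pd.Series(value * seconds / seconds.sum(), index=sub))
    return pd.concat(pieces)


# Situation 1: year to month, weights 31/366, 29/366, ...
yearly = pd.Series([366], pd.PeriodIndex(['2020'], freq='Y'))
monthly = distribute_by_duration(yearly, 'M')

# Situation 2: month to day, with day lengths measured in local time.
distances = pd.Series([300, 300], pd.period_range('2020-02', freq='M', periods=2))
daily = distribute_by_duration(distances, 'D', tz='Europe/Berlin')
```

With these inputs, `monthly` holds each month's number of days, and in `daily` the value for 2020-03-29 is 23/743 of the March total while the other March days each get 24/743. The loop over source periods is not vectorised, and sub-daily target frequencies would need extra handling of nonexistent or ambiguous local times, so I would treat this as a statement of the desired result rather than a finished solution.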
EDIT: I have found a way to solve situation 1, but not situation 2; see my answer below this question. Help is still appreciated in improving my answer and extending it to the second situation.
If we find a robust way to do this that turns out to be a bit elaborate, it might be a good feature to request (or to try adding myself with a pull request), as it seems like a useful addition: extending the PeriodIndexResampler API with a .distribute() method that provides this functionality, alongside the .ffill(), .sum(), .mean(), etc. methods.