
I have a function in Python that I use to distribute values when upsampling. For example, to upsample the distance my car has driven from monthly to daily values:

import pandas as pd
import numpy as np

def distribute(df, freq: str):
    # if there's an easier way please do comment
    # upsample to the new frequency; rows without a source value become 0
    df_new = df.resample(freq).asfreq().fillna(0)
    # average within each original period, spreading its value evenly
    return df_new.groupby(pd.Grouper(freq=df.index.freq)).transform(np.mean)

distances = pd.Series([300, 300], pd.period_range('2020-02', freq='M', periods=2))
distribute(distances, 'D')
2020-02-01    10.344828
2020-02-02    10.344828
2020-02-03    10.344828
2020-02-04    10.344828
...                 ...
2020-03-28     9.677419
2020-03-29     9.677419
2020-03-30     9.677419
2020-03-31     9.677419
Freq: D, dtype: float64

The function divides each month's value evenly over the number of days in that month, so the 2020-02 value is divided by 29 and the 2020-03 value by 31, which is what I want in this case.


However, when upsampling to a frequency in which the periods have a non-uniform duration, this gives me an unwanted result. Two situations with this property:

  1. Year-to-month:
distances2 = pd.Series([366], pd.PeriodIndex(['2020'], freq='Y'))
distribute(distances2, 'M')
2020-01    30.5
2020-02    30.5
...         ...
2020-11    30.5
2020-12    30.5
Freq: M, dtype: float64

What I want is for the year's value to be divided over the months, with each month receiving a fraction in proportion to its duration. That is, I want the year value x to be split over the months as 31/366 * x, 29/366 * x, etc.:

2020-01    31
2020-02    29
           ...
2020-11    30
2020-12    31
Freq: M, dtype: float64

Is there a way to do that?
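
One possible way to express the proportional split described above, as a minimal sketch rather than a built-in pandas feature: compute the duration of every target period from its start times and use those durations as weights. The function name `distribute_by_duration` is hypothetical, and the sketch assumes the input is a Series with a PeriodIndex whose periods are longer than the target frequency.

```python
import pandas as pd

def distribute_by_duration(s, freq):
    """Split each value of `s` over sub-periods of `freq`, weighted by their length.

    Hypothetical helper; assumes `s` has a PeriodIndex coarser than `freq`.
    """
    pieces = []
    for period, value in s.items():
        # All sub-periods (e.g. months) contained in this period (e.g. a year).
        sub = pd.period_range(period.asfreq(freq, how="start"),
                              period.asfreq(freq, how="end"),
                              freq=freq)
        # Length of each sub-period: start of the next one minus its own start.
        durations = (sub + 1).start_time - sub.start_time
        # Fraction of the parent period covered by each sub-period.
        weights = (durations / durations.sum()).to_numpy()
        pieces.append(pd.Series(value * weights, index=sub))
    return pd.concat(pieces)

distances2 = pd.Series([366], pd.PeriodIndex(["2020"], freq="Y"))
result = distribute_by_duration(distances2, "M")
```

For the leap year 2020 the weights are 31/366, 29/366, and so on, so `result` matches the wanted output up to floating-point rounding. Because the durations come from naive timestamps, this sketch treats every day as 24 hours long.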

  2. DST

The second situation is in DST transitions, and it's actually already showing in my initial example. 2020-03-29 is 1h shorter than the other March days in my timezone, so it should actually receive a smaller fraction of the March value than the other days.

Although it's the same kind of problem as situation 1, I suspect it's going to be a lot harder to address.


EDIT: I have found a way to solve situation 1, but not situation 2; see my answer below this question. Help is still appreciated in improving my answer and extending it to the second situation.


If we find a robust way to do this that is a bit elaborate, it might be a good feature to request (or to try adding myself with a pull request), as it seems to be a useful addition: extending the PeriodIndexResampler API with a .distribute() method that has this functionality, alongside the .ffill(), .sum(), .mean() etc. methods.

ElRudi
  • Your first question is simple enough, and yes, there is a shorter method for it: `dist /= dist.index.days_in_month` and `result = dist.resample('D').asfreq().ffill()`. As for your second question, it isn't quite clear what the requirement is; could you please add a data sample or open a new question? – cs95 Jul 19 '20 at 22:15
  • Hmm, for your second question, you can upsample to months first (evenly dividing x by 12), then you can use the solution in my first comment which further upsamples from month to day. That should work, no? – cs95 Jul 19 '20 at 22:19
  • Thanks @cs95 for your answer. Is there a way to get the divider (`dist.index.days_in_month`) any other way? I want to turn this into a function with the new freq as a parameter, so it's not always from `M`onths to `D`ays. I've added an example for question 2; what you suggest does not work here. – ElRudi Jul 19 '20 at 22:30
  • Not unless you infer the frequency of the existing distribution. Then you can create a function that picks the correct divider. There are ways to infer the frequency pretty easily; some quick googling will find you the answer (not at my desk, so I can't share links myself). – cs95 Jul 19 '20 at 22:33

1 Answer


As a partial answer, to tackle problem 1, I found that I can first upsample to a higher-frequency period that has a uniform duration, e.g., 'D', and then downsample to the wanted frequency:

def distribute(df, freq:str):
    # now it's really wild
    df_D = df.resample('D').asfreq().fillna(0)
    df_D_spread = df_D.groupby(pd.Grouper(freq=df.index.freq)).transform(np.mean)
    return df_D_spread.groupby(pd.Grouper(freq=freq)).sum()

distances2 = pd.Series([366], pd.PeriodIndex(['2020'], freq='Y'))
distribute(distances2, 'M')
2020-01    31.0
2020-02    29.0
...         ...
2020-11    30.0
2020-12    31.0
Freq: M, dtype: float64

However, some remarks:

  • It's quite convoluted and hard to read. It's also wasteful of memory, expanding a year into 365 or 366 rows. There must be a better way?

  • It does not address the second problem.
    In fact, I chose 'D' instead of 'H' because periods don't support timezones. In reality, not all days are of equal length, and whenever we find a way to include that fact, we might need to change the code to upsample to 'H' (or even '15T' for fractional timezones).
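
For the second remark, one possible direction is to leave periods behind for the last step and work with timezone-aware timestamps, whose differences reflect the real length of each day. The following is only a sketch under stated assumptions: the function name `distribute_over_local_days` is hypothetical, the timezone `Europe/Berlin` is an assumed example (the question does not name one), the target frequency is fixed to days, and the result is indexed by tz-aware timestamps instead of periods.

```python
import pandas as pd

def distribute_over_local_days(s, tz):
    """Split each value of `s` over local calendar days, weighted by real day length.

    Hypothetical helper; assumes `s` has a PeriodIndex (e.g. monthly) and that
    local midnight exists and is unambiguous in `tz` on every day involved.
    """
    pieces = []
    for period, value in s.items():
        # Naive midnights bounding every day of the period, then localized,
        # so that each edge is local midnight in `tz`.
        edges = pd.date_range(period.start_time, (period + 1).start_time,
                              freq="D").tz_localize(tz)
        # Real elapsed time per day: 23 h or 25 h on DST transition days.
        durations = edges[1:] - edges[:-1]
        weights = (durations / durations.sum()).to_numpy()
        pieces.append(pd.Series(value * weights, index=edges[:-1]))
    return pd.concat(pieces)

march = pd.Series([300], pd.period_range("2020-03", freq="M", periods=1))
daily = distribute_over_local_days(march, "Europe/Berlin")  # assumed timezone
```

In this assumed timezone, March 2020 has 743 hours, so 2020-03-29 receives 23/743 of the month's value and every other day 24/743. Timezones that switch at midnight would make `tz_localize` raise an error and need separate handling.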

ElRudi