1

I have a recurring pandas issue, which I'd like to solve by wrapping the .resample method. I just can't figure out, how.

Background (not essential)

I have timezone-aware timeseries, e.g.:

s = pd.Series([5,19,-4], pd.date_range('2020-10-01', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))
s

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-02 00:00:00+02:00   19
2020-10-03 00:00:00+02:00   -4
Freq: D, dtype: int64

And I want to resample to hours. If I just use s.resample('H').sum(), the final 23 hours get dropped (also addressed in this question):

s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
Freq: H, Length: 49, dtype: int64

Current "Solution"

I have written a custom resample2 function to correct this:

def resample2(df, freq, func):
    if type(df.index) != pd.DatetimeIndex:
        return df.resample(freq).apply(func)
    else: 
        #add one row
        idx = [df.index[-1] + df.index.freq]
        if type(df) == pd.DataFrame:
            df = df.append(pd.DataFrame([[None] * len(df.columns)], idx))
        elif type(df) == pd.Series:
            df = df.append(pd.Series([None], idx))
        df = df.resample(freq).apply(func)
        return df.iloc[:-1] #remove one row

This works:

resample2(s, 'H', np.sum)

2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
2020-10-03 01:00:00+02:00    0
...
2020-10-03 23:00:00+02:00    0
Freq: H, Length: 72, dtype: int64

but has 2 issues:

  • The usage is much different from the standard usage (resample2(s, 'H', np.sum) vs s.resample('H').sum(), and
  • I can't use all the functions I previously could. For example, resample2(s, 'H', s.resample.ffill) gives an error.

Question

Is there a way to wrap the functionality of the DataFrame.resample and Series.resample methods, so, that they can continue to work as they always have, just with the 'append a row before resampling, remove final row after resampling' functionality shown in my resample2 function?

ElRudi
  • 2,122
  • 2
  • 18
  • 33
  • (PS: The more satisfying solution would be to use a `PeriodIndex`, but they don't support timezones.) – ElRudi Sep 08 '20 at 11:17

1 Answers1

0

Issue 1 (The usage is much different from the standard usage):

Short of locally customizing your pandas package, I think what you are doing is close to the best you can do. I am not aware of any parameters of resample which allow this, and I'm not sure how you can customize the existing method of a DataFrame / Series.

But there may be a way to make your function into more of a helper which is used to pre- or post-process the data relative to resampling. Here is an alternative implementation of your function:

def allday_resample(df, freq, func):
    df = df.copy()
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1)

    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

    r = df.resample(freq).apply(func)
    return r[(r.index >= begin) &
             (r.index < end)]

This is very similar to your resample2 but with a few changes (improvements?):

  • Using df = df.copy(), it is explicit that we are returning a new object and not modifying the original data coming in (this could be changed)
  • It handles Series and DataFrames the same (so no need for an if-else)
  • It gives the complete values for the start and end days - I saw that resample2 could give differential results if your start/end timestamps were not at midnight (maybe this is moot if your data are always on midnight). See this example:
# now starting at 10:00
>>> s = pd.Series([5,19,-4], pd.date_range('2020-10-01 10:00', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))
>>> resample2(s, 'H', np.sum)

2020-10-01 10:00:00+02:00     5
2020-10-01 11:00:00+02:00     5
2020-10-01 12:00:00+02:00     5
2020-10-01 13:00:00+02:00     5
2020-10-01 14:00:00+02:00     5
                             ..
2020-10-04 05:00:00+02:00    -4
2020-10-04 06:00:00+02:00    -4
2020-10-04 07:00:00+02:00    -4
2020-10-04 08:00:00+02:00    -4
2020-10-04 09:00:00+02:00    -4
Freq: H, Length: 72, dtype: object

# missing timestamps for Oct 1st, and timestamps carried over into Oct 4th despite no original data on that day

I call this allday_resample because it is ensuring that the starting day, ending day, and all days in between are filled with the input freq. This might be more complicated if you wanted to resample to the minute, but only wanted the data to be filled to the hour (you would need a hierarchy of time frequency offsets to select). But I'm assuming for now you just are concerned with taking daily data and resampling to hourly.

>>> s = pd.Series([5,19,-4], pd.date_range('2020-10-01', freq='D', periods=3, tz='Europe/Berlin', name='ts_left'))
>>> allday_resample(s, 'H', np.sum)
ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H, Length: 72, dtype: float64

But we could instead move it's steps into a function to edit our data before resampling, such that when we resample, we get the same output:

def preprocess(df):
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1) - pd.Timedelta('1s')
    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

Here, the incoming data is modified in place (and the function returns nothing). There is also a small step of subtracting 1 second (an arbitrary small delta) from the ceiling of the end date, so that we don't include any data from the subsequent day when we resample.

With this function you can do:

>>> preprocess(s)
>>> s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H, Length: 72, dtype: float64

Issue 2 (I can't use all the functions I previously could):

This is less tricky - You can still access these by using their string names rather than some other function (e.g. np.sum in your example). So for forward fill, you can do the following (with your resample2 as is):

>>> resample2(s, 'H', 'ffill')
2020-10-01 00:00:00+02:00     5
2020-10-01 01:00:00+02:00     5
2020-10-01 02:00:00+02:00     5
2020-10-01 03:00:00+02:00     5
2020-10-01 04:00:00+02:00     5
                             ..
2020-10-03 19:00:00+02:00    -4
2020-10-03 20:00:00+02:00    -4
2020-10-03 21:00:00+02:00    -4
2020-10-03 22:00:00+02:00    -4
2020-10-03 23:00:00+02:00    -4
Freq: H, Length: 72, dtype: object

By my eye/brief testing, doing x.resample().sum() and x.resample().apply('sum') are equivalent. See my question and someone else's answer about this here. And see the docs under Resampler.apply().. And above, when I used np.sum, I could have used 'sum'.

Tom
  • 8,310
  • 2
  • 16
  • 36