This is a somewhat broad topic, but I will try to pare it down to some specific questions.

I have noticed a difference between resample and groupby that I am curious to learn about. Here is some hourly time series data:

In[]:
import pandas as pd

dr = pd.date_range('01-01-2020 8:00', periods=10, freq='H')
df = pd.DataFrame({'A':range(10),
                   'B':range(10,20),
                   'C':range(20,30)}, index=dr)
df

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
2020-01-01 10:00:00  2  12  22
2020-01-01 11:00:00  3  13  23
2020-01-01 12:00:00  4  14  24
2020-01-01 13:00:00  5  15  25
2020-01-01 14:00:00  6  16  26
2020-01-01 15:00:00  7  17  27
2020-01-01 16:00:00  8  18  28
2020-01-01 17:00:00  9  19  29

I can downsample the data using either groupby with a pandas.Grouper that has a freq, or resample (which seems the more typical thing to do):

g = df.groupby(pd.Grouper(freq='2H'))
r = df.resample(rule='2H')

My impression was that these two were essentially the same thing (correct me if I am wrong, but isn't resample a rebranded groupby?). However, I have found that when using the apply method of each grouped object, you can index specific columns in the "DataFrameGroupBy" object g but not in the "Resampler" object r:

def foo(d):
    return(d['A'] - d['B'] + 2*d['C'])

In[]:
g.apply(foo)

Out[]:
2020-01-01 08:00:00  2020-01-01 08:00:00    30
                     2020-01-01 09:00:00    32
2020-01-01 10:00:00  2020-01-01 10:00:00    34
                     2020-01-01 11:00:00    36
2020-01-01 12:00:00  2020-01-01 12:00:00    38
                     2020-01-01 13:00:00    40
2020-01-01 14:00:00  2020-01-01 14:00:00    42
                     2020-01-01 15:00:00    44
2020-01-01 16:00:00  2020-01-01 16:00:00    46
                     2020-01-01 17:00:00    48
dtype: int64

In[]:
r.apply(foo)

Out[]:
#long multi-Exception error stack ending in:
KeyError: 'A'

It looks like the data d that the apply "sees" is different in each case, as shown by:

def bar(d):
    print(d)

In[]:
g.apply(bar)

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
... #more DataFrames corresponding to each bin

In[]:
r.apply(bar)

Out[]:
2020-01-01 08:00:00    0
2020-01-01 09:00:00    1
Name: A, dtype: int64
2020-01-01 10:00:00    2
2020-01-01 11:00:00    3
Name: A, dtype: int64
... #more Series, first the bins for column "A", then "B", then "C" 

However, if you simply iterate over the Resampler object, you get the bins as DataFrames, which seems similar to groupby:

In[]:
for i, d in r:
    print(d)

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
                     A   B   C
2020-01-01 10:00:00  2  12  22
2020-01-01 11:00:00  3  13  23
                     A   B   C
2020-01-01 12:00:00  4  14  24
2020-01-01 13:00:00  5  15  25
                     A   B   C
2020-01-01 14:00:00  6  16  26
2020-01-01 15:00:00  7  17  27
                     A   B   C
2020-01-01 16:00:00  8  18  28
2020-01-01 17:00:00  9  19  29

The printout is the same when iterating over the DataFrameGroupBy object.
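
As a rough check of that observation, the sketch below compares the two objects bin by bin and also compares a plain reduction. It rebuilds the same frame as above; the lowercase "h" alias and the variable names (same_bins, same_sums, n_bins) are my own choices for illustration, not something from the pandas docs or the code above.

```python
import pandas as pd

# Same hourly frame as in the question (lowercase "h" alias is an assumption
# made here so the sketch also runs on newer pandas releases).
dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

g = df.groupby(pd.Grouper(freq="2h"))
r = df.resample("2h")

# Walk both objects in parallel: each yields (bin label, DataFrame for that bin).
same_bins = all(
    g_key == r_key and g_bin.equals(r_bin)
    for (g_key, g_bin), (r_key, r_bin) in zip(g, r)
)

# A built-in reduction should also agree between the two.
same_sums = g.sum().equals(r.sum())

n_bins = len(list(r))
print(same_bins, same_sums, n_bins)
```

If both flags come out true, the difference seen earlier is specific to how apply dispatches the function, not to how the bins themselves are formed.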

My questions, based on the above:

  • Can you access specific columns using resample and apply? I thought I had code where I did this but now I think I am mistaken.
  • Why does the resample apply work on Series for each column for each bin, instead of DataFrames for each bin?
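
For context on the first question, here is a minimal sketch of the column access I know does work on a Resampler: selecting columns with [] before reducing, or passing a dict to agg. It only covers per-column reductions, not a function like foo that combines several columns. The lowercase "h" alias and the result names (a_sums, ab_means, mixed) are assumptions for illustration.

```python
import pandas as pd

# Same hourly frame as in the question.
dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

r = df.resample("2h")

# Select one column, then reduce each 2-hour bin.
a_sums = r["A"].sum()

# Select several columns, then reduce.
ab_means = r[["A", "B"]].mean()

# Different reduction per column via a dict passed to agg.
mixed = r.agg({"A": "sum", "C": "max"})

print(a_sums)
print(ab_means)
print(mixed)
```

Each of these still treats the columns independently, which is consistent with the Series-per-column behaviour shown above.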

Any general comments about what is going on here, or whether this pattern should be encouraged or discouraged, would also be appreciated. Thanks!

Tom
  • Take a look at the docs for [Resampler.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.resample.Resampler.apply.html). This is actually an alias for [Resampler.aggregate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.resample.Resampler.aggregate.html) which is only capable of performing row-wise transformation. What that means is that resample.apply can only aggregate data across rows. So a function like `foo` will not make sense because it does not aggregate data. – cs95 Jul 14 '20 at 19:05
  • @cs95 Gotcha! So you can't `apply` `foo` on a resample object. But you *could* "apply" `foo` by iterating over the object itself (my last code block)? Which seems silly if you could use `groupby` (or just make a better non-loop version) – Tom Jul 15 '20 at 14:30
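
The loop idea from the comment above could look roughly like the sketch below: iterate over the Resampler, call `foo` on each bin's DataFrame, and stitch the pieces together with `pd.concat`. The names `pieces`, `looped`, and `direct` are hypothetical, and the lowercase "h" alias is an assumption. The last line is the non-loop version, which is possible here only because `foo` is purely row-wise and does not depend on the bins.

```python
import pandas as pd

# Same hourly frame as in the question.
dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

def foo(d):
    # Row-wise combination of three columns; returns one value per row.
    return d["A"] - d["B"] + 2 * d["C"]

# Iterating a Resampler yields (bin label, DataFrame), so all columns are visible.
pieces = {bin_start: foo(chunk) for bin_start, chunk in df.resample("2h")}

# Dict keys become the outer index level, giving a (bin, timestamp) MultiIndex.
looped = pd.concat(pieces)

# Non-loop version: foo ignores the bins, so it can run on the whole frame.
direct = foo(df)

print(looped)
print(direct)
```

The looped result carries the bin label as an extra index level, similar to the `g.apply(foo)` output in the question, while the direct version keeps the original hourly index.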
