
I have a pandas data frame with a DateTime index. When grouping it with a time grouper, pd.Grouper(freq='360Min'), how can I join the result back onto the original timestamps? An equijoin on timestamp = bucket will not work, since the original timestamps rarely fall exactly on bucket boundaries. Is there a convenience function for this? Should an asof join be used? Or would I manually have to extract the hours and then try to match them up?

Example:

For a source data frame of

import pandas as pd
df = pd.DataFrame(
    {
        "Publish date": [
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-09"),
            pd.Timestamp("2000-01-16"),
        ],
        "ID": [0, 1, 2, 3],
        "Price": [10, 20, 30, 40],
    }
)

Which gives:

  Publish date  ID  Price
0   2000-01-02   0     10
1   2000-01-02   1     20
2   2000-01-09   2     30
3   2000-01-16   3     40

I want to perform an aggregation with an arbitrary frequency (not only month, day, or hour), let's say 1 month.

agg_result = df.groupby(pd.Grouper(key="Publish date", freq="1M")).agg([pd.Series.mean, pd.Series.median]).reset_index()
agg_result.columns = ['Publish date month', 'ID_mean', 'ID_median', 'Price_mean', 'Price_median']
print(agg_result)
Publish date month  ID_mean  ID_median  Price_mean  Price_median
0         2000-01-31      1.5        1.5          25            25

How can I make the equijoin work again, i.e. transform each original timestamp into its corresponding bucket using the same arbitrary frequency?

Put in terms of the example code: how can I get

df['Publish date month'] = df['Publish date'].apply(magic_transform_to_bucket)  # pseudocode
df.merge(agg_result, on=['Publish date month'])

to work, i.e. how do I define the transformation into the right bucket?
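
For illustration, one candidate transform (a sketch, assuming the frequency is either fixed-width like '360min' or a calendar month):

# Fixed-width frequencies: floor each timestamp to the start of its bucket.
df['bucket'] = df['Publish date'].dt.floor('360min')

# Calendar frequencies such as '1M' are not fixed-width, so dt.floor raises;
# going through Period reproduces the grouper's month-end labels instead.
df['Publish date month'] = (
    df['Publish date']
    .dt.to_period('M')
    .dt.to_timestamp(how='end')  # end of month, 23:59:59.999...
    .dt.normalize()              # drop the time -> 2000-01-31 00:00:00
)
df.merge(agg_result, on=['Publish date month'])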

Georg Heiler

2 Answers


EDIT:

The easiest way to identify the corresponding original values for each group should be:

gb = df.groupby(pd.Grouper(key="Publish date", freq="1M"))
dict(list(gb['Publish date']))

You can then use this to join any information back to the original table.
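
For example (a quick sketch; bucket_map is just an illustrative name):

# Invert the {bucket: member timestamps} mapping into timestamp -> bucket.
groups = dict(list(gb['Publish date']))
bucket_map = {ts: bucket for bucket, members in groups.items() for ts in members}

# Attach the bucket as a join key and merge the aggregates back on.
df['Publish date month'] = df['Publish date'].map(bucket_map)
df.merge(agg_result, on='Publish date month')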


Can you just join on two intermediate columns?

df['Publish date'].dt.month

and

df.groupby(pd.Grouper(key="Publish date", freq="1M")).agg([pd.Series.mean, pd.Series.median]).index.month

like this

results = df.groupby(pd.Grouper(key="Publish date", freq="1M")).agg([pd.Series.mean, pd.Series.median])

# Flatten the (column, aggregation) MultiIndex into single strings, e.g. 'mean-ID'.
results.columns = ['-'.join(col[::-1]).strip() for col in results.columns]

# The month number serves as the join key on both sides.
df['month'] = df['Publish date'].dt.month
results['month'] = results.index.month

results.merge(df)
Adam Zeldin
  • But this will only work in the case of month, not when supplying e.g. 4 hours as the frequency. – Georg Heiler Sep 10 '20 at 22:27
  • Ah, got it; that makes sense. I suppose you could still look at the groups df.groupby(pd.Grouper(key="Publish date", freq="1M")).groups and then look up the included dates/times, e.g. df.groupby(pd.Grouper(key="Publish date", freq="1M")).get_group(pd.Timestamp('2000-01-31 00:00:00', freq='M')), to build a join key. I'll follow this to see if someone has a better way to do it. – Adam Zeldin Sep 10 '20 at 22:50
  • @GeorgHeiler please see my edit for a more straightforward approach. – Adam Zeldin Sep 11 '20 at 13:48
  • Also good. Both work on small data, and both failed on the real data ;) but in principle they solve the problem. As this one uses fewer JOINs, I guess it might be preferred, and I will mark it as accepted. – Georg Heiler Sep 11 '20 at 14:50
  • @GeorgHeiler thank you. What's the error you're facing? Is it during the groupby agg function or during the join? If it's the join, you could try creating a custom groupby function to return the groups with the aggregate information (e.g. median, mean) along with the group itself. If you provide a one-liner to create a randomized dataset built from numpy that still gives you the error, I'd be delighted to take a look. – Adam Zeldin Sep 11 '20 at 15:44
  • During the resolution of the keys from the hashmap. – Georg Heiler Sep 11 '20 at 16:49

I would use the GroupBy.transform method:

import pandas as pd
df = pd.DataFrame(
    {
        "Publish date": [
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-09"),
            pd.Timestamp("2000-01-16"),
        ],
        "ID": [0, 1, 2, 3],
        "Price": [10, 20, 30, 40],
    }
)

# Transform broadcasts each group's aggregate back onto the original rows,
# so the result aligns with df's index and no explicit join key is needed.
g = df.groupby(pd.Grouper(key="Publish date", freq="1M"))

(
    df.join(g.transform('mean'), rsuffix='_mean')
      .join(g.transform('median'), rsuffix='_median')
)

And that returns:

  Publish date  ID  Price  ID_mean  Price_mean  ID_median  Price_median
0   2000-01-02   0     10      1.5          25        1.5            25
1   2000-01-02   1     20      1.5          25        1.5            25
2   2000-01-09   2     30      1.5          25        1.5            25
3   2000-01-16   3     40      1.5          25        1.5            25

You can also use pandas.concat in place of DataFrame.join:

methods = ['mean', 'median', 'std', 'min', 'max']

# One transformed frame per method, suffixed and concatenated column-wise.
pd.concat([
    df, *[g.transform(m).add_suffix(f'_{m}') for m in methods]
], axis='columns')

And that gives you:

  Publish date  ID  Price  ID_mean  Price_mean  ID_median  Price_median    ID_std  Price_std  ID_min  Price_min  ID_max  Price_max
0   2000-01-02   0     10      1.5          25        1.5            25  1.290994  12.909944       0         10       3         40
1   2000-01-02   1     20      1.5          25        1.5            25  1.290994  12.909944       0         10       3         40
2   2000-01-09   2     30      1.5          25        1.5            25  1.290994  12.909944       0         10       3         40
3   2000-01-16   3     40      1.5          25        1.5            25  1.290994  12.909944       0         10       3         40
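
The same pattern works for any frequency the Grouper accepts, e.g. the 360-minute buckets from the question (a sketch along the same lines):

# Only the freq string changes; transform still aligns with df's index,
# so no explicit join key is ever built.
g6h = df.groupby(pd.Grouper(key="Publish date", freq="360min"))
df.join(g6h.transform('mean'), rsuffix='_mean')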
Paul H
  • Nice, but if I wanted to keep the list flexible (so it can be passed as a function argument), would it be possible to dynamically build the `df.join.transform(xxxx)` calls? – Georg Heiler Sep 10 '20 at 22:53
  • @GeorgHeiler yeah, you can use `concat` with list comprehensions. See the edit. – Paul H Sep 10 '20 at 22:58
  • Neat. But I guess the approach from the comments on the other answer, `df.groupby(pd.Grouper(key="Publish date", freq="1M")).get_group(pd.Timestamp('2000-01-31 00:00:00', freq='M'))`, is better, as yours repeatedly requires a JOIN (which might be slow) and that one only uses a single apply UDF and then a single JOIN. Is this correct? – Georg Heiler Sep 10 '20 at 23:04
  • @GeorgHeiler I have no idea. You should profile the possible answers. – Paul H Sep 10 '20 at 23:13