How to replicate df.groupby('some_column').resample('Q').agg('total':'count') in polars with groupby_dynamic

Question

Given i want to group quarterly, then in pandas i would write (given that the index contains the 'date' column)

df = df.groupby('some_column').resample('Q').agg({"total": "count"})

I tried to replicate that with polars:

df = df.groupby_dynamic('date', every='3mo', by='some_column', include_boundaries=True).agg(pl.count())

In pandas i get the end of the quarter back. In polars the lower boundary because truncate is default set to True. How would i get the _upper_boundary back as my column 'date' ?

Obviously i could just run a selecte after setting include_boundaries=True but i actually want to set it to false because the docs say it will impact performance if set to true

Is every='3mo' equal to pandas resample('Q') ?

Examples:

df = pd.DataFrame({'date': pd.date_range("2018-01-01", periods=365*24, freq='H'})
# Create second column 'pokemon_type':
df['pokemon_type'] = 'fire'
df.loc[400:, 'pokemon_type'] = 'water'

                    date pokemon_type
0    2018-01-01 00:00:00         fire
1    2018-01-01 01:00:00         fire
2    2018-01-01 02:00:00         fire
3    2018-01-01 03:00:00         fire
4    2018-01-01 04:00:00         fire
                  ...          ...
8755 2018-12-31 19:00:00        water
8756 2018-12-31 20:00:00        water
8757 2018-12-31 21:00:00        water
8758 2018-12-31 22:00:00        water
8759 2018-12-31 23:00:00        water
[8760 rows x 2 columns]

Now groupby:
df['total'] = 0
df = df.set_index('date') 
df = df.groupby('pokemon_type').resample('Q').agg({'total':'count'})

df
pokemon_type date             
fire         2018-03-31    400
water        2018-03-31   1760
             2018-06-30   2184
             2018-09-30   2208
             2018-12-31   2208
df.index
MultiIndex([( 'fire', '2018-03-31'),
            ('water', '2018-03-31'),
            ('water', '2018-06-30'),
            ('water', '2018-09-30'),
            ('water', '2018-12-31')],
           names=['pokemon_type', 'date'])

For Polars: Sadly i don't know how to slice a column and assign sub values. Tried to figure it out but i am still on it.

The polars df will look like this

     pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date ┆ count │


│ fire      ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 58    │

│ water       ┆ 1994-01-01 00:00:00 ┆ 1994-04-01 00:00:00 ┆ 1994-01-01 ┆ 38    │

... What i need is the _upper_boundary directly without having performance issues.

I tried to do so and updated my first comment. I'll try to update it again tomorrow if it's not enough. @ritchie46 — supersick, Jun 22 '22 at 16:44

score 2 · Accepted Answer · answered Jun 23 '22 at 05:40

First, let's recreate your data (including the slice) using Polars. We'll take this in steps.

Calculating Dates

First, we'll use the date_range function in Polars to create our dates.

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    'date': pl.date_range(datetime(2018, 11, 28), datetime(2019, 11, 28), '1h', closed='left'),
})
df

shape: (8760, 1)
┌─────────────────────┐
│ date                │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2018-11-28 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 01:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 02:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 03:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 20:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 21:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 22:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 23:00:00 │
└─────────────────────┘

Slicing

Next, we'll use the arange expression, along with a when/then/otherwise expression to slice our data and assign the pokemon_type.

df = df.with_columns([
    pl.when(pl.arange(0, pl.count()) < 400)
    .then('fire')
    .otherwise('water')
    .alias('pokemon_type')
])
df

shape: (8760, 2)
┌─────────────────────┬──────────────┐
│ date                ┆ pokemon_type │
│ ---                 ┆ ---          │
│ datetime[ns]        ┆ str          │
╞═════════════════════╪══════════════╡
│ 2018-11-28 00:00:00 ┆ fire         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 01:00:00 ┆ fire         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 02:00:00 ┆ fire         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2018-11-28 03:00:00 ┆ fire         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 20:00:00 ┆ water        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 21:00:00 ┆ water        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 22:00:00 ┆ water        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-11-27 23:00:00 ┆ water        │
└─────────────────────┴──────────────┘

Calculating quarter-end dates

To obtain our end-quarter dates, we can use the when/then/otherwise and the datetime expression to create them from the date field.

Note: I'll set include_bondaries=True, just to compare our calculated end-quarter date to the _upper_boundary date.

(
    df
    .groupby_dynamic("date", every="3mo", by="pokemon_type", include_boundaries=True)
    .agg(pl.count())
    .with_column(
        pl.when(pl.col("date").dt.month() < 4)
        .then(pl.datetime(pl.col('date').dt.year(), 4, 1))
        .when(pl.col("date").dt.month() < 7)
        .then(pl.datetime(pl.col("date").dt.year(), 7, 1))
        .when(pl.col("date").dt.month() < 10)
        .then(pl.datetime(pl.col("date").dt.year(), 10, 1))
        .otherwise(pl.datetime(pl.col("date").dt.year() + 1, 1, 1))
        .alias("quarter")
    )
)

shape: (6, 6)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬─────────────────────┐
│ pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ count ┆ quarter             │
│ ---          ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   ┆ ---                 │
│ str          ┆ datetime[ns]        ┆ datetime[ns]        ┆ datetime[ns]        ┆ u32   ┆ datetime[ms]        │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╪═════════════════════╡
│ fire         ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400   ┆ 2019-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 417   ┆ 2019-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160  ┆ 2019-04-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184  ┆ 2019-07-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208  ┆ 2019-10-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1391  ┆ 2020-01-01 00:00:00 │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴─────────────────────┘

Likewise, if you wanted to use closed="left", you could also express your quarter-ending dates like this:

(
    df
    .groupby_dynamic("date", every="3mo", by="pokemon_type", closed='left', include_boundaries=True)
    .agg(pl.count())
    .with_column(
        pl.when(pl.col("date").dt.month() < 4)
        .then(pl.datetime(pl.col('date').dt.year(), 3, 31))
        .when(pl.col("date").dt.month() < 7)
        .then(pl.datetime(pl.col("date").dt.year(), 6, 30))
        .when(pl.col("date").dt.month() < 10)
        .then(pl.datetime(pl.col("date").dt.year(), 9, 30))
        .otherwise(pl.datetime(pl.col("date").dt.year(), 12, 31))
        .alias("quarter")
    )
)

shape: (6, 6)
┌──────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬─────────────────────┐
│ pokemon_type ┆ _lower_boundary     ┆ _upper_boundary     ┆ date                ┆ count ┆ quarter             │
│ ---          ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   ┆ ---                 │
│ str          ┆ datetime[ns]        ┆ datetime[ns]        ┆ datetime[ns]        ┆ u32   ┆ datetime[ms]        │
╞══════════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╪═════════════════════╡
│ fire         ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 400   ┆ 2018-12-31 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2018-10-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2018-10-01 00:00:00 ┆ 416   ┆ 2018-12-31 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-01-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2019-01-01 00:00:00 ┆ 2160  ┆ 2019-03-31 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-04-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2019-04-01 00:00:00 ┆ 2184  ┆ 2019-06-30 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-07-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 2019-07-01 00:00:00 ┆ 2208  ┆ 2019-09-30 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ water        ┆ 2019-10-01 00:00:00 ┆ 2020-01-01 00:00:00 ┆ 2019-10-01 00:00:00 ┆ 1392  ┆ 2019-12-31 00:00:00 │
└──────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴─────────────────────┘

It's a nice solution but the it's hard to determine for closed 'left' where the ending is. June could be 30 or 31 days and so on. — supersick, Jun 25 '22 at 10:53

datenzauber.ai · Answer 2 · 2022-06-25T09:56:39.680

I see mainly two differences comparing polars and pandas here:

polars uses the start of the quarter (as you already pointed out)
polars won't fill in missing quarters

I'm not replicating your exact example to make this a little shorter:

Pandas

pd_df = pd.DataFrame(data={'value': 0}, index=pd.date_range(start='1/1/2022', periods=12, freq='M'))
pd_df = pd_df[(pd_df.index >= '2022-02-01') & (pd_df.index.quarter != 3)]
pd_df.resample('Q').count()

┌────────────┬───────┐
│            ┆ value │
╞════════════╪═══════╡
│ 2022-03-31 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-06-30 ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-09-30 ┆ 0     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-12-31 ┆ 3     │
└────────────┴───────┘

Polars

Polars Default

pl_df = pl.DataFrame(pd_df.reset_index(), columns=[('index', pl.Date), ('value', pl.Int32)])
pl_df.groupby_dynamic('index', every='3mo').agg(pl.count())

┌────────────┬───────┐
│ index      ┆ value │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-04-01 ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-10-01 ┆ 3     │
└────────────┴───────┘

Polars with Quarter End

This simply adds 3 months and subtracts a day to get the quarter end. Would be nice to have a native way in polars to do this without apply, because this can be slow for lots of data.

from dateutil.relativedelta import relativedelta
(pl_df
    .groupby_dynamic('index', every='3mo')
    .agg(pl.count())
    .with_column(pl.col('index').apply(lambda x: x + relativedelta(months=3) - relativedelta(days=1)))
)

┌────────────┬───────┐
│ index      ┆ value │
╞════════════╪═══════╡
│ 2022-03-31 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-06-30 ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-12-31 ┆ 3     │
└────────────┴───────┘

Polars filled with missing quarters

This adds the missing quarters and fills with 0.

result = (pl_df
    .groupby_dynamic('index', every='3mo')
    .agg(pl.count())
)
# casting because date_range only accepts datetime values
quarters = result['index'].cast(pl.Datetime)
complete_quarters = pl.date_range(quarters.min(), quarters.max(), '3mo', name='index').cast(pl.Date)

# filling missing quarters and shifting to quarter end
(result
  .join(pl.DataFrame(complete_quarters), on='index', how='outer')
  .with_columns([
      pl.col('count').fill_null('zero'),
      pl.col('index').apply(lambda x: x + relativedelta(months=3) - relativedelta(days=1))
  ])
)

┌────────────┬───────┐
│ index      ┆ value │
╞════════════╪═══════╡
│ 2022-03-31 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-06-30 ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-09-30 ┆ 0     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-12-31 ┆ 3     │
└────────────┴───────┘