
So when I iterate through a pandas groupby(), what I get back is a tuple. That was important because I could do [x for x in df_pandas.sort_values('date').groupby('grouping_column')] and then sort this list of tuples on x[0].

In pandas, the output is also automatically sorted by the group keys after a groupby.
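
For reference, the pattern I mean looks roughly like this (just a sketch, the data is made up):

import pandas as pd

# made-up example data
df_pandas = pd.DataFrame({
    'grouping_column': ['b', 'a', 'b', 'a'],
    'date': pd.to_datetime(['2020-02-01', '2020-01-01', '2020-01-01', '2020-02-01']),
})

# iterating the groupby yields (key, sub_frame) tuples, already ordered by the key
groups = [x for x in df_pandas.sort_values('date').groupby('grouping_column')]
groups.sort(key=lambda x: x[0])  # explicit sort on the group key

for key, sub_df in groups:
    print(key, sub_df)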

I relied on that to get consistent output in Plotly (an area chart).

Now with Polars I can't do the same: iterating the groupby just gives me the DataFrame back, without the key. Is there any way to accomplish the same thing?

I tried adding a sort([pl.col('date'), pl.col('grouping_column')]), but it had no effect.

What I have in mind for Polars is this:

# one sub-DataFrame per group value, always in the same order
for value in df['grouping_column'].unique().to_numpy():
    sub_df = df.filter(pl.col('grouping_column') == value)
    ...

This does in fact give the desired result, because it always iterates through the same sequence, whereas the order of the groupby output seems more or less random and doesn't appear to be guaranteed at all.

My problem is that this second solution doesn't seem very efficient.

The other thing I could do is:

[(sub_df['some_col'].to_numpy()[0], sub_df) for sub_df in df.groupby('some_col')]

Then use Python's sort to order the list by the key in the tuple (x[0]) and iterate over the list again. However, this solution seems pretty ugly as well.
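
Spelled out, that would be something like this (a rough sketch; in the Polars version I'm on, iterating the groupby only yields the sub-DataFrames, without the key):

# build (key, sub_df) pairs from the unordered groupby output
pairs = [(sub_df['some_col'].to_numpy()[0], sub_df) for sub_df in df.groupby('some_col')]

# restore a deterministic order by sorting on the key
pairs.sort(key=lambda x: x[0])

for key, sub_df in pairs:
    ...  # e.g. one area trace per group, always in the same order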

supersick

1 Answer


You can use the partition_by method to create a dictionary of key-value pairs, where the keys are the values of your grouping_column and the values are the corresponding DataFrames.

For example, let's say we have this data:

import polars as pl
from datetime import datetime

df = pl.DataFrame({"grouping_column": [1, 2, 3], }).join(
    pl.DataFrame(
        {
            "date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 3, 1), "1mo"),
        }
    ),
    how="cross",
)
df
shape: (9, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 1               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-03-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...             ┆ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-03-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘

We can split the DataFrame into a dictionary.

df.partition_by(by='grouping_column', maintain_order=True, as_dict=True)
{1: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 1               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘,
 2: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 2               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘, 
3: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 3               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘}

From there, you can create the tuples using the items method of the Python dictionary.

for x in df.partition_by(by='grouping_column', maintain_order=True, as_dict=True).items():
    print("next item")
    print(x)
next item
(1, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 1               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
next item
(2, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 2               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
next item
(3, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date                │
│ ---             ┆ ---                 │
│ i64             ┆ datetime[ns]        │
╞═════════════════╪═════════════════════╡
│ 3               ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3               ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
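
Since maintain_order=True keeps the groups in the order in which they first appear in the DataFrame, you can get them in sorted key order by simply sorting the DataFrame on the grouping column first. A minimal sketch (column names as above):

# sort first so the partitions come out in a fixed, sorted order
partitions = df.sort(['grouping_column', 'date']).partition_by(
    'grouping_column', maintain_order=True, as_dict=True
)

for key, sub_df in partitions.items():
    ...  # e.g. add one Plotly area trace per group, always in the same order
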
Ilya
  • I added ```df_sorted_group = sorted((x for x in df.partition_by(groups=group_by_pick, maintain_order=True, as_dict=True).items()), key=lambda x: x[0])``` and then ```for group, sub_df in df_sorted_group:```. I did that because I wanted to keep the order of a sort on another column before running this part of the code. – supersick Jun 24 '22 at 17:43
  • The as_dict is also a little bit misleading since it's a tuple and not a dict – supersick Jun 24 '22 at 18:08
  • It's a dictionary. `type(df.partition_by(groups='grouping_column', maintain_order=True, as_dict=True))` returns `<class 'dict'>` –  Jun 24 '22 at 18:17