I have the following code:

import polars as pl

df = pl.DataFrame(
{
    "grpbyKey": [1, 1, 1, 2, 2, 2],
    "val": ["One"] * 3 + ["Two"] * 3
}
)
df2 = pl.DataFrame(
{
    "grpbyKey": [1, 1, 2, 2, 2, 3],
    "val2": ["One"] * 2 + ["Two"] * 3 + ["Three"]
}
)
c = df.lazy().with_context(df2.lazy())
result = c.groupby("grpbyKey").agg([pl.all()]).collect()

print(result)

It gives the following result:

shape: (2, 3)
┌──────────┬───────────────────────┬─────────────────────────┐
│ grpbyKey ┆ val                   ┆ val2                    │
│ ---      ┆ ---                   ┆ ---                     │
│ i64      ┆ list[str]             ┆ list[str]               │
╞══════════╪═══════════════════════╪═════════════════════════╡
│ 1        ┆ ["One", "One", "One"] ┆ ["One", "One", "Two"]   │
│ 2        ┆ ["Two", "Two", "Two"] ┆ ["Two", "Two", "Three"] │
└──────────┴───────────────────────┴─────────────────────────┘

I was hoping to see

shape: (2, 3)
┌──────────┬───────────────────────┬─────────────────────────┐
│ grpbyKey ┆ val                   ┆ val2                    │
│ ---      ┆ ---                   ┆ ---                     │
│ i64      ┆ list[str]             ┆ list[str]               │
╞══════════╪═══════════════════════╪═════════════════════════╡
│ 1        ┆ ["One", "One", "One"] ┆ ["One", "One"]          │
│ 2        ┆ ["Two", "Two", "Two"] ┆ ["Two", "Two", "Two"]   │
└──────────┴───────────────────────┴─────────────────────────┘

That is, I want both dataframes grouped by the key at the same time, each using its own key column. Within the groupby I intend to run a custom function on the two grouped columns.

Is there a way to get the groupby to give me both frames grouped?

I would like to use the polars API since I intend to implement this in Rust eventually, so no Python hacks please.

The Unfun Cat
  • Is that not just both groupby's joined? `df.groupby("grpbyKey").all().join(df2.groupby("grpbyKey").all(), on="grpbyKey")` – jqurious Jun 09 '23 at 14:33
  • Coming from pandas, I had no idea you could join group-bys. *mind blown* Thanks! Each groupby dataframe can contain tens of millions of entries, is getting them as lists possibly bad for performance? – The Unfun Cat Jun 10 '23 at 15:03
  • Also, you should make a reply with some more context and I'll accept your answer – The Unfun Cat Jun 10 '23 at 15:04
  • Ah okay. I thought perhaps I was missing something which is why I asked. Technically it's joining dataframes as you're calling `.all()`, so "joining the groupbys" was probably badly phrased on my part. As for lists and performance, depending on the overall goal and what you're doing with the result it's possible there could be an alternative approach - it's hard to say from the current information. – jqurious Jun 10 '23 at 15:13
  • Here is another question where I ask about how to use the lists in the different columns in the same operation: https://stackoverflow.com/questions/76447151/use-multiple-columns-in-list-expression – The Unfun Cat Jun 10 '23 at 16:39
  • For a flat result you could concat them: `pl.concat([df, df2], how='diagonal')` - an "anti join" can be used to detect non-matches (`3` in this case) which could then be removed: `df2.join(df, how='anti', on='grpbyKey')`. Not sure if that kind of shape is useful to you. Is there a particular end result in mind? Will you be using the `.search_sorted` result in some way? – jqurious Jun 10 '23 at 17:49

1 Answer

For a bit of extra context on jqurious's comment: `with_context` works here as a temporary horizontal concatenation, e.g.

print(df.lazy().with_context(df2.lazy()).select(pl.all()).collect())
┌──────────┬─────┬───────┐
│ grpbyKey ┆ val ┆ val2  │
│ ---      ┆ --- ┆ ---   │
│ i64      ┆ str ┆ str   │
╞══════════╪═════╪═══════╡
│ 1        ┆ One ┆ One   │
│ 1        ┆ One ┆ One   │
│ 1        ┆ One ┆ Two   │
│ 2        ┆ Two ┆ Two   │
│ 2        ┆ Two ┆ Two   │
│ 2        ┆ Two ┆ Three │
└──────────┴─────┴───────┘

Notice that there are two "grpbyKey" columns to choose from, and it takes the one from the base dataframe. This explains why your groupby is wrong: the rows of "val2" are grouped by the key column of df rather than that of df2, so the third row of df2 ("Two") lands in group 1 and the last row ("Three") lands in group 2. The correct approach, as jqurious suggested in the comments, is to group each frame separately and then join the results on the key. In my experience `with_context` is rarely the right tool for standard operations. A straight join behaves better when you need to aggregate information across frames.

mishpat
  • Since you're making an answer you should give the full answer in addition to the context. It's nice to give attribution to the comment but the answer should stand on its own, not just refer to another place with the answer. – Dean MacGregor Jun 09 '23 at 16:12
  • I wasn't sure if I was interpreting the question correctly which is why I asked via a comment instead of posting an answer. Feel free to add the code from the comment to your answer, or include your own approach. – jqurious Jun 10 '23 at 15:17