
When grouping a Polars dataframe in Python, how do you concatenate string values from a single column across rows within each group?

For example, given the following DataFrame:

import polars as pl

df = pl.DataFrame(
    {
        "col1": ["a", "b", "a", "b", "c"],
        "col2": ["val1", "val2", "val1", "val3", "val3"]
    }
)

Original df:

shape: (5, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ val1 │
│ b    ┆ val2 │
│ a    ┆ val1 │
│ b    ┆ val3 │
│ c    ┆ val3 │
└──────┴──────┘

I want to run a groupby operation, like:


df.groupby('col1').agg(
    col2_g = pl.col('col2').some_function_like_join(',')
)

The expected output is:

┌──────┬───────────┐
│ col1 ┆ col2_g    │
│ ---  ┆ ---       │
│ str  ┆ str       │
╞══════╪═══════════╡
│ a    ┆ val1,val1 │
│ b    ┆ val2,val3 │
│ c    ┆ val3      │
└──────┴───────────┘

What should I use in place of some_function_like_join?

I have tried the following, and none of them work:

df.groupby('col1').agg(pl.col('col2').arr.concat(','))
df.groupby('col1').agg(pl.col('col2').join(','))
df.groupby('col1').agg(pl.col('col2').arr.join(','))

2 Answers


If you want the values concatenated into a single string with your chosen delimiter, use str.concat inside the agg:

out = df.groupby("col1").agg(
    pl.col("col2").str.concat(",")
)

Result:

shape: (3, 2)
┌──────┬───────────┐
│ col1 ┆ col2      │
│ ---  ┆ ---       │
│ str  ┆ str       │
╞══════╪═══════════╡
│ a    ┆ val1,val1 │
│ b    ┆ val2,val3 │
│ c    ┆ val3      │
└──────┴───────────┘

If you want them collected into a List instead, aggregate the column without applying any further expression:

out = df.groupby("col1").agg(
    pl.col("col2")
)

Result:

shape: (3, 2)
┌──────┬──────────────────┐
│ col1 ┆ col2             │
│ ---  ┆ ---              │
│ str  ┆ list[str]        │
╞══════╪══════════════════╡
│ a    ┆ ["val1", "val1"] │
│ c    ┆ ["val3"]         │
│ b    ┆ ["val2", "val3"] │
└──────┴──────────────────┘
Pep_8_Guardiola

I think the most straightforward way is to call with_columns after the agg. The aggregated column has a List dtype, so its string elements can be joined with arr.join:

df.groupby('col1').agg(pl.col('col2')).with_columns(pl.col('col2').arr.join(','))
Wayoshi