Apply operations from a list of strings to one (or more) column(s) in polars

Question

I need to apply multiple simple operations (sum/mean/max/min/median etc.) to a single column. Is there a way to write that concisely without repeating myself?

Right now I have to write all of these out manually:

df.select(pl.col("a").max(), pl.col("b").mean(), pl.col("b").min())

Whereas in pandas I could pass a list of operations (["max", "min", "mean"]) to agg.

I looked through the polars documentation and searched online but couldn't find anything.

Does this answer your question? [Use list comprehension for expression functions](https://stackoverflow.com/questions/75453175/use-list-comprehension-for-expression-functions) — Dean MacGregor, Jul 19 '23 at 11:48

Dean MacGregor · Accepted Answer · 2023-07-19T16:32:16.407

You can use getattr to look up an expression method by its name as a string and then call it. That way you can do this:

import polars as pl

df = pl.DataFrame({'a': [1, 2, 3, 4], 'b': [3, 4, 5, 6]})

(
    df
        .select(getattr(pl.col(col), fun)().suffix(f"_{fun}") 
                    for col in ['a','b'] 
                    for fun in ["max", "min", "mean"])
)

shape: (1, 6)
┌───────┬───────┬────────┬───────┬───────┬────────┐
│ a_max ┆ a_min ┆ a_mean ┆ b_max ┆ b_min ┆ b_mean │
│ ---   ┆ ---   ┆ ---    ┆ ---   ┆ ---   ┆ ---    │
│ i64   ┆ i64   ┆ f64    ┆ i64   ┆ i64   ┆ f64    │
╞═══════╪═══════╪════════╪═══════╪═══════╪════════╡
│ 4     ┆ 1     ┆ 2.5    ┆ 6     ┆ 3     ┆ 4.5    │
└───────┴───────┴────────┴───────┴───────┴────────┘

You can take out the for col in ['a','b'] and change the pl.col(col) to pl.all() if you just want all columns.

You can even replicate the pandas dict syntax {'a' : ['sum', 'min'], 'b' : ['min', 'max']} by using a doubly nested generator:

(
    df
    .select(getattr(pl.col(col), fun)().suffix(f"_{fun}") 
            for col,funL in {'a' : ['sum', 'min'], 'b' : ['min', 'max']}.items() 
            for fun in funL)
)

Lastly, you can wrap that all up into a function and monkey patch it to pl.DataFrame.agg so you have the direct functionality that you're looking for.

def agg(df, func: str | list | dict) -> pl.DataFrame:
    """Function to replicate pandas agg function, will take either a single string, a list of strings, or a dict mapping columns to functions"""
    if isinstance(func, str):
        func=[func]
    if isinstance(func, list):
        return (
            df
                .select(getattr(pl.all(), fun)().suffix(f"_{fun}") for fun in func)
        )
    elif isinstance(func, dict):
        return (
            df
            .select(getattr(pl.col(col), fun)().suffix(f"_{fun}")
                    for col,funL in func.items()
                    for fun in funL)
        )
pl.DataFrame.agg=agg

Now you can just do

df.agg(['min','max','mean'])
shape: (1, 6)
┌───────┬───────┬───────┬───────┬────────┬────────┐
│ a_min ┆ b_min ┆ a_max ┆ b_max ┆ a_mean ┆ b_mean │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64   ┆ i64   ┆ i64   ┆ f64    ┆ f64    │
╞═══════╪═══════╪═══════╪═══════╪════════╪════════╡
│ 1     ┆ 3     ┆ 4     ┆ 6     ┆ 2.5    ┆ 4.5    │
└───────┴───────┴───────┴───────┴────────┴────────┘

or

df.agg({'a' : ['sum', 'min'], 'b' : ['min', 'max']})
shape: (1, 4)
┌───────┬───────┬───────┬───────┐
│ a_sum ┆ a_min ┆ b_min ┆ b_max │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 10    ┆ 1     ┆ 3     ┆ 6     │
└───────┴───────┴───────┴───────┘

score 3 · Answer 2 · answered Jul 19 '23 at 20:24

There is .describe() which may be of use:

>>> df.describe()
shape: (9, 3)
┌────────────┬──────────┬──────────┐
│ describe   ┆ a        ┆ b        │
│ ---        ┆ ---      ┆ ---      │
│ str        ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╡
│ count      ┆ 4.0      ┆ 4.0      │
│ null_count ┆ 0.0      ┆ 0.0      │
│ mean       ┆ 2.5      ┆ 4.5      │
│ std        ┆ 1.290994 ┆ 1.290994 │
│ min        ┆ 1.0      ┆ 3.0      │
│ max        ┆ 4.0      ┆ 6.0      │
│ median     ┆ 2.5      ┆ 4.5      │
│ 25%        ┆ 2.0      ┆ 4.0      │
│ 75%        ┆ 4.0      ┆ 6.0      │
└────────────┴──────────┴──────────┘

You could also define your own helper function:

pl.Expr.stats = lambda self, *exprs: (
   getattr(self, expr)().suffix(f"_{expr}") for expr in exprs
)

df.select( pl.col("a").stats("max", "min", "mean", "sum") )

shape: (1, 4)
┌───────┬───────┬────────┬───────┐
│ a_max ┆ a_min ┆ a_mean ┆ a_sum │
│ ---   ┆ ---   ┆ ---    ┆ ---   │
│ i64   ┆ i64   ┆ f64    ┆ i64   │
╞═══════╪═══════╪════════╪═══════╡
│ 4     ┆ 1     ┆ 2.5    ┆ 10    │
└───────┴───────┴────────┴───────┘

df.select( pl.all().stats("max", "min", "mean", "sum") )

shape: (1, 8)
┌───────┬───────┬───────┬───────┬────────┬────────┬───────┬───────┐
│ a_max ┆ b_max ┆ a_min ┆ b_min ┆ a_mean ┆ b_mean ┆ a_sum ┆ b_sum │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   ┆ f64    ┆ f64    ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╪════════╪════════╪═══════╪═══════╡
│ 4     ┆ 6     ┆ 1     ┆ 3     ┆ 2.5    ┆ 4.5    ┆ 10    ┆ 18    │
└───────┴───────┴───────┴───────┴────────┴────────┴───────┴───────┘

It's pretty much the same idea @DeanMacGregor and @ritchie46 showed - just expressed a little differently by defining it as your own extra `pl.Expr` function. — jqurious, Jul 20 '23 at 10:01

score 2 · Answer 3 · answered Jul 19 '23 at 16:50

The shortest native way would probably be this:

df.select([pl.all().min().suffix("_min"), pl.all().max().suffix("_max")])

Moritz Wilksch

Thanks. It's useful if I want to apply a single operation to multiple columns. I was looking for multiple operations to a single column – Mark Wang Jul 20 '23 at 09:36

score 2 · Answer 4 · answered Jul 20 '23 at 08:10

You can do some metaprogramming:

df = pl.DataFrame({
    "a": [1, 2],
    "b": [3, 2],
})

df.select([
    eval(f"pl.all().{agg}().suffix('_{agg}')") for agg in ["min", "max", "var"]
])

ritchie46

The approach mentioned by @jqurious is pretty interesting – Mark Wang Jul 20 '23 at 09:38
