I need to apply several simple operations (sum, mean, max, min, median, etc.) to a single column. Is there a concise way to write that without repeating myself?

Right now I have to write each one out manually:

df.select(pl.col("a").max(), pl.col("b").mean(), pl.col("b").min())

Whereas in pandas I could pass a list of operations (["max", "min", "mean"]) to agg.

I looked through the polars documentation and searched online but couldn't find anything.

Dean MacGregor
Mark Wang

4 Answers

You can use getattr to turn the name of the method you want (a string) into the actual method, which you can then call. That way you can do this:

df=pl.DataFrame({'a':[1,2,3,4], 'b':[3,4,5,6]})

(
    df
        .select(getattr(pl.col(col), fun)().suffix(f"_{fun}") 
                    for col in ['a','b'] 
                    for fun in ["max", "min", "mean"])
)

shape: (1, 6)
┌───────┬───────┬────────┬───────┬───────┬────────┐
│ a_max ┆ a_min ┆ a_mean ┆ b_max ┆ b_min ┆ b_mean │
│ ---   ┆ ---   ┆ ---    ┆ ---   ┆ ---   ┆ ---    │
│ i64   ┆ i64   ┆ f64    ┆ i64   ┆ i64   ┆ f64    │
╞═══════╪═══════╪════════╪═══════╪═══════╪════════╡
│ 4     ┆ 1     ┆ 2.5    ┆ 6     ┆ 3     ┆ 4.5    │
└───────┴───────┴────────┴───────┴───────┴────────┘

You can take out the for col in ['a','b'] and change the pl.col(col) to pl.all() if you just want all columns.

You can even replicate the pandas dict syntax {'a' : ['sum', 'min'], 'b' : ['min', 'max']} by using a generator with two nested loops:

(
    df
    .select(getattr(pl.col(col), fun)().suffix(f"_{fun}") 
            for col,funL in {'a' : ['sum', 'min'], 'b' : ['min', 'max']}.items() 
            for fun in funL)
)

Lastly, you can wrap all of that up in a function and monkey patch it onto pl.DataFrame as agg, which gives you the pandas-style call you're looking for.

def agg(df, func: str | list | dict) -> pl.DataFrame:
    """Function to replicate pandas agg function, will take either a single string, a list of strings, or a dict mapping columns to functions"""
    if isinstance(func, str):
        func=[func]
    if isinstance(func, list):
        return (
            df
                .select(getattr(pl.all(), fun)().suffix(f"_{fun}") for fun in func)
        )
    elif isinstance(func, dict):
        return (
            df
            .select(getattr(pl.col(col), fun)().suffix(f"_{fun}")
                    for col,funL in func.items()
                    for fun in funL)
        )
pl.DataFrame.agg=agg

Now you can just do

df.agg(['min','max','mean'])
shape: (1, 6)
┌───────┬───────┬───────┬───────┬────────┬────────┐
│ a_min ┆ b_min ┆ a_max ┆ b_max ┆ a_mean ┆ b_mean │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64   ┆ i64   ┆ i64   ┆ f64    ┆ f64    │
╞═══════╪═══════╪═══════╪═══════╪════════╪════════╡
│ 1     ┆ 3     ┆ 4     ┆ 6     ┆ 2.5    ┆ 4.5    │
└───────┴───────┴───────┴───────┴────────┴────────┘

or

df.agg({'a' : ['sum', 'min'], 'b' : ['min', 'max']})
shape: (1, 4)
┌───────┬───────┬───────┬───────┐
│ a_sum ┆ a_min ┆ b_min ┆ b_max │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 10    ┆ 1     ┆ 3     ┆ 6     │
└───────┴───────┴───────┴───────┘
Dean MacGregor

There is .describe() which may be of use:

>>> df.describe()
shape: (9, 3)
┌────────────┬──────────┬──────────┐
│ describe   ┆ a        ┆ b        │
│ ---        ┆ ---      ┆ ---      │
│ str        ┆ f64      ┆ f64      │
╞════════════╪══════════╪══════════╡
│ count      ┆ 4.0      ┆ 4.0      │
│ null_count ┆ 0.0      ┆ 0.0      │
│ mean       ┆ 2.5      ┆ 4.5      │
│ std        ┆ 1.290994 ┆ 1.290994 │
│ min        ┆ 1.0      ┆ 3.0      │
│ max        ┆ 4.0      ┆ 6.0      │
│ median     ┆ 2.5      ┆ 4.5      │
│ 25%        ┆ 2.0      ┆ 4.0      │
│ 75%        ┆ 4.0      ┆ 6.0      │
└────────────┴──────────┴──────────┘

You could also define your own helper function:

pl.Expr.stats = lambda self, *exprs: (
   getattr(self, expr)().suffix(f"_{expr}") for expr in exprs
)
df.select( pl.col("a").stats("max", "min", "mean", "sum") )
shape: (1, 4)
┌───────┬───────┬────────┬───────┐
│ a_max ┆ a_min ┆ a_mean ┆ a_sum │
│ ---   ┆ ---   ┆ ---    ┆ ---   │
│ i64   ┆ i64   ┆ f64    ┆ i64   │
╞═══════╪═══════╪════════╪═══════╡
│ 4     ┆ 1     ┆ 2.5    ┆ 10    │
└───────┴───────┴────────┴───────┘
df.select( pl.all().stats("max", "min", "mean", "sum") )
shape: (1, 8)
┌───────┬───────┬───────┬───────┬────────┬────────┬───────┬───────┐
│ a_max ┆ b_max ┆ a_min ┆ b_min ┆ a_mean ┆ b_mean ┆ a_sum ┆ b_sum │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---    ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   ┆ f64    ┆ f64    ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╪════════╪════════╪═══════╪═══════╡
│ 4     ┆ 6     ┆ 1     ┆ 3     ┆ 2.5    ┆ 4.5    ┆ 10    ┆ 18    │
└───────┴───────┴───────┴───────┴────────┴────────┴───────┴───────┘
jqurious
  • okay I have to admit that the stats approach is pretty cool – Mark Wang Jul 20 '23 at 09:38
  • It's pretty much the same idea @DeanMacGregor and @ritchie46 showed - just expressed a little differently by defining it as your own extra `pl.Expr` function. – jqurious Jul 20 '23 at 10:01
The shortest native way would probably be this:

df.select([pl.all().min().suffix("_min"), pl.all().max().suffix("_max")])
Moritz Wilksch
  • Thanks. It's useful if I want to apply a single operation to multiple columns. I was looking for multiple operations to a single column – Mark Wang Jul 20 '23 at 09:36
You can do some metaprogramming:

df = pl.DataFrame({
    "a": [1, 2],
    "b": [3, 2],
})

df.select([
    eval(f"pl.all().{agg}().suffix('_{agg}')") for agg in ["min", "max", "var"]
])
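
Since `eval` will run whatever string it is handed, it can be worth checking the names against an allow-list and looking the method up with `getattr` instead. Below is a minimal sketch of that idea. To keep it runnable without polars, it uses the standard-library `statistics` module as a stand-in for `pl.all()`; `ALLOWED` and `lookup` are hypothetical names.

```python
import statistics

# Hypothetical allow-list: the only names we agree to look up.
ALLOWED = {"mean", "median", "variance"}


def lookup(target, name):
    """Return target.<name>, refusing anything outside the allow-list."""
    if name not in ALLOWED:
        raise ValueError(f"unsupported aggregation: {name!r}")
    return getattr(target, name)


data = [1, 2]
results = {
    name: lookup(statistics, name)(data)
    for name in ["mean", "median", "variance"]
}
print(results)

# A string that eval would happily execute is rejected up front.
try:
    lookup(statistics, "__import__('os').system('echo hi')")
except ValueError as err:
    print(err)
```

With polars the call would presumably look like `lookup(pl.all(), agg)()`, with the allow-list holding names such as "min", "max" and "var".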
ritchie46