I have some expressions that I will evaluate later, either with or without a window function. This normally works fine: I can have pl.col("x").max() and add .over("y") later, or have pl.arange(0, pl.count()) and add .over("y") later. The one expression this does not work on is pl.count().

If you try to window pl.count(), Polars errors:

import polars as pl

df = pl.DataFrame(dict(x=[1,1,0,0], y=[1,2,3,4]))
expression = pl.count()

df.with_columns([expression.over("x").alias("z")])
# exceptions.ComputeError: Cannot apply a window function, did not find a root column. This is likely due to a syntax error in this expression: count()

Is there a version of count that can handle being windowed? I know that I can do pl.col("x").count().over("x"), but then I have to know ahead of time what columns will exist, and the expressions and the window columns come from completely different parts of my code.

drhagen

3 Answers

Currently, a limitation of window expressions is that they need a root column, that is, a column in the DataFrame's context that starts the expression. pl.count() is an expression that does not refer to any column; it is generic over any column.

We can easily circumvent this limitation with the knowledge that column.len() == count(). So we can just take the first() column in the DataFrame.

df = pl.DataFrame(dict(x=[1,1,0,0], y=[1,2,3,4]))
expression = pl.first().len()

df.with_columns([expression.over("x").alias("z")])
shape: (4, 3)
┌─────┬─────┬─────┐
│ x   ┆ y   ┆ z   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 2   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 0   ┆ 3   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 0   ┆ 4   ┆ 2   │
└─────┴─────┴─────┘
drhagen
ritchie46

I'm not sure whether you need this to be a window, but based on the example you've given, a simple groupby would suffice:

import polars as pl

df = pl.DataFrame(dict(x=[1,1,0,0], y=[1,2,3,4]))
expression = pl.count()

print(df.groupby(pl.col("x")).agg(expression))
# print(df.select([expression]))
# ^this would work on the whole dataframe
Moriarty Snarly

Upgrade to Polars >=0.14. Starting with that release, the code in the original question works without modification.

import polars as pl

df = pl.DataFrame(dict(x=[1,1,0,0], y=[1,2,3,4]))
expression = pl.count()

df.with_columns([expression.over("x").alias("z")])
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ x   ┆ y   ┆ z   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 2   │
# │ 1   ┆ 2   ┆ 2   │
# │ 0   ┆ 3   ┆ 2   │
# │ 0   ┆ 4   ┆ 2   │
# └─────┴─────┴─────┘
drhagen