import polars as pl
import pandas as pd


A = ['a','a','a','a','a','a','a','b','b','b','b','b','b','b']
B = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]


df = pl.DataFrame({'cola':A,
                   'colb':B})


df_pd = df.to_pandas()

index = df_pd.groupby('cola')['colb'].idxmax()
df_pd.loc[index,'top'] = 1

In pandas I can build the 'top' column using idxmax().

In Polars, however, I tried arg_max():

index = df[pl.col('colb').arg_max().over('cola').flatten()]

This does not seem to give me what I want.

Is there any way to generate a 'top' column in Polars?

Thanks a lot!

thunderwu

1 Answer


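In Polars, a window function (an expression followed by .over()) performs an aggregation per group and then joins the result back onto the original rows (see https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.over.html?highlight=over#polars.Expr.over). Every row in a group therefore receives the same aggregated value, so arg_max().over('cola') on its own cannot return a distinct value per row, which is what you are after.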

A way to compute the top column is to use apply:

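# Within each 'cola' group, flag the row(s) where colb equals the group maximum.
# (Newer Polars releases rename these methods to group_by(...).map_groups(...).)
df.groupby("cola").apply(
    lambda x: x.with_columns([(pl.col("colb") == pl.col("colb").max()).alias("top")])
)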
jvz