Given a dataframe, I'd like to be able to aggregate the top 10% of a group, something like:

ans = (df
 .groupby("groupId")
 .agg(pl.col('myValue')
    .sort_by(pl.col('order'))
    .first(0.1).mean()
    .alias('mean_of_top_decile')
    )
)

The code above does not work because .first does not take any parameters, and certainly not a float to represent percentages.

Is there any way to do this in polars?

MYK
  • 1
    Do you have a test dataframe to work with? Absent that, I'd say [`slice`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.slice.html) with an `offset` of `0` and a `length` of `pl.count() // 10` should work. (I was thinking of `top_k` but that seems to only take an int argument, not an Expr) – Wayoshi Apr 04 '23 at 17:45

1 Answer

I built a sample dataframe. I believe the answer you are looking for is the one by @Wayoshi.

Here is what it looks like:

import polars as pl

data = {
    "groupId": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "myValue": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
    "order": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
}

df = pl.DataFrame(data)

(
    df.groupby("groupId")  # spelled group_by in newer Polars releases
    .agg(
        pl.col("myValue")
        .sort_by(pl.col("order"))
        .slice(0, pl.count() // 10)
        .mean()
        .alias("mean_of_top_decile")
    )
)

shape: (1, 2)
┌─────────┬────────────────────┐
│ groupId ┆ mean_of_top_decile │
│ ---     ┆ ---                │
│ i64     ┆ f64                │
╞═════════╪════════════════════╡
│ 1       ┆ 7.5                │
└─────────┴────────────────────┘

All credit to @Wayoshi; I just built the example.
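
One thing worth checking about the `// 10` length is what it does for small groups. The sketch below uses plain Python integers only, no Polars. The helper names `floor_decile_len` and `ceil_decile_len` are hypothetical, made up for this illustration. It shows that floor division yields a length of 0 for any group with fewer than 10 rows, in which case the slice would select nothing and the mean would presumably come out as null. Rounding up is one possible alternative, offered as an assumption about what you might want rather than as the intended behaviour.

```python
def floor_decile_len(n: int) -> int:
    """Length used in the answer above: n // 10 (rounds down)."""
    return n // 10


def ceil_decile_len(n: int) -> int:
    """Hypothetical alternative: round up, so any non-empty group keeps >= 1 row."""
    return (n + 9) // 10


# Compare the two lengths for a few group sizes.
for n in (5, 9, 10, 19, 20, 25):
    print(f"group size {n:>2}: floor={floor_decile_len(n)}, ceil={ceil_decile_len(n)}")
```

For the 20-row sample above both variants give 2, so the result of 7.5 is unaffected. The difference only shows up for group sizes that are not a multiple of 10.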

Luca
  • 1
    Or, as pointed out by BallpointBen in a prior answer, `head` is simpler than `slice(0, ...)`; I had a brain fart there. One thing I think this does not cover is how to handle duplicate values right at the 10% mark. There are some things you can do depending on how you want to deal with that edge case. – Wayoshi Apr 04 '23 at 18:54
  • 1
    @Wayoshi: do you mean `.head(pl.count() // 10)`? That gives an error: `TypeError: argument 'n': 'Expr' object cannot be interpreted as an integer`. I just tried Ritchie's example in the SO post mentioned, and his solution no longer works; it fails with the same error. Not sure if this is intentional. – Luca Apr 04 '23 at 19:08