How to get the first n% of a group in polars?

Question

Q1: In polars-rust, when you do .groupby().agg(), you can use .head(10) to get the first 10 elements of a column. But if the groups have different lengths and I need the first 20% of the elements in each group (e.g. the first 24 elements of a 120-element group), how can I make that work?
Q2: With a sample dataframe like the one below, my goal is to loop over the dataframe. Because polars is column-major, I downcast df into several ChunkedArrays and iterated via iter().zip(). I found this is faster than doing the same thing after groupby(col("date")), which loops over list elements. Why is that? I expected the opposite, since the df is shorter after the groupby, which means a shorter loop.

Date	Stock	Price
2010-01-01	IBM	1000
2010-01-02	IBM	1001
2010-01-03	IBM	1002
2010-01-01	AAPL	2900
2010-01-02	AAPL	2901
2010-01-03	AAPL	2902

score 3 · Answer 1 · answered Feb 27 '22 at 18:59

I don't really understand your 2nd question. Maybe you can create another question with a small example.

I will answer the 1st question:

we can use head(10) to get the first 10 elements in a col. But if the groups have different length and I need to get first 20% elements in each group like 0-24 elements in a 120 elements group. how to make it work?

We can use expressions to take a head(n) where n = 0.2 * group_size.

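import polars as pl

df = pl.DataFrame({
    "groups": ["a"] * 10 + ["b"] * 20,
    "values": range(30)
})

# Take the first 20% of each group, then explode the per-group lists back into rows.
(df.groupby("groups")
    .agg(pl.all().head(pl.count() * 0.2))
    .explode(pl.all().exclude("groups"))
)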

which outputs:

shape: (6, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ str    ┆ i64    │
╞════════╪════════╡
│ a      ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a      ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 10     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 11     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 12     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 13     │
└────────┴────────┘
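
To sanity-check the result, here is a minimal pure-Python sketch of the same rule (n = 20% of the group size, then take the first n rows per group). The helper name head_fraction_per_group and its percent parameter are made up for illustration and are not part of polars. It also assumes n is rounded down when 20% of the group size is not a whole number, which may not match how polars converts a fractional n.

```python
from collections import defaultdict

def head_fraction_per_group(rows, percent=20):
    """Return the first `percent`% of the rows of each group, in input order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = []
    for key, values in groups.items():
        # Integer arithmetic rounds down and avoids float rounding surprises.
        n = len(values) * percent // 100
        result.extend((key, v) for v in values[:n])
    return result

# Same data as the example above: 10 rows in group "a", 20 rows in group "b".
rows = [("a", i) for i in range(10)] + [("b", i) for i in range(10, 30)]
print(head_fraction_per_group(rows))
```

With the data above this keeps 2 rows from group a and 4 rows from group b, the same rows as in the polars output.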
