  • Q1: In polars-rust, after .groupby().agg(), we can use .head(10) to get the first 10 elements of a column. But the groups have different lengths, and I need the first 20% of the elements in each group (for example, the first 24 elements of a 120-element group). How can I do that?

  • Q2: With a dataframe like the sample below, my goal is to loop over the rows. Because polars is column-major, I downcast the df into several ChunkedArrays and iterated via iter().zip(). I found this is faster than doing the same thing after groupby(col("date")), which loops over list elements. Why is that? I expected the opposite: the df is shorter after groupby, which should mean a shorter loop.

Date Stock Price
2010-01-01 IBM 1000
2010-01-02 IBM 1001
2010-01-03 IBM 1002
2010-01-01 AAPL 2900
2010-01-02 AAPL 2901
2010-01-03 AAPL 2902
pedrosaurio
Hakase

1 Answer

I don't really understand your 2nd question. Maybe you can create another question with a small example.

I will answer the 1st question:

We can use head(10) to get the first 10 elements of a column. But the groups have different lengths, and I need the first 20% of the elements in each group (for example, the first 24 elements of a 120-element group). How can I make it work?

We can use expressions to take head(n), where n = 0.2 * group_size.

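import polars as pl

df = pl.DataFrame({
    "groups": ["a"] * 10 + ["b"] * 20,
    "values": range(30)
})

# Per group, keep the first 20% of the rows (head size = 0.2 * group count),
# then explode the aggregated lists back into one row per value.
(df.groupby("groups")
    .agg(pl.all().head(pl.count() * 0.2))
    .explode(pl.all().exclude("groups"))
)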

which outputs:

shape: (6, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ str    ┆ i64    │
╞════════╪════════╡
│ a      ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a      ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 10     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 11     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 12     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 13     │
└────────┴────────┘
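
To make the per-group arithmetic explicit, here is a minimal plain-Python sketch of the same idea, without Polars. The function name head_fraction_per_group is hypothetical, and truncating 0.2 * group_size toward zero is an assumption about how a fractional n should be rounded.

```python
from collections import defaultdict


def head_fraction_per_group(rows, fraction=0.2):
    """Return the first `fraction` of the rows in each group, keeping input order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = []
    for key, values in groups.items():
        # Assumption: a fractional head size is truncated toward zero.
        n = int(len(values) * fraction)
        result.extend((key, v) for v in values[:n])
    return result


# Same data as the DataFrame above: 10 rows in group "a", 20 rows in group "b".
rows = [("a", v) for v in range(10)] + [("b", v) for v in range(10, 30)]
selected = head_fraction_per_group(rows)
print(selected)
```

With this data the sketch keeps 2 rows from group "a" and 4 rows from group "b", matching the table above. A 120-element group would keep its first 24 elements.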

ritchie46