0

I was wondering if someone could please enlighten me.

I am trying to cumulatively sum pty_nber over/groupby a specific column (Declaration).

My original idea was to use something along:

dataset.filter(pl.col("pty_nber").first().over("Declaration").cumsum() < 30 )

But unfortunately, it does not take into account the .over() and just cumulatively sums all the rows. So rather than summing 4 + 7 + 8 etc.., it sums it 4 + 4 + 4 + 4 + 7 ...

Code Example

The goal is to show at least a few complete declarations and not cut in the middle.

Thanks in advance :)

As an example please see below:

enter image description here

--> and filter out for CUMSUM that are over a certain threshold such as 30 so that I make sure that no ONE declaration is not complete (i.e. not including all the pty_nber for that specific declaration)

enter image description here

  • Can you add a minimal working example? Some input data and the output data you'd like to see. – ritchie46 Oct 19 '22 at 13:16
  • @ritchie46 Hi Ritchie really appreciate you reaching out. I have added a bit more explanation. Let me know if you need further clarity. But basically I would like to limit the number of row I display (let's take 30 rows for example), however, I do not want one of my declarations to NOT include all of its items (ITEM NUMBER). So I would rather show Declaration 2, 3 & 4 fully even though it is less than 30 rows, than having only a few items of declaration 5. That's why I was using the cumsum over declaration. I hope that makes sense. Thanks in advance :) – McNickSisto Oct 19 '22 at 15:29

1 Answers1

0

So I am not sure if you want just to have the yellow marked entries or keep the whole dataframe and just remove the rows where CUMSUM is greater than 30. So I implemented in two ways:

Data

I simplified a little bit your example data

import polars as pl
df = pl.DataFrame(
    {
        "declaration": [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
        "item": [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4],
        "pty_nber": [12, 12, 12, 9, 9, 9, 9, 16, 16, 16, 16],
    }
)

First:

df.groupby("declaration", maintain_order=True).first().filter(
    pl.col("pty_nber").cumsum() < 30
)


shape: (2, 3)
┌─────────────┬──────┬──────────┐
│ declaration ┆ item ┆ pty_nber │
│ ---         ┆ ---  ┆ ---      │
│ i64         ┆ i64  ┆ i64      │
╞═════════════╪══════╪══════════╡
│ 2           ┆ 1    ┆ 12       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 1    ┆ 9        │
└─────────────┴──────┴──────────┘

Second:

For the second I use the help of the ITEM column, since I suppose the first element each group equals the first item

df.filter(
    pl.when(pl.col("item") == 1).then(pl.col("pty_nber")).otherwise(0).cumsum() < 30
)


shape: (7, 3)
┌─────────────┬──────┬──────────┐
│ declaration ┆ item ┆ pty_nber │
│ ---         ┆ ---  ┆ ---      │
│ i64         ┆ i64  ┆ i64      │
╞═════════════╪══════╪══════════╡
│ 2           ┆ 1    ┆ 12       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2           ┆ 2    ┆ 12       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2           ┆ 3    ┆ 12       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 1    ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 2    ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 3    ┆ 9        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ 4    ┆ 9        │
└─────────────┴──────┴──────────┘
alexp
  • 697
  • 4
  • 9