
I am hoping to count consecutive values in a column, preferably using Polars expressions.

import polars as pl

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)

With the example data frame above, I would like to count the number of consecutive True values.

Below is example output using R's data.table package.

library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
    value value2
 1:  TRUE     NA
 2:  TRUE     NA
 3:  TRUE      3
 4: FALSE     NA
 5: FALSE     NA
 6:  TRUE      1
 7: FALSE     NA
 8: FALSE     NA
 9:  TRUE     NA
10:  TRUE      2

Any ideas how this could be done efficiently using Polars?

[EDIT with a new approach]

I got it working with the code below, but I'm hoping there is a more efficient way. Does anyone know the default struct/dictionary field names from value_counts?

(
    df.lazy()
    .with_row_count()
    .with_column(
        # give every run of Trues the row number of the False that precedes it
        pl.when(pl.col("values") == False)
        .then(pl.col("row_nr"))
        .fill_null(strategy="forward")
        .alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort=True)
    )
    .with_column(
        # value_counts produced a list of structs; pull out the count field,
        # take the max, and subtract 1 for the False row included in each group
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct.rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)
    The `.value_counts` field names are the name of what you're counting and `counts` so `id_consecutive_Trues` and `counts` in this case. – jqurious Feb 09 '23 at 23:07
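
For reference, a minimal sketch of what that comment describes (assuming the field names from that era of Polars; later releases renamed the `counts` field to `count`):

import polars as pl

df = pl.DataFrame({"values": [True, True, True, False]})

# value_counts() yields a struct column; its fields are the source
# column's name and "counts", so they can be accessed by name:
df.select(
    pl.col("values").value_counts(sort=True).struct.field("counts")
)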

1 Answer

  • Update: .rle_id() was added in Polars 0.18.7

.rle_id() can be used to group the consecutive values.

df.with_columns(group = pl.col("values").rle_id())
shape: (10, 2)
┌────────┬───────┐
│ values ┆ group │
│ ---    ┆ ---   │
│ bool   ┆ u32   │
╞════════╪═══════╡
│ true   ┆ 0     │
│ true   ┆ 0     │
│ true   ┆ 0     │
│ false  ┆ 1     │
│ false  ┆ 1     │
│ true   ┆ 2     │
│ false  ┆ 3     │
│ false  ┆ 3     │
│ true   ┆ 4     │
│ true   ┆ 4     │
└────────┴───────┘

You can take the count of each group when true changes to false:

df.with_columns(
    pl.when(
        # the last row of each run of Trues: value is True and the next is not
        pl.col("values")
        & pl.col("values").shift_and_fill(False, periods=-1).is_not()
    )
    .then(pl.count().over(pl.col("values").rle_id()))
)
shape: (10, 2)
┌────────┬───────┐
│ values ┆ count │
│ ---    ┆ ---   │
│ bool   ┆ u32   │
╞════════╪═══════╡
│ true   ┆ null  │
│ true   ┆ null  │
│ true   ┆ 3     │
│ false  ┆ null  │
│ false  ┆ null  │
│ true   ┆ 1     │
│ false  ┆ null  │
│ false  ┆ null  │
│ true   ┆ null  │
│ true   ┆ 2     │
└────────┴───────┘

The .shift_and_fill() is to account for the case where no false follows, i.e. the final run of 2 in this example.
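
On newer Polars releases the same idea still works; a sketch, assuming the later renames (`shift_and_fill` to `shift(..., fill_value=...)`, `is_not` to `not_`, and `pl.count()` to `pl.len()`):

df.with_columns(
    count=pl.when(
        # last row of each run of Trues
        pl.col("values")
        & pl.col("values").shift(-1, fill_value=False).not_()
    ).then(pl.len().over(pl.col("values").rle_id()))
)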

  • Thanks! I thought to use shift, but figured it would be too expensive. Any insight into whether your solution would be more performant than my approach? I will benchmark both. Thank you again for the quick response! – JGrant06 Feb 11 '23 at 18:11
  • I would think it's more performant as it seems to be doing less work than your example - but I could be wrong. If you're using `.arr.eval` you can try adding [`parallel=True`](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.arr.eval.html#polars.Expr.arr.eval) to see if it makes a difference in your benchmark. – jqurious Feb 11 '23 at 18:15
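
For completeness, a sketch of where that `parallel` flag goes in the question's expression (assuming the default `counts` field name discussed in the comments above):

pl.col("id_consecutive_Trues").arr.eval(
    pl.element().struct.field("counts"),
    parallel=True,  # evaluate the per-list expression on the thread pool
).arr.max()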