Lazy filter depending on the previous line (Polars Python)

Question

I'm using Python Polars and I have a table like this :

Column1	Column2
id1	1
id1	1
id1	2
id1	1
id1	1
id1	2
id1	3

I would like, using Polars Lazy API, to have the result when column2 previous element is different from the actual column2 element. So the result would look like this after the operation :

Column1	Column2
id1	1
id1	2
id1	1
id1	2
id1	3

Thanks !

score 1 · Answer 1 · answered Feb 25 '22 at 13:58

Use the shift expression.

import polars as pl

df = pl.DataFrame(
    {"Column1": ["id1"] * 7, "Column2": [1, 1, 2, 1, 1, 2, 3]}).lazy()

df.filter(pl.col("Column2") != pl.col("Column2").shift(periods=1)).collect()

shape: (5, 2)
┌─────────┬─────────┐
│ Column1 ┆ Column2 │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 3       │
└─────────┴─────────┘

You can find documentation about the options here: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.shift.html#polars.Expr.shift

Note that you can reverse the direction of the shift as well as the count using the periods argument.

And there is also a version shift_and_fill that will fill the None values created as a result of the shift.

This solution work but it will always miss one result (depending if periods is 1 or -1 then the head or tail will not be count in the result). So I just added before the filter, a save of the first row with `first = lazyFrame.first()` then after the filter, I just use `polars.concat` with first and lazyFrame and it gives me the expected result, thanks ! — Zorp, Feb 25 '22 at 15:04
Actually, you may not need to do this. Indeed, you may accidentally create a duplicate record by concatenating the first record to your results. I've elaborated on my answer. I hope this helps. — , Feb 25 '22 at 18:20

score 1 · Accepted Answer · answered Feb 25 '22 at 18:19

Let me elaborate on how shift and shift_and_fill work. The use of these comes down to strategy (and knowing your data).

Using shift

Let's start with this data set:

import polars as pl
df = pl.DataFrame({"row_num": range(1, 8),
                   "Column2": [1, 2, 3, 3, 4, 5, 4]}).lazy()
df.collect()

shape: (7, 2)
┌─────────┬─────────┐
│ row_num ┆ Column2 │
│ ---     ┆ ---     │
│ i64     ┆ i64     │
╞═════════╪═════════╡
│ 1       ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       │
└─────────┴─────────┘

Now, let's create intermediate columns to see how these functions work.

(df
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

Notice how in the first row, the Column2_shifted has a null (really, None) value in the first row.

But more importantly, the result of (pl.col("Column2") != pl.col("Column2_shifted")) is True for the first row.

Thus, as long as null values are not allowed in Column2, the first row will be included. You don't need to separately concatenate the first row of the dataset to your results.

Note: in practice, you don't need these intermediate columns. You can simply use .filter(pl.col("Column2") != pl.col("Column2").shift()). The intermediate columns are just for explanatory purposes here.

Using shift_and_fill

If None/null values are allowed in Column2, then you can try using shift_and_fill and selecting a fill_value that is not allowed in Column2.

For example, if you know that negative numbers are not allowed in Column2, you can use this logic.

(df
    .with_column(pl.col("Column2").shift_and_fill(periods=1, fill_value=-1).alias("Column2_shifted"))
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ -1              ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

With this strategy, the first row is always included, without having to separately concatenate the first row to your results. That's because you've purposely selected a fill_value that will never match any value in Column2.

Adding is_first to the expression

If you are unsure of which values are allowed in Column2 (even None), then I would suggest appending is_first to your expression (rather than concatenating the first row to your results dataset):

(df
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    .with_column((pl.col("Column2").is_first() | (pl.col("Column2") != pl.col("Column2_shifted"))).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

This forces your first row to evaluate to True, simply because it is first. (Just be very careful with the nested paratheses in the expression - or you may not get the result you expect.

Does this help clarify things?

Oh yeah I understand now why it doesn't took up the first row. To be honest, I totally forgot about that `is_first()` function. I should give more look at the docs. Thanks for your time and explanations ! — Zorp, Feb 25 '22 at 22:48

Lazy filter depending on the previous line (Polars Python)

2 Answers2