Let me elaborate on how shift
and shift_and_fill
work. The use of these comes down to strategy (and knowing your data).
Using shift
Let's start with this data set:
import polars as pl
df = pl.DataFrame({"row_num": range(1, 8),
"Column2": [1, 2, 3, 3, 4, 5, 4]}).lazy()
df.collect()
shape: (7, 2)
┌─────────┬─────────┐
│ row_num ┆ Column2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════╪═════════╡
│ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 4 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 4 │
└─────────┴─────────┘
Now, let's create intermediate columns to see how these functions work.
(df
.with_column(pl.col("Column2").shift().alias("Column2_shifted"))
.with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()
shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ bool │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1 ┆ 1 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 3 ┆ 3 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 4 ┆ 3 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 4 ┆ 5 ┆ true │
└─────────┴─────────┴─────────────────┴───────────────┘
Notice how in the first row, the Column2_shifted
has a null
(really, None
) value in the first row.
But more importantly, the result of (pl.col("Column2") != pl.col("Column2_shifted"))
is True
for the first row.
Thus, as long as null
values are not allowed in Column2
, the first row will be included. You don't need to separately concatenate the first row of the dataset to your results.
Note: in practice, you don't need these intermediate columns. You can simply use .filter(pl.col("Column2") != pl.col("Column2").shift())
. The intermediate columns are just for explanatory purposes here.
Using shift_and_fill
If None
/null
values are allowed in Column2
, then you can try using shift_and_fill
and selecting a fill_value
that is not allowed in Column2
.
For example, if you know that negative numbers are not allowed in Column2
, you can use this logic.
(df
.with_column(pl.col("Column2").shift_and_fill(periods=1, fill_value=-1).alias("Column2_shifted"))
.with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()
shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ bool │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1 ┆ 1 ┆ -1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 3 ┆ 3 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 4 ┆ 3 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 4 ┆ 5 ┆ true │
└─────────┴─────────┴─────────────────┴───────────────┘
With this strategy, the first row is always included, without having to separately concatenate the first row to your results. That's because you've purposely selected a fill_value
that will never match any value in Column2
.
Adding is_first
to the expression
If you are unsure of which values are allowed in Column2
(even None
), then I would suggest appending is_first
to your expression (rather than concatenating the first row to your results dataset):
(df
.with_column(pl.col("Column2").shift().alias("Column2_shifted"))
.with_column((pl.col("Column2").is_first() | (pl.col("Column2") != pl.col("Column2_shifted"))).alias("not_eq_result"))
).collect()
shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ bool │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1 ┆ 1 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 2 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 3 ┆ 3 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 4 ┆ 3 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 4 ┆ 5 ┆ true │
└─────────┴─────────┴─────────────────┴───────────────┘
This forces your first row to evaluate to True
, simply because it is first. (Just be very careful with the nested paratheses in the expression - or you may not get the result you expect.
Does this help clarify things?