I have a dataframe with customers and order dates. Order dates can be either a pl.Date or a null, but currently this column contains only null values. I want to create a new column, "startdate_before_override", which is set to either the "order_date" value of the same row or the "startdate" value (a constant of dtype pl.Date, declared at the third line).
I have tried doing this with the when/then/otherwise pattern, adding a check for nulls as the first "when" (edit: I see now that this check is actually not needed, but that's beside the point of my issue).
The problem I am trying to solve is that the values returned to startdate_before_override are pl.Float32 and not pl.Date.
import polars as pl
startdate = pl.date(year=2023, month=1, day=1)
data = {"customer": ["Mike", "Jan"], "order_date": [None, None]}
df = pl.DataFrame(data)
(df
.with_columns(
pl
.when(pl.col("order_date").is_null())
.then(startdate)
.when(pl.col("order_date")>startdate)
.then(pl.col("order_date"))
.otherwise(startdate)
.alias("startdate_before_override"),
)
)
The output is not as I expect:
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ f32 ┆ f32 │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ 19358.0 │
│ Jan ┆ null ┆ 19358.0 │
└──────────┴────────────┴───────────────────────────┘
Since both rows have null values for the order_date col, I would expect the result to be 2023-01-01 in a pl.Date format. To verify that it is actually the first .then that is returned, we can alter the code as shown here and check the new output:
In [8]: (df
...: .with_columns(
...: pl
...: .when(pl.col("order_date").is_null())
...: .then("test")
...: .when(pl.col("order_date")>startdate)
...: .then(pl.col("order_date"))
...: .otherwise(startdate)
...: .alias("startdate_before_override"),
...: )
...: )
Out[8]:
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ f32 ┆ str │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ test │
│ Jan ┆ null ┆ test │
└──────────┴────────────┴───────────────────────────┘
However, if I change both checks to return "startdate" it does work, for reasons I can't wrap my head around:
In [10]: (df
...: .with_columns(
...: pl
...: .when(pl.col("order_date").is_null())
...: .then(startdate)
...: .when(pl.col("order_date")>startdate)
...: .then(startdate)
...: .otherwise(startdate)
...: .alias("startdate_before_override"),
...: )
...: )
Out[10]:
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ f32 ┆ date │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ 2023-01-01 │
│ Jan ┆ null ┆ 2023-01-01 │
└──────────┴────────────┴───────────────────────────┘
Any ideas about what is going on here? I.e. why is the first example not returning a pl.Date value, but the third example is?
Edit: Issue seems to be connected to the dtype of order_date, which is pl.Float32 (see the comment from @jqurious).
Casting order_date to pl.Date in the .when statements does not resolve the issue. But when I specify the dtype for order_date at the creation time of the dataframe, things work as expected:
df = pl.DataFrame(data, schema={"customer": pl.Utf8, "order_date": pl.Date})
I don't understand why my third example did return a pl.Date while the first did not, though, but my original issue is resolved.