2

I have a dataframe with customers and order dates. Order dates can be either a pl.Date or a null, but currently this column contains only null values. I want to create a new column, "startdate_before_override", which is set to either the "order_date" value of the same row or the "startdate" value (a constant of dtype pl.Date, declared at the third line).

I have tried doing this with the when/then/otherwise pattern, adding a check for nulls as the first "when" (edit: I see now that this check is actually not needed, but that's beside the point of my issue).

The problem I am trying to solve is that the values returned to startdate_before_override are pl.Float32 and not pl.Date.

import polars as pl

startdate = pl.date(year=2023, month=1, day=1)

data = {"customer": ["Mike", "Jan"], "order_date": [None, None]}

df = pl.DataFrame(data)
(df
.with_columns(
    pl
    .when(pl.col("order_date").is_null())
    .then(startdate)
    .when(pl.col("order_date")>startdate)
    .then(pl.col("order_date"))
    .otherwise(startdate)
    .alias("startdate_before_override"),
)
)

The output is not as I expect:

shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ f32        ┆ f32                       │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ 19358.0                   │
│ Jan      ┆ null       ┆ 19358.0                   │
└──────────┴────────────┴───────────────────────────┘

Since both rows have null values for the order_date col, I would expect the result to be 2023-01-01 in a pl.Date format. To verify that it is actually the first .then that is returned, we can alter the code as shown here and check the new output:

In [8]: (df
   ...: .with_columns(
   ...:     pl
   ...:     .when(pl.col("order_date").is_null())
   ...:     .then("test")
   ...:     .when(pl.col("order_date")>startdate)
   ...:     .then(pl.col("order_date"))
   ...:     .otherwise(startdate)
   ...:     .alias("startdate_before_override"),
   ...: )
   ...: )
Out[8]: 
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ f32        ┆ str                       │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ test                      │
│ Jan      ┆ null       ┆ test                      │
└──────────┴────────────┴───────────────────────────┘

However, if I change both checks to return "startdate" it does work, for reasons I can't wrap my head around:

In [10]: (df
    ...: .with_columns(
    ...:     pl
    ...:     .when(pl.col("order_date").is_null())
    ...:     .then(startdate)
    ...:     .when(pl.col("order_date")>startdate)
    ...:     .then(startdate)
    ...:     .otherwise(startdate)
    ...:     .alias("startdate_before_override"),
    ...: )
    ...: )
Out[10]: 
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ f32        ┆ date                      │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ 2023-01-01                │
│ Jan      ┆ null       ┆ 2023-01-01                │
└──────────┴────────────┴───────────────────────────┘

Any ideas about what is going on here? I.e. why is the first example not returning a pl.Date value, but the third example is?

Edit: Issue seems to be connected to the dtype of order_date, which is pl.Float32 (see the comment from @jqurious).

Casting order_date to pl.Date in the .when statements does not resolve the issue. But when I specify the dtype for order_date at the creation time of the dataframe, things work as expected:

df = pl.DataFrame(data, schema={"customer": pl.Utf8, "order_date": pl.Date})

I don't understand why my third example did return a pl.Date while the first did not, though, but my original issue is resolved.

isherwood
  • 58,414
  • 16
  • 114
  • 157
Balthazar
  • 81
  • 9

1 Answers1

1

In your example the order_date column is all nulls and defaulting to the f32 dtype.

As your when/then can produce startdate or order_date, i.e. either an f32 or pl.Date value, Polars calculates the "supertype", which in this case is f32.

startdate is being cast to a float, which is where 19358.0 is coming from.

>>> pl.select(startdate.cast(float))
shape: (1, 1)
┌─────────┐
│ date    │
│ ---     │
│ f64     │
╞═════════╡
│ 19358.0 │
└─────────┘

You can either explicitly set the type of the column to date on creation, e.g. with schema= / schema_overrides=

df = pl.DataFrame(data, schema_overrides={"order_date": pl.Date})

df.with_columns(
   pl.when(pl.col("order_date") > startdate)
     .then(pl.col("order_date"))
     .otherwise(startdate)
     .alias("startdate_before_override"),
)

Or you can explicitly .cast() the result of the .when().then()

df.with_columns(
   pl.when(pl.col("order_date") > startdate)
     .then(pl.col("order_date"))
     .otherwise(startdate)
     .cast(pl.Date)
     .alias("startdate_before_override"),
)
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ ---      ┆ ---        ┆ ---                       │
│ str      ┆ date       ┆ date                      │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike     ┆ null       ┆ 2023-01-01                │
│ Jan      ┆ null       ┆ 2023-01-01                │
└──────────┴────────────┴───────────────────────────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14