I am new to polars and I am not sure whether I am using .with_columns() correctly.

Here's a situation I encounter frequently: There's a dataframe and in .with_columns(), I apply some operation to a column. For example, I convert some dates from str to date type and then want to compute the duration between start and end date. I'd implement this as follows.

import polars as pl 

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.strptime(pl.Date, fmt="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, fmt="%d.%m.%Y"),
    ]
).with_columns(
    [
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)

First, I convert the two columns, next I call .with_columns() again.

Something shorter like this does not work:

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.strptime(pl.Date, fmt="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, fmt="%d.%m.%Y"),
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)

Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?

Thomas

1 Answer

The second .with_columns is needed.

From @DeanMacGregor

To elaborate, everything in a context (with_columns in this case) only knows about what's in the dataframe before the context was called. Each expression in a context is unaware of every other expression in the context. This is by design because all the expressions run in parallel. If you need one expression to know the output of another expression, you need another context.

You could, however, shorten the code by passing multiple names to .col() and using named arguments instead of .alias():

(df
 .with_columns(
    pl.col("start", "end").str.strptime(pl.Date, fmt="%d.%m.%Y"))
 .with_columns(
    duration = pl.col("end") - pl.col("start")))
shape: (2, 3)
┌────────────┬────────────┬──────────────┐
│ start      ┆ end        ┆ duration     │
│ ---        ┆ ---        ┆ ---          │
│ date       ┆ date       ┆ duration[ms] │
╞════════════╪════════════╪══════════════╡
│ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
│ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
└────────────┴────────────┴──────────────┘
jqurious
  • Thanks! The `fmt` argument is slightly different for `start` and `end` in the actual data I use, but I'll keep the suggestions in mind :) – Thomas Mar 01 '23 at 10:01
  • @Thomas To elaborate, everything in a context (`with_columns` in this case) only knows about what's in the dataframe before the context was called. Each expression in a context is unaware of every other expression in the context. This is by design because all the expressions run in parallel. If you need one expression to know the output of another expression, you need another context. – Dean MacGregor Mar 01 '23 at 11:28
  • Thanks for the clarification @DeanMacGregor - I've added your comment to my answer as I think it's worthwhile information. Hope that's okay. – jqurious Mar 01 '23 at 14:13