5

I am trying to move from pandas to polars but I am running into the following issue.

import polars as pl

df = pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00+00:00",
            "2010-02-01T00:00:00+00:00",
            "2010-02-01T01:00:00+00:00"
        ]
    }
)
df = df.with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
)

Yields the following dataframe:

>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date                           │
│ ---     ┆ ---                            │
│ i64     ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        │
│ 2       ┆ 2010-02-01 01:00:00 CET        │
│ 3       ┆ 2010-02-01 02:00:00 CET        │
└─────────┴────────────────────────────────┘

As you can see, I transformed the datetime string from UTC to CET succesfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:

df = df.with_columns(
    [
        pl.col("date").cast(pl.Date).alias("valueDay")
    ]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date                           ┆ valueDay   │
│ ---     ┆ ---                            ┆ ---        │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ date       │
╞═════════╪════════════════════════════════╪════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 2010-01-31 │
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 2010-02-01 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘

The valueDay should be 2010-02-01 for all 3 values.

Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df or is there a way to chain all of this?

Edit:

I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html

df = pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00+00:00",
            "2010-02-01T00:00:00+00:00",
            "2010-02-01T01:00:00+00:00"
        ]
    }
)
df = df.with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
)
df = df.with_columns(
    [
        pl.col("date").dt.day().alias("day"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.year().alias("year"),
    ]
)
df = df.with_columns(
    pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
    [
        pl.col("datetime").cast(pl.Date).alias("valueDay")
    ]
)

Yields the following:

>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date                           ┆ day ┆ month ┆ year ┆ datetime            ┆ valueDay   │
│ ---     ┆ ---                            ┆ --- ┆ ---   ┆ ---  ┆ ---                 ┆ ---        │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32   ┆ i32  ┆ datetime[μs]        ┆ date       │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘
brokkoo
  • 157
  • 7
  • Extracting a date in a certain timezone via string is another option as shown below, whereas there should be more beautiful way. `df.with_columns(pl.col('date').dt.strftime('%Y-%m-%d').str.strptime(pl.Date, '%Y-%m-%d').alias('valueDay'))` – quasi-human Jan 28 '23 at 15:15
  • 5
    There is an [active issue regarding `dt.date`](https://github.com/pola-rs/polars/issues/6130) - you can chain `.with_columns().with_columns()` – jqurious Jan 28 '23 at 16:08

2 Answers2

2

For completeness, in polars 0.16.15+, dt namespace methods .date(), .time() etc. return local datetime attributes. EX:

import polars as pl

df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "date": [
            "2010-01-31T23:00:00+00:00",
            "2010-02-01T00:00:00+00:00",
            "2010-02-01T01:00:00+00:00",
        ],
    }
)
df = df.with_columns(pl.col("date").str.strptime(pl.Datetime, fmt="%+")
                     .dt.convert_time_zone("Europe/Amsterdam"))

df = df.with_columns(
    [
        pl.col("date").dt.day().alias("day"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.year().alias("year"),
    ]
)

print(df)

┌─────────┬────────────────────────────────┬─────┬───────┬──────┐
│ integer ┆ date                           ┆ day ┆ month ┆ year │
│ ---     ┆ ---                            ┆ --- ┆ ---   ┆ ---  │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32   ┆ i32  │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 │
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 1   ┆ 2     ┆ 2010 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┘
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
1

There's nothing wrong with the cast function per se. It's just not intended to be used for truncating the time out of a datetime.

As it happens the function you're looking for is truncate so this is what you want to do (in the last with_columns chunk)

pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00+00:00",
            "2010-02-01T00:00:00+00:00",
            "2010-02-01T01:00:00+00:00"
        ]
    }
).with_columns(
    [
        pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
    ]
).with_columns(
    [
        pl.col("date").dt.truncate('1d').alias("valueDay")
    ]
)
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • From the documentation it looks like the ```truncate``` method also isn't meant to extract the date from a datetime column either and therefore this seems like a workaround. I think we can agree that what is missing here is a polars.Expr.dt.date method which actually extracts the date from an underlying Date representation. – brokkoo Feb 03 '23 at 10:17
  • I'm not sure why you say that. `truncate` can do *more* than just truncating to the date of a datetime but that doesn't mean it isn't meant to do exactly this. – Dean MacGregor Feb 07 '23 at 12:01