2

Test Data:

import polars as pl
import pandas as pd
from datetime import date, time, datetime

# polars df for Jan to 12th, March 2022 before DST transition.
tdf = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 3, 12),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)

My use case is to convert UTC data (or timezone-unaware) to EST/EDT data (or timezone-aware), then filter the data on a time range (9:30 to 16:00).

tz_replaced_filtered = tdf.with_columns( 
    pl.col("UTC").dt.replace_time_zone(time_zone="US/Eastern").alias("NY")
).filter(
  pl.col("NY").cast(pl.Time).is_between(time(9), time(16), closed="both")
)

# output
shape: (5780, 2)
┌─────────────────────────┬──────────────────────────┐
│ UTC                     ┆ NY                       │
│ ---                     ┆ ---                      │
│ datetime[ns, UTC]       ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╡
│ 2022-01-03 04:00:00 UTC ┆ 2022-01-03 04:00:00 EST  │
│ 2022-01-03 04:05:00 UTC ┆ 2022-01-03 04:05:00 EST  │
│ 2022-01-03 04:10:00 UTC ┆ 2022-01-03 04:10:00 EST  │
│ 2022-01-03 04:15:00 UTC ┆ 2022-01-03 04:15:00 EST  │
│ …                       ┆ …                        │
│ 2022-03-11 10:45:00 UTC ┆ 2022-03-11 10:45:00 EST  │
│ 2022-03-11 10:50:00 UTC ┆ 2022-03-11 10:50:00 EST  │
│ 2022-03-11 10:55:00 UTC ┆ 2022-03-11 10:55:00 EST  │
│ 2022-03-11 11:00:00 UTC ┆ 2022-03-11 11:00:00 EST  │
└─────────────────────────┴──────────────────────────┘

Apparently there's no obvious 9 to 16 time range in this resulted df.

I probed further:

tz_replaced_filtered.with_columns(
   pl.col("NY").cast(pl.Time).alias("NY_TIME"),
   pl.col("UTC").cast(pl.Time).alias("UTC_TIME")
)

# output
shape: (5780, 4)
┌─────────────────────────┬──────────────────────────┬──────────┬──────────┐
│ UTC                     ┆ NY                       ┆ NY_TIME  ┆ UTC_TIME │
│ ---                     ┆ ---                      ┆ ---      ┆ ---      │
│ datetime[ns, UTC]       ┆ datetime[ns, US/Eastern] ┆ time     ┆ time     │
╞═════════════════════════╪══════════════════════════╪══════════╪══════════╡
│ 2022-01-03 04:00:00 UTC ┆ 2022-01-03 04:00:00 EST  ┆ 09:00:00 ┆ 04:00:00 │
│ 2022-01-03 04:05:00 UTC ┆ 2022-01-03 04:05:00 EST  ┆ 09:05:00 ┆ 04:05:00 │
│ 2022-01-03 04:10:00 UTC ┆ 2022-01-03 04:10:00 EST  ┆ 09:10:00 ┆ 04:10:00 │
│ 2022-01-03 04:15:00 UTC ┆ 2022-01-03 04:15:00 EST  ┆ 09:15:00 ┆ 04:15:00 │
│ …                       ┆ …                        ┆ …        ┆ …        │
│ 2022-03-11 10:45:00 UTC ┆ 2022-03-11 10:45:00 EST  ┆ 15:45:00 ┆ 10:45:00 │
│ 2022-03-11 10:50:00 UTC ┆ 2022-03-11 10:50:00 EST  ┆ 15:50:00 ┆ 10:50:00 │
│ 2022-03-11 10:55:00 UTC ┆ 2022-03-11 10:55:00 EST  ┆ 15:55:00 ┆ 10:55:00 │
│ 2022-03-11 11:00:00 UTC ┆ 2022-03-11 11:00:00 EST  ┆ 16:00:00 ┆ 11:00:00 │
└─────────────────────────┴──────────────────────────┴──────────┴──────────┘

The underlying timestamps does seem to show that I succeed in filtering on time range 9 to 16; and it is only the string representation which isn't in sync.

 tdf.with_columns(
     pl.col("UTC").dt.replace_time_zone(time_zone="US/Eastern").alias("NY_REPLACED"),
     pl.col("UTC").dt.convert_time_zone(time_zone="US/Eastern").alias("NY_CONVERTED")
 )

# output
shape: (19585, 3)
┌─────────────────────────┬──────────────────────────┬──────────────────────────┐
│ UTC                     ┆ NY_REPLACED              ┆ NY_CONVERTED             │
│ ---                     ┆ ---                      ┆ ---                      │
│ datetime[ns, UTC]       ┆ datetime[ns, US/Eastern] ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╪══════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-03 00:00:00 EST  ┆ 2022-01-02 19:00:00 EST  │
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-03 00:05:00 EST  ┆ 2022-01-02 19:05:00 EST  │
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-03 00:10:00 EST  ┆ 2022-01-02 19:10:00 EST  │
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-03 00:15:00 EST  ┆ 2022-01-02 19:15:00 EST  │
│ …                       ┆ …                        ┆ …                        │
│ 2022-03-11 23:45:00 UTC ┆ 2022-03-11 23:45:00 EST  ┆ 2022-03-11 18:45:00 EST  │
│ 2022-03-11 23:50:00 UTC ┆ 2022-03-11 23:50:00 EST  ┆ 2022-03-11 18:50:00 EST  │
│ 2022-03-11 23:55:00 UTC ┆ 2022-03-11 23:55:00 EST  ┆ 2022-03-11 18:55:00 EST  │
│ 2022-03-12 00:00:00 UTC ┆ 2022-03-12 00:00:00 EST  ┆ 2022-03-11 19:00:00 EST  │
└─────────────────────────┴──────────────────────────┴──────────────────────────┘

This final step concludes that replaced_time_zone and convert_time_zone does not generate the same string representation for an identical time zone conversion operation.

More specifically, replaced_time_zone does replace the underlying timestamps however it does not "replace" the string representation; what replaced_time_zone does on the surface for UTC (or simply timezone-unaware timestamps) is like with_time_zone, i.e., simply attaching time zone info to existing time stamps.

I'd appreciate if someone could clarify above.

Question Background

This is for polars users and devs such that all questions related to timezone conversion raised by me for polars can be easily referenced.

I firstly raised this to see how we could filter based on converted timestamps (utc offset accounted for), then as polars provided replace_time_zone a question for it is also raised.

A related question metioned by @FObersteiner is here.

For the time being my understanding is that replace_time_zone should provide the same string representation as convert_time_zone, because essentially what they are doing it the same. And this would make filtering on date/time range in polars WYSIWYG (What You See Is What You Get).

Background Information

Polars provides replace_time_zone and convert_time_zone to handle scenarios like this.

convert_time_zone changes only the string representation presented to the user, not the underlying timestamps. Therefore cast is not going to work with "converted" timestamps in my use case (filtering would still be on the orignal underlying timestamps).

Regarding replace_time_zone, DST transitions is causing problem when user tries to replace_time_zone and this issue seems to be universal given that particular timestamp just doesn't exist because of DST transition:

# demo polars df
tpl = pl.DataFrame(
   pl.date_range(
        low=date(2022, 3, 13),
        high=date(2022, 3, 15),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
   ).alias("UTC")
)

tpl.select(
   pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("NY")
)

# Panic Exception: no such local time

# demo pandas pf
tpn = pd.DataFrame(
    pd.date_range(start="2022-03-13", 
                  end="2022-03-15",
                  freq="5T", # 5 minute interval
                  tz=None, # default to UTC
                  inclusive="both"),
    columns=["UTC"]
)

tpn.UTC.dt.tz_localize("America/New_York")

# NonExistentTimeError: 2022-03-13 02:00:00
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
stucash
  • 1,078
  • 1
  • 12
  • 23
  • 1
    I think [this question](https://stackoverflow.com/q/75268283/10197418) is related. And I think `.cast(pl.Time)` should yield *local time*, not the underlying UTC representation. So essentially, it's a bug imho. – FObersteiner Mar 21 '23 at 13:16
  • 1
    @FObersteiner [v0.16.15](https://github.com/pola-rs/polars/pull/7735) just added `dt.datetime(), dt.date(), dt.time()` which I think address these issues? – jqurious Mar 24 '23 at 15:16
  • 1
    @jqurious yes, looks good now – FObersteiner Mar 24 '23 at 15:38

1 Answers1

1

I think the issue you're running into is that pl.cast(pl.Time) doesn't convert to local time.

You probably want to filter using .dt.hour():

In [23]: tz_replaced_filtered.filter(pl.col('NY').dt.hour().is_between(9, 16))
Out[23]:
shape: (6528, 2)
┌─────────────────────────┬──────────────────────────┐
│ UTC                     ┆ NY                       │
│ ---                     ┆ ---                      │
│ datetime[ns, UTC]       ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╡
│ 2022-01-03 09:00:00 UTC ┆ 2022-01-03 09:00:00 EST  │
│ 2022-01-03 09:05:00 UTC ┆ 2022-01-03 09:05:00 EST  │
│ 2022-01-03 09:10:00 UTC ┆ 2022-01-03 09:10:00 EST  │
│ 2022-01-03 09:15:00 UTC ┆ 2022-01-03 09:15:00 EST  │
│ …                       ┆ …                        │
│ 2022-03-11 16:40:00 UTC ┆ 2022-03-11 16:40:00 EST  │
│ 2022-03-11 16:45:00 UTC ┆ 2022-03-11 16:45:00 EST  │
│ 2022-03-11 16:50:00 UTC ┆ 2022-03-11 16:50:00 EST  │
│ 2022-03-11 16:55:00 UTC ┆ 2022-03-11 16:55:00 EST  │
└─────────────────────────┴──────────────────────────┘

EDIT

As of Polars 0.16.15, you can use .dt.time:

In [12]: tz_replaced_filtered = tdf.with_columns(
    ...:     pl.col("UTC").dt.replace_time_zone(time_zone="US/Eastern").alias("NY")
    ...: ).filter(
    ...:   pl.col("NY").dt.time().is_between(time(9), time(16), closed="both")
    ...: )

In [13]: tz_replaced_filtered
Out[13]:
shape: (5780, 2)
┌─────────────────────────┬──────────────────────────┐
│ UTC                     ┆ NY                       │
│ ---                     ┆ ---                      │
│ datetime[ns, UTC]       ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╡
│ 2022-01-03 09:00:00 UTC ┆ 2022-01-03 09:00:00 EST  │
│ 2022-01-03 09:05:00 UTC ┆ 2022-01-03 09:05:00 EST  │
│ 2022-01-03 09:10:00 UTC ┆ 2022-01-03 09:10:00 EST  │
│ 2022-01-03 09:15:00 UTC ┆ 2022-01-03 09:15:00 EST  │
│ …                       ┆ …                        │
│ 2022-03-11 15:45:00 UTC ┆ 2022-03-11 15:45:00 EST  │
│ 2022-03-11 15:50:00 UTC ┆ 2022-03-11 15:50:00 EST  │
│ 2022-03-11 15:55:00 UTC ┆ 2022-03-11 15:55:00 EST  │
│ 2022-03-11 16:00:00 UTC ┆ 2022-03-11 16:00:00 EST  │
└─────────────────────────┴──────────────────────────┘
ignoring_gravity
  • 6,677
  • 4
  • 32
  • 65