Test Data:
import polars as pl
import pandas as pd
from datetime import date, time, datetime
# polars df for Jan to 12th, March 2022 before DST transition.
tdf = pl.DataFrame(
pl.date_range(
low=date(2022, 1, 3),
high=date(2022, 3, 12),
interval="5m",
time_unit="ns",
time_zone="UTC",
).alias("UTC")
)
My use case is to convert UTC data (or timezone-unaware) to EST/EDT data (or timezone-aware), then filter the data on a time range (9:30 to 16:00).
tz_replaced_filtered = tdf.with_columns(
pl.col("UTC").dt.replace_time_zone(time_zone="US/Eastern").alias("NY")
).filter(
pl.col("NY").cast(pl.Time).is_between(time(9), time(16), closed="both")
)
# output
shape: (5780, 2)
┌─────────────────────────┬──────────────────────────┐
│ UTC ┆ NY │
│ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╡
│ 2022-01-03 04:00:00 UTC ┆ 2022-01-03 04:00:00 EST │
│ 2022-01-03 04:05:00 UTC ┆ 2022-01-03 04:05:00 EST │
│ 2022-01-03 04:10:00 UTC ┆ 2022-01-03 04:10:00 EST │
│ 2022-01-03 04:15:00 UTC ┆ 2022-01-03 04:15:00 EST │
│ … ┆ … │
│ 2022-03-11 10:45:00 UTC ┆ 2022-03-11 10:45:00 EST │
│ 2022-03-11 10:50:00 UTC ┆ 2022-03-11 10:50:00 EST │
│ 2022-03-11 10:55:00 UTC ┆ 2022-03-11 10:55:00 EST │
│ 2022-03-11 11:00:00 UTC ┆ 2022-03-11 11:00:00 EST │
└─────────────────────────┴──────────────────────────┘
Apparently there's no obvious 9 to 16 time range in this resulted df
.
I probed further:
tz_replaced_filtered.with_columns(
pl.col("NY").cast(pl.Time).alias("NY_TIME"),
pl.col("UTC").cast(pl.Time).alias("UTC_TIME")
)
# output
shape: (5780, 4)
┌─────────────────────────┬──────────────────────────┬──────────┬──────────┐
│ UTC ┆ NY ┆ NY_TIME ┆ UTC_TIME │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, US/Eastern] ┆ time ┆ time │
╞═════════════════════════╪══════════════════════════╪══════════╪══════════╡
│ 2022-01-03 04:00:00 UTC ┆ 2022-01-03 04:00:00 EST ┆ 09:00:00 ┆ 04:00:00 │
│ 2022-01-03 04:05:00 UTC ┆ 2022-01-03 04:05:00 EST ┆ 09:05:00 ┆ 04:05:00 │
│ 2022-01-03 04:10:00 UTC ┆ 2022-01-03 04:10:00 EST ┆ 09:10:00 ┆ 04:10:00 │
│ 2022-01-03 04:15:00 UTC ┆ 2022-01-03 04:15:00 EST ┆ 09:15:00 ┆ 04:15:00 │
│ … ┆ … ┆ … ┆ … │
│ 2022-03-11 10:45:00 UTC ┆ 2022-03-11 10:45:00 EST ┆ 15:45:00 ┆ 10:45:00 │
│ 2022-03-11 10:50:00 UTC ┆ 2022-03-11 10:50:00 EST ┆ 15:50:00 ┆ 10:50:00 │
│ 2022-03-11 10:55:00 UTC ┆ 2022-03-11 10:55:00 EST ┆ 15:55:00 ┆ 10:55:00 │
│ 2022-03-11 11:00:00 UTC ┆ 2022-03-11 11:00:00 EST ┆ 16:00:00 ┆ 11:00:00 │
└─────────────────────────┴──────────────────────────┴──────────┴──────────┘
The underlying timestamps does seem to show that I succeed in filtering on time range 9 to 16; and it is only the string representation which isn't in sync.
tdf.with_columns(
pl.col("UTC").dt.replace_time_zone(time_zone="US/Eastern").alias("NY_REPLACED"),
pl.col("UTC").dt.convert_time_zone(time_zone="US/Eastern").alias("NY_CONVERTED")
)
# output
shape: (19585, 3)
┌─────────────────────────┬──────────────────────────┬──────────────────────────┐
│ UTC ┆ NY_REPLACED ┆ NY_CONVERTED │
│ --- ┆ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, US/Eastern] ┆ datetime[ns, US/Eastern] │
╞═════════════════════════╪══════════════════════════╪══════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-03 00:00:00 EST ┆ 2022-01-02 19:00:00 EST │
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-03 00:05:00 EST ┆ 2022-01-02 19:05:00 EST │
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-03 00:10:00 EST ┆ 2022-01-02 19:10:00 EST │
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-03 00:15:00 EST ┆ 2022-01-02 19:15:00 EST │
│ … ┆ … ┆ … │
│ 2022-03-11 23:45:00 UTC ┆ 2022-03-11 23:45:00 EST ┆ 2022-03-11 18:45:00 EST │
│ 2022-03-11 23:50:00 UTC ┆ 2022-03-11 23:50:00 EST ┆ 2022-03-11 18:50:00 EST │
│ 2022-03-11 23:55:00 UTC ┆ 2022-03-11 23:55:00 EST ┆ 2022-03-11 18:55:00 EST │
│ 2022-03-12 00:00:00 UTC ┆ 2022-03-12 00:00:00 EST ┆ 2022-03-11 19:00:00 EST │
└─────────────────────────┴──────────────────────────┴──────────────────────────┘
This final step concludes that replaced_time_zone
and convert_time_zone
does not generate the same string representation for an identical time zone conversion operation.
More specifically, replaced_time_zone
does replace the underlying timestamps however it does not "replace" the string representation; what replaced_time_zone
does on the surface for UTC (or simply timezone-unaware timestamps) is like with_time_zone
, i.e., simply attaching time zone info to existing time stamps.
I'd appreciate if someone could clarify above.
Question Background
This is for polars users and devs such that all questions related to timezone conversion raised by me for polars
can be easily referenced.
I firstly raised this to see how we could filter based on converted timestamps (utc offset accounted for), then as polars
provided replace_time_zone
a question for it is also raised.
A related question metioned by @FObersteiner is here.
For the time being my understanding is that replace_time_zone
should provide the same string representation as convert_time_zone
, because essentially what they are doing it the same. And this would make filtering on date/time range in polars
WYSIWYG (What You See Is What You Get).
Background Information
Polars
provides replace_time_zone
and convert_time_zone
to handle scenarios like this.
convert_time_zone
changes only the string representation presented to the user, not the underlying timestamps. Therefore cast
is not going to work with "converted" timestamps in my use case (filtering would still be on the orignal underlying timestamps).
Regarding replace_time_zone
, DST transitions is causing problem when user tries to replace_time_zone
and this issue seems to be universal given that particular timestamp just doesn't exist because of DST transition:
# demo polars df
tpl = pl.DataFrame(
pl.date_range(
low=date(2022, 3, 13),
high=date(2022, 3, 15),
interval="5m",
time_unit="ns",
time_zone="UTC",
).alias("UTC")
)
tpl.select(
pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("NY")
)
# Panic Exception: no such local time
# demo pandas pf
tpn = pd.DataFrame(
pd.date_range(start="2022-03-13",
end="2022-03-15",
freq="5T", # 5 minute interval
tz=None, # default to UTC
inclusive="both"),
columns=["UTC"]
)
tpn.UTC.dt.tz_localize("America/New_York")
# NonExistentTimeError: 2022-03-13 02:00:00