2

I wanted to add/subtract the UTC offset (usually in hours) to/from a datetime object in polars, but I don't see a way to do this. The UTC offset can be dynamic, because Daylight Saving Time comes into play during a calendar year (e.g., EST/EDT correspond to UTC offsets of 5 and 4 hours respectively).

from datetime import date, datetime
import pytz
import polars as pl

# Make a datetime-only dataframe that covers the DST period of the year, in UTC time first.
df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("timestamp")
)

# Convert timezone to "America/New_York", which covers both EST and EDT.
us_df = df.with_column(
    pl.col("timestamp")
    .dt.cast_time_zone(tz="America/New_York")
    .alias("datetime")
)

# Check us_df output
us_df
# Output: `polars` is showing US time without the UTC offset.
# Before 0.14.22, `polars` showed the time with the UTC offset applied,
# i.e., `23:45:00 UTC` should be `19:45:00 EDT`.
# Now `polars` shows `15:45:00 EDT`, missing 4 hours of offset.
┌─────────────────────────┬────────────────────────────────┐
│ timestamp               ┆ datetime                       │
│ ---                     ┆ ---                            │
│ datetime[ns, UTC]       ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 14:00:00 EST        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 14:05:00 EST        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 14:10:00 EST        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 14:15:00 EST        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                     ┆ ...                            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 15:45:00 EDT        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 15:50:00 EDT        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 15:55:00 EDT        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 16:00:00 EDT        │
└─────────────────────────┴────────────────────────────────┘

Converting with to_pandas, we can observe that the underlying datetime objects do not include the 4-hour offset in the actual time either (remember EST is also in this dataframe, with a 5-hour offset):

# Convert to pandas
us_pd = us_df.to_pandas()
us_pd
# output

                      timestamp                  datetime
0     2022-01-03 00:00:00+00:00 2022-01-02 14:00:00-05:00
1     2022-01-03 00:05:00+00:00 2022-01-02 14:05:00-05:00
2     2022-01-03 00:10:00+00:00 2022-01-02 14:10:00-05:00
3     2022-01-03 00:15:00+00:00 2022-01-02 14:15:00-05:00
4     2022-01-03 00:20:00+00:00 2022-01-02 14:20:00-05:00
...                         ...                       ...
77756 2022-09-29 23:40:00+00:00 2022-09-29 15:40:00-04:00
77757 2022-09-29 23:45:00+00:00 2022-09-29 15:45:00-04:00
77758 2022-09-29 23:50:00+00:00 2022-09-29 15:50:00-04:00
77759 2022-09-29 23:55:00+00:00 2022-09-29 15:55:00-04:00
77760 2022-09-30 00:00:00+00:00 2022-09-29 16:00:00-04:00

What I wanted was to include the UTC offset in the actual time, so that I can filter on the time in a natural way. For instance, if 2300 UTC is 1900 EDT, I want to filter using 1900 directly (note that I can't just add/subtract the UTC offset on the fly during filtering, as the number of hours is dynamic due to DST).

The underlying Python datetime does have a utcoffset() method, which can be applied to each datetime object, but I'd need to convert from polars to pandas first (I don't see how to do this within polars).
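If going through pandas anyway, applying `utcoffset()` element-wise could look roughly like the following (a sketch; the sample values are illustrative, not taken from the dataframe above):

```python
from datetime import timedelta
import pandas as pd

# A tz-aware series covering both sides of the DST transition.
s = (
    pd.Series(pd.to_datetime(["2022-01-03 00:00:00", "2022-09-29 23:45:00"]))
    .dt.tz_localize("UTC")
    .dt.tz_convert("America/New_York")
)

# Timestamp.utcoffset() returns a timedelta, so the offset stays dynamic:
# -5 hours in January (EST), -4 hours in September (EDT).
offsets = s.map(lambda t: t.utcoffset())
```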

I've also observed this peculiar difference:

  us_pd.datetime[us_pd.shape[0] - 1].to_pydatetime()

  # We can see it is identical to what's already in the `polars` and `pandas` dataframes.

  datetime.datetime(2022, 9, 29, 16, 0, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)

  # Now we create a single datetime object with an arbitrary UTC time and convert it to New York time.

  datetime(2022, 9, 30, 22, 45, 0, 0, pytz.utc).astimezone(pytz.timezone("America/New_York"))

  # The representation here is the correct New York time (i.e., the offset has been included).

  datetime.datetime(2022, 9, 30, 18, 45, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)
  
  
FObersteiner
stucash
    You are saying the time "does not include .. the offset", but the real issue here is that the timezone correction is being applied TWICE, right? 0000 UTC should become 1900 EST on the previous day, and you're seeing 1400. – Tim Roberts Oct 22 '22 at 19:51
  • @TimRoberts You are very right, thanks for pointing this out! That actually is the issue here. – stucash Feb 20 '23 at 09:10

2 Answers

4

py-polars 0.16.3 update: It seems you're looking for `convert_time_zone`. Example:

from datetime import date
import polars as pl

df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("timestamp")
)

us_df = df.with_columns(
    pl.col("timestamp").dt.convert_time_zone(time_zone="America/New_York").alias("datetime")
)


┌─────────────────────────┬────────────────────────────────┐
│ timestamp               ┆ datetime                       │
│ ---                     ┆ ---                            │
│ datetime[ns, UTC]       ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 19:00:00 EST        │
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 19:05:00 EST        │
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 19:10:00 EST        │
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 19:15:00 EST        │
│ ...                     ┆ ...                            │
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 19:45:00 EDT        │
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 19:50:00 EDT        │
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 19:55:00 EDT        │
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 20:00:00 EDT        │
└─────────────────────────┴────────────────────────────────┘
FObersteiner
1

Update - this response is out of date; see the answer above using .dt.convert_time_zone

braaannigan
  • hey sorry it's been quite some time.. long story short, flashing the BIOS corrupted my PC while I was moving to a new place... all I had left was the stuff in my gitlab repos. – stucash Jan 10 '23 at 16:28
  • So back to the question, did you read the comment by @Tim Roberts? To me his point was exactly what you have pointed out in your answer: the offset correction was applied twice whereas it should only be applied once; but as you have explained, it is only the string representation that seems out of sync (because the underlying numeric representation correctly applied the 5-hour offset only once). – stucash Jan 10 '23 at 16:30
  • I'd suggest looking at the new answer, which uses the updated 0.16.3 API – ignoring_gravity Feb 11 '23 at 10:02
  • My response is out of date with the new release, I've flagged it for deletion with a moderator – braaannigan Feb 16 '23 at 13:27