2

From an online API I gather a series of data points, each with a value and an ISO timestamp. Unfortunately I need to loop over them, so I store them in a temporary dict and then create a pandas dataframe from that and set the index to the timestamp column (simplified example):

from datetime import datetime
import pandas


input_data = [
    '2019-09-16T06:44:01+02:00',
    '2019-11-11T09:13:01+01:00',
]

data = []
for timestamp in input_data:
    _date = datetime.fromisoformat(timestamp)

    data.append({'time': _date})

pd_data = pandas.DataFrame(data).set_index('time')

As long as all timestamps are in the same timezone and DST/non-DST everything works fine, and, I get a Dataframe with a DatetimeIndex which I can work on later. However, once two different time-offsets appear in one dataset (above example), I only get an Index, in my dataframe, which does not support any time-based methods.

Is there any way to make pandas accept timezone-aware, differing date as index?

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Timm
  • 202
  • 1
  • 11

2 Answers2

2

A minor correction of the question's wording, which I think is important. What you have are UTC offsets - DST/no-DST would require more information than that, i.e. a time zone. Here, this matters since you can parse timestamps with UTC offsets (even different ones) to UTC easily:

import pandas as pd

input_data = [
    '2019-09-16T06:44:01+02:00',
    '2019-11-11T09:13:01+01:00',
]

dti = pd.to_datetime(input_data, utc=True)
# dti
# DatetimeIndex(['2019-09-16 04:44:01+00:00', '2019-11-11 08:13:01+00:00'], dtype='datetime64[ns, UTC]', freq=None)

I prefer to work with UTC so I'd be fine with that. If however you need date/time in a certain time zone, you can convert e.g. like

dti = dti.tz_convert('Europe/Berlin')
# dti
# DatetimeIndex(['2019-09-16 06:44:01+02:00', '2019-11-11 09:13:01+01:00'], dtype='datetime64[ns, Europe/Berlin]', freq=None)
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • Your correction is absolutely right and important. However, with your answer all information on the original timezone is forgotten, which is okay in the example, but once you merge data from multiple sources I'd like to have it stored somewhere. – Timm Aug 20 '20 at 12:27
  • @Timm: still, if you *know* the time zone, why go through the pain of keeping offsets stored somewhere? Vice versa, if you only know the UTC offset, you cannot infer a specific time zone unambiguously, so why care? ;-) – FObersteiner Aug 20 '20 at 13:34
  • When I keep the offset in the original time, I cannot create a `DatetimeIndex` when the UTC-offset changes. When I remove it, merging data from different sources might/will result in it either being aligned by _sunrise/-set_, or I move to UTC and have it aligned by time. Both methods might require some additional, manual step if I need the other representation, and requires me to have this information readily available. I'm probably overengineering this. But I'm horribly frightened of datetime data in coding, it always makes something fail somewhere. – Timm Aug 21 '20 at 08:15
  • @Timm: "*it always makes something fail somewhere.*" - especially in Python, there're some pitfalls. Since I don't know any details of your work, my thoughts on this are pretty general, don't get me wrong. All I can say from my experience is that it's best to keep everything in UTC internally. This avoids many conversion mistakes. Only convert to some local time if you want a human to read it, e.g. for a presentation, display on a webpage or smth like that. – FObersteiner Aug 21 '20 at 08:22
  • I agree, and that's what I'm doing now. I used your suggestion with `utc=True`, perform all operations in UTC, but store the offset to allow reverting for human-friendly, timezone based output. – Timm Aug 21 '20 at 08:25
1
  • A pandas datetime column also requires the offset to be the same. A column with different offsets, will not be converted to a datetime dtype.
  • I suggest, do not convert the data to a datetime until it's in pandas.
  • Separate the time offset, and treat it as a timedelta
  • to_timedelta requires a format of 'hh:mm:ss' so add ':00' to the end of the offset
  • See Pandas: Time deltas for all the available timedelta operations
  • pandas.Series.dt.tz_convert
  • pandas.Series.tz_localize
  • Convert to a specific TZ with:
    • If a datetime is not datetime64[ns, UTC] dtype, then first use .dt.tz_localize('UTC') before .dt.tz_convert('US/Pacific')
    • Otherwise df.datetime_utc.dt.tz_convert('US/Pacific')
import pandas as pd

# sample data
input_data = ['2019-09-16T06:44:01+02:00', '2019-11-11T09:13:01+01:00']

# dataframe
df = pd.DataFrame(input_data, columns=['datetime'])

# separate the offset from the datetime and convert it to a timedelta
df['offset'] = pd.to_timedelta(df.datetime.str[-6:] + ':00')

# if desired, create a str with the separated datetime
# converting this to a datetime will lead to AmbiguousTimeError because of overlapping datetimes at 2AM, per the OP
df['datetime_str'] = df.datetime.str[:-6]

# convert the datetime column to a datetime format without the offset
df['datetime_utc'] = pd.to_datetime(df.datetime, utc=True)

# display(df)
                    datetime          offset        datetime_str              datetime_utc
0  2019-09-16T06:44:01+02:00 0 days 02:00:00 2019-09-16 06:44:01 2019-09-16 04:44:01+00:00
1  2019-11-11T09:13:01+01:00 0 days 01:00:00 2019-11-11 09:13:01 2019-11-11 08:13:01+00:00

print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   datetime      2 non-null      object             
 1   offset        2 non-null      timedelta64[ns]    
 2   datetime_str  2 non-null      object             
 3   datetime_utc  2 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(2), timedelta64[ns](1)
memory usage: 192.0+ bytes

# convert to local timezone
df.datetime_utc.dt.tz_convert('US/Pacific')

[out]:
0   2019-09-15 21:44:01-07:00
1   2019-11-11 00:13:01-08:00
Name: datetime_utc, dtype: datetime64[ns, US/Pacific]

Other Resources

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158