Efficiently convert timezones in pandas dataframe

Question

I have a large pandas dataframe (tens of millions of rows) which includes a column for UTC time and the time zone. I want to create a column which contains the local time for the row, based on these two other columns.

My original attempt was using df.apply which worked on a small sample that I was testing on, but is very slow and isn't good enough to work on the whole data:

df['LoginTimeLocal'] = \ 
    df.apply(lambda row: row.LoginTimeUtc.tz_localize('UTC').tz_convert(row.TimeZoneCode))

This results in a new column being added which contains a datetime in local time, with timezone information.

I came across this answer which provides an efficient, vectorized way to do something similar. I re-purposed this code to do what I want, but it appears to only work if the new column only contains dates with the same time zone (or no time zone information). Here's my code:

# localize all utc dates
df['LoginTimeUtc'] = df['LoginTimeUtc'].dt.tz_localize('UTC')

# initialize LoginTimeLocal column (probably not necessary)
df['LoginTimeLocal'] = df['LoginTimeUtc']

# for every time zone in the data
for tz in df.TimeZoneCode.unique():
    mask = (df.TimeZoneCode == tz)

    # make entries in a new column with converted timezone
    df.loc[mask, 'LoginTimeLocal'] = \ 
        df.loc[mask,'LoginTimeLocal'].dt.tz_convert(tz)

If I run this on a sample of the data that only contains dates from one timezone (i.e., len(df.TimeZoneCode.unique()) = 1), it works fine. As soon as there are two or more timezones in the dataframe, I get a ValueError: incompatible or non tz-aware value.

Can anyone see what is going wrong here?

score 2 · Accepted Answer · answered Dec 14 '17 at 21:31

Demo:

Source DF:

In [11]: df
Out[11]:
             datetime         time_zone
0 2016-09-19 01:29:13    America/Bogota
1 2016-09-19 02:16:04  America/New_York
2 2016-09-19 01:57:54      Africa/Cairo
3 2016-09-19 11:00:00    America/Bogota
4 2016-09-19 12:00:00  America/New_York
5 2016-09-19 13:00:00      Africa/Cairo

Solution:

In [12]: df['new'] =  df.groupby('time_zone')['datetime'] \
                        .transform(lambda x: x.dt.tz_localize(x.name))

In [13]: df
Out[13]:
             datetime         time_zone                 new
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00    America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00  America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00      Africa/Cairo 2016-09-19 11:00:00

@timleathart, usually that happens when you want to convert a timestamp that doesn't exist because of 1 hour jump when switching from winter to summer time... — MaxU - stand with Ukraine, Dec 14 '17 at 21:41

Efficiently convert timezones in pandas dataframe

1 Answers1

Linked