I have a large pandas dataframe (tens of millions of rows) which includes a column for UTC time and the time zone. I want to create a column which contains the local time for the row, based on these two other columns.
My original attempt was using df.apply
which worked on a small sample that I was testing on, but is very slow and isn't good enough to work on the whole data:
df['LoginTimeLocal'] = \
df.apply(lambda row: row.LoginTimeUtc.tz_localize('UTC').tz_convert(row.TimeZoneCode))
This results in a new column being added which contains a datetime in local time, with timezone information.
I came across this answer which provides an efficient, vectorized way to do something similar. I re-purposed this code to do what I want, but it appears to only work if the new column only contains dates with the same time zone (or no time zone information). Here's my code:
# localize all utc dates
df['LoginTimeUtc'] = df['LoginTimeUtc'].dt.tz_localize('UTC')
# initialize LoginTimeLocal column (probably not necessary)
df['LoginTimeLocal'] = df['LoginTimeUtc']
# for every time zone in the data
for tz in df.TimeZoneCode.unique():
mask = (df.TimeZoneCode == tz)
# make entries in a new column with converted timezone
df.loc[mask, 'LoginTimeLocal'] = \
df.loc[mask,'LoginTimeLocal'].dt.tz_convert(tz)
If I run this on a sample of the data that only contains dates from one timezone (i.e., len(df.TimeZoneCode.unique()) = 1
), it works fine. As soon as there are two or more timezones in the dataframe, I get a ValueError: incompatible or non tz-aware value
.
Can anyone see what is going wrong here?