
I have a dataframe with a column for the time zone and a column for the datetime. I would like to convert these to UTC first to join with other data, and then I'll have some calculations to convert from UTC to the viewers local time zone eventually.

datetime              time_zone
2016-09-19 01:29:13   America/Bogota 
2016-09-19 02:16:04   America/New_York
2016-09-19 01:57:54   Africa/Cairo
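(For anyone reproducing this: the sample frame can be built roughly as below. The time-zone column shown as time_zone above is referred to as 'TZ' in the code that follows, so that name is assumed here.)

import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2016-09-19 01:29:13',
        '2016-09-19 02:16:04',
        '2016-09-19 01:57:54',
    ]),
    'TZ': ['America/Bogota', 'America/New_York', 'Africa/Cairo'],  # tz database names
})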

def create_utc(df, column, time_format='%Y-%m-%d %H:%M:%S'):
    timezone = df['TZ']
    df[column + '_utc'] = df[column].dt.tz_localize(timezone).dt.tz_convert('UTC').dt.strftime(time_format)
    df[column + '_utc'].replace('NaT', np.nan, inplace=True)
    df[column + '_utc'] = pd.to_datetime(df[column + '_utc'])
    return df

That was my flawed attempt. The error says the truth value is ambiguous, which makes sense because the timezone variable refers to an entire column rather than a single value. How do I refer to the value in the same row?
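(One way to refer to the value in the same row is a row-wise apply; this is only a sketch, slow on large frames, and assumes the time-zone column is the 'TZ' column used in the snippets below.)

df['datetime_utc'] = df.apply(
    lambda row: row['datetime'].tz_localize(row['TZ']).tz_convert('UTC'),  # scalar localize/convert per row
    axis=1)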

Edit: here are some results from the answers below on one day of data (394,000 rows and 22 unique time zones).

Edit 2: I added a groupby example in case someone wants to see the results. It is the fastest, by far.

%%timeit

for tz in df['TZ'].unique():
    df.ix[df['TZ'] == tz, 'datetime_utc2'] = df.ix[df['TZ'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc2'] = df['datetime_utc2'].dt.tz_localize(None)

1 loops, best of 3: 1.27 s per loop

%%timeit

df['datetime_utc'] = [d['datetime'].tz_localize(d['TZ']).tz_convert('UTC') for i, d in df.iterrows()]
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)

1 loops, best of 3: 50.3 s per loop

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('TZ')])



**1 loops, best of 3: 249 ms per loop**
JJJ
trench

2 Answers


Here is a vectorized approach (it will loop df.time_zone.nunique() times):

In [2]: t
Out[2]:
             datetime         time_zone
0 2016-09-19 01:29:13    America/Bogota
1 2016-09-19 02:16:04  America/New_York
2 2016-09-19 01:57:54      Africa/Cairo
3 2016-09-19 11:00:00    America/Bogota
4 2016-09-19 12:00:00  America/New_York
5 2016-09-19 13:00:00      Africa/Cairo

In [3]: for tz in t.time_zone.unique():
   ...:         mask = (t.time_zone == tz)
   ...:         t.loc[mask, 'datetime'] = \
   ...:             t.loc[mask, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
   ...:

In [4]: t
Out[4]:
             datetime         time_zone
0 2016-09-19 06:29:13    America/Bogota
1 2016-09-19 06:16:04  America/New_York
2 2016-09-18 23:57:54      Africa/Cairo
3 2016-09-19 16:00:00    America/Bogota
4 2016-09-19 16:00:00  America/New_York
5 2016-09-19 11:00:00      Africa/Cairo

UPDATE:

In [12]: df['new'] = df.groupby('time_zone')['datetime'] \
                       .transform(lambda x: x.dt.tz_localize(x.name))

In [13]: df
Out[13]:
             datetime         time_zone                 new
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00    America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00  America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00      Africa/Cairo 2016-09-19 11:00:00
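Within the transform, x is the datetime Series for one time-zone group and x.name is that group's key, i.e. the time-zone string. A more explicit sketch of the same idea (not the exact code above) converts to UTC and drops the offset so the result stays timezone-naive:

df['new'] = (df.groupby('time_zone')['datetime']
               .transform(lambda x: x.dt.tz_localize(x.name)    # x.name is the group key (the tz string)
                                     .dt.tz_convert('UTC')      # convert the localized times to UTC
                                     .dt.tz_localize(None)))    # drop the offset -> naive UTC timestamps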
MaxU - stand with Ukraine
  • I'll vote this as the answer. I am getting sub 1 second speeds for over 320k rows versus almost 1 minute for the other answer. I am just hoping someone also shares a groupby snippet I can test too. – trench Sep 23 '16 at 17:51
  • Hello, I don't understand where "name" comes from in tz_localize(x.name). Thanks – Solal Jul 20 '20 at 09:44
  • Also, I don't get the same format as you; I get the value with the UTC offset attached: 2020-07-20 20:30:00-07:00 – Solal Jul 20 '20 at 09:47
  • This code seems to work for me, but I'm also confused about the 'x.name' piece. What is the .name doing? (My guess is that it refers to the groupby group name string, but I don't know for sure since I can't find documentation, nor do I know how to test that this is the case.) Thanks! – amquack Aug 27 '21 at 16:56
  • @amquack, try this: `df.groupby('time_zone')['datetime'].apply(lambda x: print(x.name))` ;) – MaxU - stand with Ukraine Aug 27 '21 at 18:55

Your issue is that tz_localize() can only take a scalar value, so we'll have to iterate through the DataFrame:

df['datetime_utc'] = [d['datetime'].tz_localize(d['time_zone']).tz_convert('UTC') for i,d in df.iterrows()]

The result is:

            datetime         time_zone              datetime_utc
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13+00:00
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04+00:00
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54+00:00

An alternative approach is to group by the timezone and convert all matching rows in one pass:

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('time_zone')])
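If a timezone-naive UTC column is needed for joining (as in the timings in the question), the offset can be dropped afterwards:

df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)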
chrisaycock
  • Let me try this. I just never got around to enumerate or iterrows. I just never know when to use those commands – trench Sep 22 '16 at 19:44
  • A way to do this is to groupby with a transform on time_zone and convert all of that zone - IIRC I solved this in an SO question once – Jeff Sep 22 '16 at 23:12
  • I tried to look through your answer history but didn't see it. The method above seemed to work but it would be cool to see the groupby example if you ever find it. I am doing this to about 2 mil rows per month so it would be nice to test the speed of each method – trench Sep 23 '16 at 11:48
  • @trench, how many unique time zones do you have in your 2 mil. rows DF? `print(df.time_zone.nunique())` - should give you a number... – MaxU - stand with Ukraine Sep 23 '16 at 14:05
  • I checked some data from the last few days and there were 32. – trench Sep 23 '16 at 14:40
  • I added some %%timeit results. The loop and .ix approach was 1 second versus 50 for the iterrows. I'd be interested in testing the groupby approach too if someone can give me a sample snippet to work with – trench Sep 23 '16 at 15:21
  • @trench I've updated my answer to include the `groupby`. Can you check the performance? – chrisaycock Sep 23 '16 at 19:00