4

I have about 800,000 rows of data in a dataframe, and one column of the data df['Date'] is string of time and date 'YYYY-MM-DD HH:MM:SS.fff', which doesn't have timezone information. However I know they are in New_York timezone and they need to be convert into CET. Now I have two methods to get the job done:

method 1 (very slow for sure):

df['Date'].apply(lambda x: timezone('America/New_York')\
            .localize(datetime.datetime.strptime(x,'%Y%m%d%H:%M:%S.%f'))\
            .astimezone(timezone('CET')))

method 2 :

df.index = pd.to_datetime(df['Date'],format='%Y%m%d%H:%M:%S.%f')
df.index.tz_localize('America/New_York').tz_convert('CET')

I am just wondering if there are any other better ways to do it? or any potential pitfalls of the methods I listed? Thanks!

Also, I would like to shift all timestamp by a fix amount of time, such as 1ms timedelta(0,0,1000), how can I implement it using method 2?

firelynx
  • 30,616
  • 9
  • 91
  • 101
user6396
  • 1,832
  • 6
  • 23
  • 38
  • I'd expect the second method to be significantly faster, the first is going to perform the conversion row-wise and this would be how I'd do this – EdChum May 14 '15 at 08:07
  • In any case, the second one is the "pandas" way of doing this (using the built-in functions), so no need to look for other ways I think – joris May 14 '15 at 12:45
  • If I want to shift all timestamp by a centain time such 1ms (timedelta(0,0,1000)), how can I do it? – user6396 May 14 '15 at 13:59
  • 1
    You can just construct a timedelta and add it to offset it e.g `df['Date'] + pd.Timedelta(1, unit='ms')` – EdChum May 14 '15 at 22:51

1 Answers1

3

Method 2 is definately the best way of doing this.

However, it occurs to me that you are formatting this date after you have loaded the data.

It is much faster to parse dates on load of a file, than it is to change them after you have loaded it. (Not to mention cleaner)

If your data is loaded from a csv file using the pandas.read_csv() function for instance, then you can use the parse_dates= option and the date_parser= option.

You can try it out directly with your lambda function as the date_parser= and just set the parse_dates= to a list of your date columns.

Like this:

pd.read_csv('myfile.csv', parse_dates=['Date'] date_parser=lambda x: timezone('America/New_York')\
        .localize(datetime.datetime.strptime(x,'%Y%m%d%H:%M:%S.%f'))\
        .astimezone(timezone('CET')))

Should work and will probably be the fastest.

firelynx
  • 30,616
  • 9
  • 91
  • 101