
I have 180,000 rows of timestamps which I would like to parse into a datetime format like:

YYYY-MM-DD HH:MM:SS

Below are the timestamps (note lack of leading zeros in first 9 hours):

19-May-14 3:36:00 PM PDT
19-May-14 10:37:00 PM PDT

I have been parsing these dates with the parse_dates option when reading the file with pandas, but this is slow (typically ~80 seconds). I have also tried the dateutil parser, with similar results.

I would like to parse the timestamps faster, but I am having trouble with the varying widths of the timestamps. I found this SO solution, which seems quite similar to my problem, but I failed to adapt the method to timestamps of varying length.

Could someone recommend a feasible adaptation to the linked solution, or another better method?

Thank you

enmyj
  • I don't know the function you mentioned, but did you use `pd.to_datetime`? – joris Jan 08 '15 at 20:37
  • On my computer, `pd.to_datetime(dates)` takes some 25 seconds (with 180,000 records). This method is using `dateutil.parse` under the hood, so it is logical it gives similar results. I doubt that any of the answers will be much faster/more convenient. – joris Jan 08 '15 at 21:29
  • I used the code provided below and was able to drop the time to about 3.5 seconds. Maybe the specificity of that code allows it to run much faster? – enmyj Jan 08 '15 at 22:39
  • Ah, you're right. Then I am positively surprised! Side note, `pd.to_datetime(dates, format="%d-%b-%y %H:%M:%S")` also takes only about 2 seconds on my computer, but the problem is that this cannot handle the AM/PM. – joris Jan 08 '15 at 22:58
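
For reference, the AM/PM marker can be expressed in a format string with `%I` (12-hour clock) and `%p`; the part that has no fixed-format equivalent here is the trailing timezone abbreviation. Below is a minimal sketch using the standard library's `datetime.strptime`, assuming the timezone is always the last space-separated token and can be discarded. `parse_timestamp` is a hypothetical helper name, and `%b` and `%p` are locale-dependent, so an English or C locale is assumed.

```python
from datetime import datetime

# %I is the 12-hour clock hour (with or without a leading zero), %p is AM/PM.
FORMAT = "%d-%b-%y %I:%M:%S %p"

def parse_timestamp(text):
    # Drop the trailing timezone token (e.g. "PDT"); its value is ignored here.
    without_tz = text.strip().rsplit(" ", 1)[0]
    return datetime.strptime(without_tz, FORMAT)

print(parse_timestamp("19-May-14 3:36:00 PM PDT"))   # 2014-05-19 15:36:00
print(parse_timestamp("19-May-14 10:37:00 PM PDT"))  # 2014-05-19 22:37:00
```

The same directives should also be usable with `pd.to_datetime(..., format=...)` once the timezone has been removed from the strings, though that variant is not timed here.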

4 Answers


This solution builds on the accepted answer to the linked question and assumes that the time zone is exactly 3 characters long (its specific value is ignored).


You can extract the year, month and day based on their relative position to the beginning of the string, as follows:

month_abbreviations = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
day = int(line[0:2])
month = month_abbreviations[line[3:6]]
year = 2000 + int(line[7:9]) # this should be adapted to your specific use-case

You can extract the minutes, seconds and AM/PM based on their relative position to the end of the string, as follows:

AM_PM = line[-6:-4]
second = int(line[-9:-7])
minute = int(line[-12:-10])

You can extract the hour based on its relative position to the beginning and end of the string:

hour = int(line[10:-13])

Then convert the hour to a 24-hour clock according to the AM_PM value. Note that 12 AM maps to hour 0 and 12 PM stays at 12:

hour = hour % 12 + (12 if AM_PM == 'PM' else 0)

According to my measurements, this is slightly faster than a dict lookup, though not by much:

hour_shifter = {(0, 'AM'): 0, (0, 'PM'): 12,
                (1, 'AM'): 1, (1, 'PM'): 13,
                ...
                (11, 'AM'): 11, (11, 'PM'): 23,
                (12, 'AM'): 0, (12, 'PM'): 12}
hour = hour_shifter[(hour, AM_PM)]

Now you may instantiate the datetime object:

datetime.datetime(year, month, day, hour, minute, second)
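
Putting the pieces above together, here is a minimal sketch of a single function. The name `parse_fixed` is hypothetical, and the sketch assumes a 2-digit day, a 3-letter month, a 3-character time zone, no trailing whitespace, and years in the 2000s. It converts the hour so that 12 AM becomes 0 and 12 PM stays 12.

```python
import datetime

MONTH_ABBREVIATIONS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parse_fixed(line):
    # Date fields sit at fixed offsets from the start of the string.
    day = int(line[0:2])
    month = MONTH_ABBREVIATIONS[line[3:6]]
    year = 2000 + int(line[7:9])  # assumption: every year is 20xx
    # Minute, second and AM/PM sit at fixed offsets from the end.
    am_pm = line[-6:-4]
    second = int(line[-9:-7])
    minute = int(line[-12:-10])
    # The hour (1 or 2 digits) is whatever lies between the two.
    hour = int(line[10:-13])
    hour = hour % 12 + (12 if am_pm == 'PM' else 0)
    return datetime.datetime(year, month, day, hour, minute, second)

print(parse_fixed('19-May-14 3:36:00 PM PDT'))   # 2014-05-19 15:36:00
print(parse_fixed('19-May-14 10:37:00 PM PDT'))  # 2014-05-19 22:37:00
```

If the input may carry trailing whitespace or newlines, calling `line.strip()` first keeps the negative offsets valid.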
Yoel
  • thank you very much! Your solution dropped the time to execute my code to around 3.5 seconds, a much needed speed boost! Cheers. – enmyj Jan 08 '15 at 22:35

How about using a regex? Can you provide your data file for testing?

import re

patt = re.compile(r'(?P<day>\d\d)-(?P<month>\w+)-(?P<year>\d\d)'
                  r' (?P<hour>\d{1,2}):(?P<minute>\d\d):(?P<second>\d\d)'
                  r' (?P<noon>\w\w) (?P<tz>\w+)')

for date in dates:  # dates: an iterable of timestamp strings
    res = patt.match(date)
    print(res.groupdict())

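Then convert day, month, year, etc. to integers and create a timezone object. Note that pytz does not accept abbreviations such as 'PDT' as zone names, so they have to be mapped to a zone name first; the mapping below is an assumption to adapt to your data:

from pytz import timezone
tz_names = {'PDT': 'US/Pacific', 'PST': 'US/Pacific'}  # assumed mapping
tz = timezone(tz_names[res.groupdict()['tz']])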
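
The conversion step can be sketched as follows. This is only an illustration under a few assumptions: the helper name `parse_with_regex` is hypothetical, all years are taken to be in the 2000s, the pattern is tightened slightly (3-letter month, AM or PM only), and the timezone is matched but not applied, so the result is a naive datetime.

```python
import re
from datetime import datetime

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

PATTERN = re.compile(r'(?P<day>\d\d)-(?P<month>[A-Za-z]{3})-(?P<year>\d\d)'
                     r' (?P<hour>\d{1,2}):(?P<minute>\d\d):(?P<second>\d\d)'
                     r' (?P<noon>[AP]M) (?P<tz>\w+)')

def parse_with_regex(text):
    match = PATTERN.match(text)
    if match is None:
        raise ValueError('unrecognised timestamp: %r' % text)
    fields = match.groupdict()
    # 12-hour to 24-hour clock: 12 AM -> 0, 12 PM -> 12.
    hour = int(fields['hour']) % 12 + (12 if fields['noon'] == 'PM' else 0)
    return datetime(2000 + int(fields['year']),  # assumption: years are 20xx
                    MONTHS[fields['month']], int(fields['day']),
                    hour, int(fields['minute']), int(fields['second']))

print(parse_with_regex('19-May-14 3:36:00 PM PDT'))  # 2014-05-19 15:36:00
```

Compiling the pattern once at module level, rather than inside the loop, avoids repeating that work for each of the 180,000 rows.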
Andrew_Lvov

First, some questions.

  1. You show the hour as having either 1 or 2 characters. Does the day vary as well, or is it always 2 characters?
  2. What are you doing with the timezone? Chucking it?
  3. How are you dealing with years that look like they are from the 1900s? Do you have to deal with future dates at all? Are you certain that year 48 means 1948 and not 2048?

Here is what I would try. First build some lookup dictionaries for year and month.

months = {'Jan': '01', 'Feb': '02', ... 'Dec': '12'} 
years = {}
for i in range(50, 100):
    years[str(i)] = '19' + str(i)
for i in range(0, 50):
    # zero-pad so that '05' maps to '2005' rather than '5' to '205'
    years['%02d' % i] = '20%02d' % i

Loop through each record and

  1. split each string at the spaces
  2. extract the day, month and year substrings from the date string. Lookup year and month from the dictionaries. Use day as it is.
  3. Split the minutes and seconds from the hour component of the time. Minutes and seconds are good to go in their text form.
  4. Extract the integer value of the hour. Add 12 if the 3rd field from the split operation is 'PM', account for case if necessary.
  5. Re-assemble everything into your target format. Pad the hour with a zero if it is only one character.
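
The steps above can be sketched as a pure string transformation. This is an illustrative sketch, not a tuned implementation: the function name `reformat_timestamp` and the cutoff of 50 for the century are assumptions, and the timezone field is simply dropped.

```python
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

# Two-digit year -> four-digit year, with an assumed cutoff at 50.
YEARS = {'%02d' % i: ('19' if i >= 50 else '20') + '%02d' % i
         for i in range(100)}

def reformat_timestamp(text):
    # Step 1: split at the spaces (date, time, AM/PM; the timezone is dropped).
    date_part, time_part, am_pm = text.split()[:3]
    # Step 2: day is used as-is, month and year come from the lookups.
    day, month, year = date_part.split('-')
    # Step 3: minutes and seconds stay in their text form.
    hour, minute, second = time_part.split(':')
    # Step 4: 12-hour to 24-hour clock (12 AM -> 0, 12 PM -> 12).
    hour24 = int(hour) % 12 + (12 if am_pm.upper() == 'PM' else 0)
    # Step 5: re-assemble, zero-padding the hour.
    return '%s-%s-%s %02d:%s:%s' % (YEARS[year], MONTHS[month], day,
                                    hour24, minute, second)

print(reformat_timestamp('19-May-14 3:36:00 PM PDT'))  # 2014-05-19 15:36:00
```

Because the result is already a string in the target format, no datetime object is created; if one is needed later, the string can be parsed at that point.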

Might be wise to test whether the year dictionary outperforms converting the two digit years to ints, checking the value, and adding either 1900 or 2000 depending on your chosen cutoff. I would expect the dictionary to win, but it is hard to say.

bigh_29

Presuming that the "14" in your date string corresponds to 2014:

import datetime

month_abbr = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5,'Jun':6, 
              'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12
              }   

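def format_date(date_str):
    # Fields: date, time, AM/PM marker, timezone (the timezone is ignored)
    date_part, time_part, am_pm = date_str.split(' ')[:3]
    day, month, year = date_part.split('-')
    hour, minute, sec = time_part.split(':')
    # Convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12
    hour = int(hour) % 12 + (12 if am_pm == 'PM' else 0)
    return datetime.datetime(int(year) + 2000, month_abbr[month],
                             int(day), hour, int(minute), int(sec))


date_str = '19-May-14 3:36:00 PM PDT'
#date_str = '19-May-14 10:37:00 PM PDT'
formatted_date = format_date(date_str)
print(formatted_date)
2014-05-19 15:36:00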

The default string format for a datetime object is YYYY-MM-DD HH:MM:SS, so you don't need to specify a format in this case. If you need a different one in the future, check out the strftime method of datetime objects.

If "14" can toggle between the 1900s and 2000s, then you'll need to (1) know that info before ingesting the date string and (2) tweak the above code to either add 1900 or 2000 to the year.

N1B4