10

I have some log-parsing code that needs to turn a timestamp into a datetime object. I am using datetime.strptime, but according to the cumtime column in cProfile's output this function accounts for a lot of CPU time. The timestamps are in the format 01/Nov/2010:07:49:33.

The current function is:

new_entry['time'] = datetime.strptime(
    parsed_line['day'] +
    parsed_line['month'] +
    parsed_line['year'] +
    parsed_line['hour'] +
    parsed_line['minute'] +
    parsed_line['second'],
    "%d%b%Y%H%M%S"
)

Anyone know how I might optimize this?

Kyle Brandt

4 Answers

16

If the timestamps are fixed-width, there is no need to parse the line: you can use slicing and a dictionary lookup to get the fields directly.

import datetime

# line holds the fixed-width timestamp, e.g. "01/Nov/2010:07:49:33"
month_abbreviations = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
year = int(line[7:11])
month = month_abbreviations[line[3:6]]
day = int(line[0:2])
hour = int(line[12:14])
minute = int(line[15:17])
second = int(line[18:20])
new_entry['time'] = datetime.datetime(year, month, day, hour, minute, second)

Testing in the manner shown by Glenn Maynard shows this to be about 3 times faster.
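As a minimal sketch, the slicing approach above could be wrapped in a function so it can be swapped in wherever strptime() was called. The name `parse_timestamp` is hypothetical, and the code assumes every timestamp has exactly the fixed-width layout DD/Mon/YYYY:HH:MM:SS with English month abbreviations.

```python
from datetime import datetime

# Lookup table for English three-letter month abbreviations.
MONTH_ABBREVIATIONS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4,
                       'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8,
                       'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}


def parse_timestamp(ts):
    """Build a datetime from a fixed-width timestamp by slicing.

    Hypothetical helper; assumes the layout DD/Mon/YYYY:HH:MM:SS.
    """
    return datetime(
        int(ts[7:11]),                 # year
        MONTH_ABBREVIATIONS[ts[3:6]],  # month
        int(ts[0:2]),                  # day
        int(ts[12:14]),                # hour
        int(ts[15:17]),                # minute
        int(ts[18:20]),                # second
    )


print(parse_timestamp("01/Nov/2010:07:49:33"))  # 2010-11-01 07:49:33
```

A malformed or differently padded timestamp would raise a KeyError or ValueError here, or silently pick up the wrong fields, so this only fits logs whose format is known to be rigid.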

Mark Ransom
  • Made this into a function and tested it in my code against the same 1 million log lines several times, going back and forth between this and strptime(). Total parse time consistently went down from 80 to 50 seconds! – Kyle Brandt Nov 02 '10 at 18:40
  • Good solution. Could you please also suggest what I can do if the hours are in 12-hour format? Is there any other way to handle that besides adding conditions and doing it manually? – Naman Apr 15 '15 at 05:47
  • @Naman you could add `am_pm_offset={'AM':0,'PM':12}` and add that to the hours. – Mark Ransom Apr 15 '15 at 11:49
  • @MarkRansom Sorry for getting back on this so late, but adding the offset doesn't work: 12:45 P.M. is a valid time, but with the offset added it becomes 24:45, which is not. http://en.wikipedia.org/wiki/12-hour_clock . Any other fast method? I don't want to add conditions. – Naman Apr 27 '15 at 04:19
  • @Naman you're absolutely right, sorry I didn't think of that myself. You can use modular arithmetic to fix it: `hour = int(line[12:14]) % 12 + am_pm_offset[??]` – Mark Ransom Apr 27 '15 at 04:25
  • Thanks. This works perfectly and surprisingly a little faster than if condition checking. – Naman Apr 27 '15 at 04:44
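The modular-arithmetic fix from the comments can be sketched as follows. The function name `to_24_hour` is hypothetical, and since the comments do not say where the AM/PM marker sits in the line, the marker is passed in as an argument here rather than sliced out.

```python
# Offsets as suggested in the comments above.
AM_PM_OFFSET = {'AM': 0, 'PM': 12}


def to_24_hour(hour12, marker):
    """Convert a 12-hour clock value to a 24-hour one without branching.

    Hypothetical helper. hour12 % 12 maps 12 to 0, so 12 AM becomes 0
    and 12 PM becomes 12 instead of the invalid 24.
    """
    return hour12 % 12 + AM_PM_OFFSET[marker]


print(to_24_hour(12, 'PM'))  # 12
print(to_24_hour(12, 'AM'))  # 0
print(to_24_hour(1, 'PM'))   # 13
```

The `% 12` step is what makes the plain offset safe: every hour other than 12 is unchanged by it, and 12 wraps to 0 before the offset is added.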
3

It seems that strptime() on a Windows platform uses a Python implementation (_strptime.py in the Lib directory) and not a C one. It might be quicker to process the string yourself.

from datetime import datetime
import timeit

def f():
    datetime.strptime("2010-11-01", "%Y-%m-%d")

n = 100000
# Average seconds per call
print("%.6f" % (timeit.timeit(f, number=n) / n))

returns 0.000049 on my system, whereas

from datetime import date
import timeit

def f():
    # Split on the separator and convert each field by hand
    parts = [int(x) for x in "2010-11-01".split("-")]
    return date(parts[0], parts[1], parts[2])

n = 100000
# Average seconds per call
print("%.6f" % (timeit.timeit(f, number=n) / n))

returns 0.000009

Andrew Miller
2

Most recent answer: if moving to a straight strptime() has not improved the running time, then my suspicion is that there is actually no problem here: you have simply written a program, one of whose main purposes in life is to call strptime() very many times, and you have written it well enough — with so little other stuff that it does — that the strptime() calls are quite properly being allowed to dominate the runtime. I think you could count this as a success rather than a failure, unless you find that (a) some Unicode or LANG setting is making strptime() do extra work, or (b) you are calling it more often than you need to. Try, of course, to call it only once for each date to be parsed. :-)

Follow-up answer after seeing example date string: Wait! Hold on! Why are you parsing the line instead of just using a formatting string like:

"%d/%b/%Y:%H:%M:%S"

Original off-the-cuff answer: If the month were an integer you could do something like this:

new_entry['time'] = datetime.datetime(
    int(parsed_line['year']),
    int(parsed_line['month']),
    int(parsed_line['day']),
    int(parsed_line['hour']),
    int(parsed_line['minute']),
    int(parsed_line['second'])
)

and avoid creating a big string just to make strptime() split it back apart again. I wonder if there is a way to access the month-name logic directly to do that one textual conversion?
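One possible way to reach the month names without going through strptime() is the standard library's `calendar.month_abbr`, sketched below. This is an assumption about what would fit here, not something the answer itself proposes; the name `MONTH_NUMBERS` is hypothetical, and the abbreviations follow the current locale, so the result only matches 'Jan' through 'Dec' under an English or default C locale.

```python
import calendar

# calendar.month_abbr is indexable from 0 to 12; index 0 is an empty
# string, so it is filtered out when building the reverse lookup.
MONTH_NUMBERS = {abbr: number
                 for number, abbr in enumerate(calendar.month_abbr)
                 if abbr}

# Hypothetical usage with a month field taken from a log line.
month = MONTH_NUMBERS['Nov']
print(month)
```

Building the dictionary once at import time keeps the per-line cost to a single lookup, which is the same cost as a hand-written table.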

Brandon Rhodes
  • Tried not parsing apart the date and letting strptime do it, as per your edit. It didn't make much of a difference in running time... – Kyle Brandt Nov 01 '10 at 16:47
  • When using strptime(), you should just use a formatting string. That's the intended use. – Rafe Kettler Nov 01 '10 at 16:50
  • Well I tried putting that part as its own set of worker threads to speed it up. I got the results I bet most non-threading masters get when they attempt this ... twice as slow ;-) – Kyle Brandt Nov 02 '10 at 17:18
  • The `strptime()` call is typically such a blazingly fast and simple call that absolutely any overhead that you add to it will just slow things down, which is why I did not suggest anything like caching in case there are duplicate dates. I mean, after all, it is written in C rather than Python. And threading was slower, by the way, because standard CPython is not thread-safe and so only one thread at a time can run Python code (though many threads can safely wait on I/O). – Brandon Rhodes Nov 02 '10 at 17:53
2

What's a "lot of time"? strptime is taking about 30 microseconds here:

from datetime import datetime
import timeit

def f():
    datetime.strptime("01/Nov/2010:07:49:33", "%d/%b/%Y:%H:%M:%S")

n = 100000
# Average seconds per call
print("%.6f" % (timeit.timeit(f, number=n) / n))

prints 0.000031.

Glenn Maynard