Is there any way to compare two dates in Python without calling strptime each time? Given my problem I suspect there's no other way, but I want to make sure I've checked all the options.

I'm going through a very large log file. Each line has a date, which I need to check against a range defined by two other dates. I have to convert the date on every line with strptime, which is causing a large bottleneck:

Fri Sep  2 15:12:43 2016    output2.file

         63518075 function calls (63517618 primitive calls) in 171.409 seconds

   Ordered by: cumulative time
   List reduced from 571 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.003    0.003  171.410  171.410 script.py:3(<module>)
        1    0.429    0.429  171.367  171.367 script.py:1074(main)
        1    3.357    3.357  162.009  162.009 script.py:695(get_data)
  1569898   14.088    0.000  141.175    0.000 script.py:648(check_line)
  1569902    6.899    0.000   71.706    0.000 {built-in method strptime}
  1569902   31.198    0.000   64.805    0.000 /usr/lib64/python2.7/_strptime.py:295(_strptime)
  1569876   15.324    0.000   43.170    0.000 script.py:626(dict_add)
  4709757   23.370    0.000   23.370    0.000 {method 'strftime' of 'datetime.date' objects}
  1569904    1.655    0.000   18.799    0.000 /usr/lib64/python2.7/_strptime.py:27(_getlang)
  1569899    2.103    0.000   17.452    0.000 script.py:592(reverse)

The dates are formatted like this:

current_date = "01/Aug/1995:23:59:53"

And I'm comparing them like this:

with open(logfile) as file:
    for line in file:
        current_date = strptime_method(line)
        # Stop reading once we reach the end of the range
        if current_date >= end_date:
            break

Is there any alternative when it comes to comparing dates?

Edit: Thanks everyone, in particular user2539738. Here are the results based on his/her suggestion, which show a big speed difference:

Fri Sep  2 16:14:59 2016    output3.file

         24270567 function calls (24270110 primitive calls) in 105.466 seconds

   Ordered by: cumulative time
   List reduced from 571 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002  105.466  105.466 script.py:3(<module>)
        1    0.487    0.487  105.433  105.433 script.py:1082(main)
        1    3.159    3.159   95.861   95.861 script.py:702(get_data)
  1569898   21.663    0.000   77.138    0.000 script.py:648(check_line)
  1569876   14.979    0.000   43.408    0.000 script.py:626(dict_add)
  4709757   23.865    0.000   23.865    0.000 {method 'strftime' of 'datetime.date' objects}
  1569899    1.943    0.000   15.556    0.000 script.py:592(reverse)
        1    0.000    0.000    9.078    9.078 script.py:1066(print_data)
        1    0.021    0.021    9.044    9.044 script.py:1005(print_ip)
       10    0.001    0.000    7.067    0.707 script.py:778(ip_api)
user1165419
  • If the input log records are ordered by date, you probably don't have to check every record against the date range; you could use, say, a binary search to find the start and end records for your range. Just a thought. – alecxe Sep 02 '16 at 15:45
  • What's `strptime_method`? Some of your own code? Also, are you using `time` (the functional module for processing dates and times) or `datetime` (the class-based module for same)? – Vivian Sep 02 '16 at 15:46
  • @alecxe That's what I already do. It breaks out of the loop when it finds a date out of range. But if the range is quite large, it can still be time-consuming, as my results show, mainly because strptime is called on every line. – user1165419 Sep 02 '16 at 15:47
  • @DavidHeyman strptime_method is just my own code; I do a few things with the line in a function. I'm using datetime to convert, like this: `current_time = datetime.datetime.strptime(date_from_line, "%d/%b/%Y:%H:%M:%S")` – user1165419 Sep 02 '16 at 15:48
  • Hm. Have you tried multiprocessing or threading? You might be able to save some time by searching for the first and last dates in the range simultaneously. Also, things might (maybe) process faster if you had each log in a different file instead of all in one file. – Vivian Sep 02 '16 at 15:49
  • Maybe try the `time` library instead, in case it's any faster (I think it's internally simpler, though I haven't got a chance to double-check the source myself right now). – Vivian Sep 02 '16 at 15:56
  • One of the reasons I really like the [ISO 8601 date/time format](https://en.wikipedia.org/wiki/ISO_8601) is because you can compare them as strings without needing to convert to anything else. – Mark Ransom Sep 02 '16 at 15:57
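
The binary-search idea from alecxe's comment could be sketched roughly as follows. This assumes the timestamps are already sorted chronologically and held in a list; the names `sort_key` and `first_at_or_after` and the sample data are hypothetical, and a very large file would more likely need seeking by byte offset than an in-memory list.

```python
# Hypothetical sketch: binary search over timestamps sorted chronologically.
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def sort_key(stamp):
    """Turn '01/Aug/1995:23:59:53' into '1995080123:59:53', which sorts chronologically."""
    return stamp[7:11] + MONTHS[stamp[3:6]] + stamp[0:2] + stamp[12:]

def first_at_or_after(stamps, target):
    """Return the index of the first stamp >= target (len(stamps) if there is none)."""
    target_key = sort_key(target)
    lo, hi = 0, len(stamps)
    while lo < hi:
        mid = (lo + hi) // 2
        if sort_key(stamps[mid]) < target_key:
            lo = mid + 1
        else:
            hi = mid
    return lo

stamps = [
    '31/Jul/1995:23:59:59',
    '01/Aug/1995:00:00:01',
    '01/Aug/1995:12:30:00',
    '02/Aug/1995:08:15:42',
]
start = first_at_or_after(stamps, '01/Aug/1995:00:00:00')
end = first_at_or_after(stamps, '02/Aug/1995:00:00:00')
in_range = stamps[start:end]  # everything dated 01/Aug/1995
```

Only about log2(n) timestamps are converted per lookup, rather than one per line.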

2 Answers

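I'm assuming current_date is a string.

First, make a dictionary that maps month abbreviations to numbers:

moDict = {"Jan":1, "Feb":2, "Mar":3, "Apr":4, "May":5, "Jun":6, "Jul":7, "Aug":8, "Sep":9, "Oct":10, "Nov":11, "Dec":12}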

Then slice out the year, month, and day:

import datetime

current_date = "01/Aug/1995:23:59:53"

Yr = int(current_date[7:11])
Mo = moDict[current_date[3:6]]
Day = int(current_date[0:2])

# Note: this keeps only the date; the time of day is dropped
m_date = datetime.datetime(Yr, Mo, Day)

And you can use that to make comparisons
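
A slightly fuller sketch of the same idea, which also keeps the time of day by slicing out the hours, minutes, and seconds. The function name `parse_log_date` and the `start_date`/`end_date` values are illustrative assumptions, not taken from the question.

```python
import datetime

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def parse_log_date(stamp):
    """Build a datetime from a fixed-width stamp such as '01/Aug/1995:23:59:53'."""
    return datetime.datetime(
        int(stamp[7:11]),    # year
        MONTHS[stamp[3:6]],  # month abbreviation -> month number
        int(stamp[0:2]),     # day
        int(stamp[12:14]),   # hour
        int(stamp[15:17]),   # minute
        int(stamp[18:20]),   # second
    )

# Hypothetical range bounds, built once outside any loop
start_date = datetime.datetime(1995, 8, 1)
end_date = datetime.datetime(1995, 8, 2)

current = parse_log_date("01/Aug/1995:23:59:53")
in_range = start_date <= current < end_date
```

This only works because every field sits at a fixed offset; a stamp with a one-digit day and no leading zero would break the slices.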

Mohammad Athar
  • I would be entirely unsurprised to find that `strptime` already does this internally. Have you actually tested the speed? – Vivian Sep 02 '16 at 15:52
  • @DavidHeyman Even if `strptime` does this internally, it has to interpret the format string. On the other hand, it doesn't have to interpret Python. :) – Kaz Sep 02 '16 at 17:19

Since your dates appear to be in a fixed-length format, they're trivially easy to parse and you don't need strptime to do it. You can rearrange them into the ISO 8601 date/time format and compare them directly as strings!

mos = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def custom_to_8601(dt):
    return dt[7:11] + '-' + mos[dt[3:6]] + '-' + dt[0:2] + 'T' + dt[12:]

>>> custom_to_8601('01/Aug/1995:23:59:53')
'1995-08-01T23:59:53'

It might be a touch faster to use join instead of string concatenation and leave out the punctuation:

def comparable_date(dt):
    return ''.join([dt[7:11], mos[dt[3:6]], dt[0:2], dt[12:]])

>>> comparable_date('01/Aug/1995:23:59:53')
'1995080123:59:53'
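
To show how this could slot into the loop from the question, here is a rough sketch. It assumes the timestamp follows the first `[` on each line, as in Common Log Format, which the question does not actually state. The name `lines_in_range` and the sample lines are made up for illustration.

```python
mos = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
       'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
       'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def comparable_date(dt):
    return ''.join([dt[7:11], mos[dt[3:6]], dt[0:2], dt[12:]])

def lines_in_range(lines, start, end):
    """Yield lines whose timestamp satisfies start <= stamp < end.

    Assumes the lines are in chronological order, so iteration can stop
    at the first line at or past `end`.
    """
    # Convert the bounds once, outside the loop
    start_key = comparable_date(start)
    end_key = comparable_date(end)
    for line in lines:
        # Assumption: the stamp follows the first '[' and is exactly 20 characters
        pos = line.index('[') + 1
        key = comparable_date(line[pos:pos + 20])
        if key >= end_key:
            break
        if key >= start_key:
            yield line

sample = [
    'host1 - - [31/Jul/1995:23:59:59 -0400] "GET /a HTTP/1.0" 200 1',
    'host2 - - [01/Aug/1995:00:00:01 -0400] "GET /b HTTP/1.0" 200 2',
    'host3 - - [01/Aug/1995:23:59:53 -0400] "GET /c HTTP/1.0" 200 3',
    'host4 - - [02/Aug/1995:00:00:00 -0400] "GET /d HTTP/1.0" 200 4',
]
selected = list(lines_in_range(sample, '01/Aug/1995:00:00:00',
                               '02/Aug/1995:00:00:00'))
```

The comparison is purely lexicographic, so it is only valid while every key has the same width and field order.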

Running cProfile on 1,000,000 repetitions produces these timings for me:

  • custom_to_8601: 0.978 seconds
  • comparable_date: 0.937 seconds
  • your original code with strptime: 15.492 seconds
  • an earlier answer using the datetime constructor: 1.134 seconds
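
For anyone wanting to reproduce a comparison like this, a rough timing harness might look like the following. It uses `timeit` rather than cProfile, the repetition count is arbitrary, and absolute numbers will differ by machine and Python version. The name `with_strptime` is a stand-in for the question's code, using the format string given in the comments.

```python
import datetime
import timeit

mos = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
       'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
       'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

STAMP = '01/Aug/1995:23:59:53'

def with_strptime(dt):
    # Stand-in for the original approach
    return datetime.datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S')

def comparable_date(dt):
    return ''.join([dt[7:11], mos[dt[3:6]], dt[0:2], dt[12:]])

REPS = 20000  # arbitrary; raise for steadier numbers

results = {}
for func in (with_strptime, comparable_date):
    # timeit runs immediately, so the lambda sees the current func
    results[func.__name__] = timeit.timeit(lambda: func(STAMP), number=REPS)

# Print the candidates from fastest to slowest
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print('%-16s %.3f s' % (name, seconds))
```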
Mark Ransom