14

I have a Python datetime timestamp and a large dict (index) where keys are timestamps and the values are some other information I'm interested in.

I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.

At the moment I'm doing something like:

for timestamp in timestamps:
    closestTimestamp = min(index,key=lambda datetime : abs(timestamp - datetime))

which works, but takes too long - my index dict has millions of values, and I'm doing the search thousands of times. I'm flexible with data structures and so on - the timestamps are roughly sequential, so that I'm iterating from the first to the last timestamps. Likewise the timestamps in the text file that I load into the dict are sequential.

Any ideas for optimisation would be greatly appreciated.

Caligari
  • 1,381
  • 2
  • 10
  • 14
  • Is the large dict relatively static, or do you add and delete entries often? – rob mayoff Nov 17 '11 at 05:09
  • The dict is effectively entirely static. – Caligari Nov 17 '11 at 05:11
  • Thanks very much for all the useful answers. I've had a bit of a play around with the suggestions and it looks like I'll definately be able to solve my problem, the speed increases are huge. Home time now, so I'll have a bit more of a play tomorrow and update with my final implementation. – Caligari Nov 17 '11 at 06:22
  • Thanks again for all the answers, this is such a brilliant community. Because a dict makes sense in the rest of the code, just not for this search, I ended up making a list from the dict keys, sorting it, then using bisect_left to search it, then accessing the values in my original dict using the key from the list. The increase in speed is insane, like 10k times faster. – Caligari Nov 17 '11 at 23:07

3 Answers3

27

Dictionaries aren't organized for efficient near miss searches. They are designed for exact matches (using a hash table).

You may be better-off maintaining a separate, fast-searchable ordered structure.

A simple way to start off is to use the bisect module for fast O(log N) searches but slower O(n) insertions:

def nearest(ts):
    # Given a presorted list of timestamps:  s = sorted(index)
    i = bisect_left(s, ts)
    return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))

A more sophisticated approach suitable for non-static, dynamically updated dicts, would be to use blist which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.

If you want to stay with a dictionary based approach, consider a dict-of-lists that clusters entries with nearby timestamps:

 def get_closest_stamp(ts):
      'Speed-up timestamp search by looking only at entries in the same hour'
      hour = round_to_nearest_hour(ts)
      cluster = daydict[hour]         # return a list of entries
      return min(cluster, key=lambda t: abs(ts - t))

Note, for exact results near cluster boundaries, store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • 2
    Excellent comprehensive answer! (It's nice seeing you here on SO, by the way, Raymond. :)) – Nick Johnson Nov 17 '11 at 05:46
  • why i+2 in return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t)) ? Seems to me like it could be +1 and it would still work – Hammer Nov 06 '13 at 23:45
4

datetime objects are comparable to each other, so make a sorted list of your key/value pairs like this:

myPairs = list(dict.iteritems())
myPairs.sort()

For each element myPairs[i], myPairs[i][0] is the datetime key and myPairs[i][1] is the value.

You can search this list efficiently using bisect_left:

import bisect
i = bisect.bisect_left(myPairs, targetDatetime)

The element myPairs[i] is the element with the lowest datetime no earlier than targetDatetime. But the prior element (if there is one) might be closer in time to targetDatetime. Or targetDatetime might be later than any time in myPairs. So you need to check:

if i > 0 and i == len(myPairs):
    i -= 1
elif i > 0 and targetDatetime - myPairs[i-1][0] < myPairs[i][0]- targetDatetime:
    i -= 1
rob mayoff
  • 375,296
  • 67
  • 796
  • 848
2

If your list is truly sorted and not just "roughly sequential", you can use a binary search. Have a look at the bisect module documentation for more information.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622