10

I want to know how I could build some kind of index on the keys of a Python dictionary. The dictionary holds approx. 400,000 items, so I am trying to avoid a linear search.

Basically, I am trying to find if the userinput is inside any of the dict keys.

for keys in dict:
    if userinput in keys:
        DoSomething()
        break

That would be an example of what I am trying to do. Is there a way to search in a more direct way, without a loop? Or what would be a more efficient approach?

Clarification: The userinput is not exactly what the key will be, e.g. userinput could be 'log', whereas the key is 'logfile'.

Edit: any list/cache creation, pre-processing or organisation that can be done prior to searching is acceptable. The only thing that needs to be quick is the search for the key.

dreftymac
Trent
  • What is the type of your key? I don't think you need two loops. – Mikel Mar 02 '11 at 22:49
  • When all available (and implementable) algorithms have unacceptable complexity, it's time to rethink the problem or the data structure used. There are (relatively exotic) data structures for fuzzy string matching, although I don't know if there are any for arbitrary substrings. –  Mar 02 '11 at 22:51
  • If userinput will always start at the beginning of a key, then you could create a `keycache` of nested dicts `{'a':{'b':['abacus', 'absinthe'...] ...} ...}` – senderle Mar 02 '11 at 22:57
  • The key is just a string, eg 'logfile' – Trent Mar 02 '11 at 22:57
  • @Trent: Then your `if userinput in keys` line is wrong. Can you try removing the for loop, make it work for one item, then update your question with the code you mean? – Mikel Mar 02 '11 at 23:01
  • But is the userinput always at the beginning of the key? – kojiro Mar 02 '11 at 23:04
  • @Mikel how is it wrong? Using plural 'keys' is confusing, but `'foo' in 'foobar' is True`. Seems right to me. – kojiro Mar 02 '11 at 23:05
  • @Trent, @kojiro: Sorry, you're right, I should have tested first. Yay for Python's string/list duality. – Mikel Mar 02 '11 at 23:11
  • @Trent, I'm still wondering whether `userinput` is always a _prefix_, as in your 'log'/'logfile' example, or whether `userinput` is an arbitrary substring, as in 'file'/'longfilename'. It makes a pretty big difference. – senderle Mar 03 '11 at 02:00
  • substring :) need the flexibility – Trent Mar 03 '11 at 04:23

6 Answers

6

If you only need to find keys that start with a prefix, then you can use a trie. More complex data structures exist for finding keys that contain a substring anywhere within them, but they take up a lot more space to store, so it's a space-time trade-off.
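
A minimal sketch of a dict-based trie for the prefix case (myDict and userinput are stand-ins for the names in the question; this does not handle arbitrary substrings):

def build_trie(keys):
    root = {}
    for word in keys:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word              # marker: a complete key ends here
    return root

def keys_with_prefix(trie, prefix):
    node = trie
    for ch in prefix:                 # walk down one character at a time
        if ch not in node:
            return []                 # no key starts with this prefix
        node = node[ch]
    matches, stack = [], [node]       # collect every complete key below this node
    while stack:
        current = stack.pop()
        for label, child in current.items():
            if label == '$':
                matches.append(child)
            else:
                stack.append(child)
    return matches

trie = build_trie(myDict)             # one-time cost; each lookup is then O(len(prefix))
for key in keys_with_prefix(trie, userinput):
    DoSomething()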

Mark Byers
  • Ah, "trie" -- guess that's the word I wanted for my comment above. +1 – senderle Mar 02 '11 at 23:05
  • Wouldn't building the trie itself be quite costly? You'd still have to iterate through the whole dictionary, and at that point, it's not going to be substantially better than the naive solution, and might be worse for the average case. – Chinmay Kanchi Mar 03 '11 at 03:09
  • @Chinmay Kanchi, it would be costly initially, but you'd only have to do it once, right? All future prefix lookups would be fast, because you'd look in the trie for a match or matches. (Presumably you'd present a list of potential matches if there are more than one in the trie, or have some other algorithm for picking one.) Furthermore, the trie could be built in advance and pickled; and if the dict changes, it could be updated fairly easily & pickled again at program exit. – senderle Mar 03 '11 at 04:00
  • @senderle: Never mind, I posted that before I saw the updated post by the OP. – Chinmay Kanchi Mar 03 '11 at 04:06
3

No. The only way to search for a substring in dictionary keys is to look at each key; something like what you've suggested is the only way to do it with a plain dictionary.

However, if you have 400,000 records and you want to speed up your search, I'd suggest using an SQLite database. Then you can just say SELECT * FROM TABLE_NAME WHERE COLUMN_NAME LIKE '%userinput%';. Look at the documentation for Python's sqlite3 module here.
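
A rough sketch of that approach with the sqlite3 module (the table and column names are made up, and myDict stands in for the 400,000-item dict):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT INTO kv VALUES (?, ?)', myDict.items())

def find_keys(userinput):
    # let SQLite do the substring match; the parameter is bound, not interpolated
    cur = conn.execute("SELECT key FROM kv WHERE key LIKE '%' || ? || '%'",
                       (userinput,))
    return [row[0] for row in cur]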

Another option is to use a generator expression, as these are almost always faster than the equivalent for loops.

filteredKeys = (key for key in myDict.keys() if userInput in key)
for key in filteredKeys:
    doSomething()

EDIT: If, as you say, you don't care about one-time costs, use a database. SQLite should do what you want damn near perfectly.

I did some benchmarks, and to my surprise, the naive algorithm is actually about 1.5 times as fast as a version using list comprehensions and about 4.5 times as fast as a SQLite-driven version (see the numbers below). In light of these results, I'd have to go with @Mark Byers and recommend a trie. I've posted the benchmark below, in case someone wants to give it a go.

import random, string, os
import time
import sqlite3

def buildDict(numElements):
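    # numElements-10 random six-letter keys, plus ten keys that contain 'log'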
    aDict = {}
    for i in xrange(numElements-10):
        aDict[''.join(random.sample(string.letters, 6))] = 0

    for i in xrange(10):
        aDict['log'+''.join(random.sample(string.letters, 3))] = 0

    return aDict

def naiveLCSearch(aDict, searchString):
    filteredKeys = [key for key in aDict.keys() if searchString in key]
    return filteredKeys

def naiveSearch(aDict, searchString):
    filteredKeys = []
    for key in aDict:
        if searchString in key: 
            filteredKeys.append(key)
    return filteredKeys

def insertIntoDB(aDict):
    conn = sqlite3.connect('/tmp/dictdb')
    c = conn.cursor()
    c.execute('DROP TABLE IF EXISTS BLAH')
    c.execute('CREATE TABLE BLAH (KEY TEXT PRIMARY KEY, VALUE TEXT)')
    for key in aDict:
        c.execute('INSERT INTO BLAH VALUES(?,?)',(key, aDict[key]))
    return conn

def dbSearch(conn):
    cursor = conn.cursor()
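    # GLOB '*log*' does a case-sensitive substring match on the key column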
    cursor.execute("SELECT KEY FROM BLAH WHERE KEY GLOB '*log*'")
    return [record[0] for record in cursor]

if __name__ == '__main__':
    aDict = buildDict(400000)
    conn = insertIntoDB(aDict)
    startTimeNaive = time.time()
    for i in xrange(3):
        naiveResults = naiveSearch(aDict, 'log')
    endTimeNaive = time.time()
    print 'Time taken for 3 iterations of naive search was', (endTimeNaive-startTimeNaive), 'and the average time per run was', (endTimeNaive-startTimeNaive)/3.0

    startTimeNaiveLC = time.time()
    for i in xrange(3):
        naiveLCResults = naiveLCSearch(aDict, 'log')
    endTimeNaiveLC = time.time()
    print 'Time taken for 3 iterations of naive search with list comprehensions was', (endTimeNaiveLC-startTimeNaiveLC), 'and the average time per run was', (endTimeNaiveLC-startTimeNaiveLC)/3.0

    startTimeDB = time.time()
    for i in xrange(3):
        dbResults = dbSearch(conn)
    endTimeDB = time.time()
    print 'Time taken for 3 iterations of DB search was', (endTimeDB-startTimeDB), 'and the average time per run was', (endTimeDB-startTimeDB)/3.0


    os.remove('/tmp/dictdb')

For the record, my results were:

Time taken for 3 iterations of naive search was 0.264658927917 and the average time per run was 0.0882196426392
Time taken for 3 iterations of naive search with list comprehensions was 0.403481960297 and the average time per run was 0.134493986766
Time taken for 3 iterations of DB search was 1.19464492798 and the average time per run was 0.398214975993

All times are in seconds.

Chinmay Kanchi
  • I would shorten the above to "However, if you have 400,000 records, use a database." – senderle Mar 02 '11 at 22:58
  • "Another option is to use list comprehensions, as these are almost always faster than the equivalent loops." Source? – Mikel Mar 02 '11 at 23:00
  • @Mikel, I don't know a source, but I believe Chinmay is right, simply based on experience. – senderle Mar 02 '11 at 23:02
  • @Mikel: http://wiki.python.org/moin/PythonSpeed/PerformanceTips . Look in the section on Loops. – Chinmay Kanchi Mar 02 '11 at 23:03
  • @Chinmay But we don't know if he needs to process the entire set of keys. Doing that list comprehension is likely to be costly unless userInput is not actually in the keys at all. I suggest using a generator instead of a list comprehension. Then you can break out after the first match is found. – kojiro Mar 02 '11 at 23:19
  • @kojiro: Good point, edited my answer. Worst-case will still be the same, since there is no guarantee about the order of the keys, but should be much better in the average and best cases. – Chinmay Kanchi Mar 02 '11 at 23:30
  • @Chinmay: remove the `.keys()` – John Machin Mar 03 '11 at 03:39
  • @John: I prefer to leave it in there, as I find it makes the code more obvious as to its intent. I realise that it's not necessary though. – Chinmay Kanchi Mar 03 '11 at 03:45
  • @Chinmay, sadly (from a readability standpoint) I think John is right; `in mydict.keys()` searches a list, not a dict, and is terribly slow. (Not that it matters now!) – senderle Mar 03 '11 at 04:19
  • Flummoxing that the list comprehension is slower... did you try it with a generator? – senderle Mar 03 '11 at 04:50
  • Oh! You're creating a `.keys()` list there too. I tested this with `filteredKeys = [key for key in aDict if searchString in key]` and it was almost exactly the same speed as the naive loop. Then I tested it with `filteredKeys = (key for key in aDict.keys() if searchString in key)` and it was -- get this -- _four orders of magnitude faster_. Then I realized that's because the generator doesn't actually run :). Oh well. – senderle Mar 03 '11 at 05:17
3

If you only need to find keys that start with a prefix then you can use a binary search. Something like this will do the job:

import bisect
words = sorted("""
a b c stack stacey stackoverflow stacked star stare x y z
""".split())
n = len(words)
print n, "words"
print words
print
tests = sorted("""
r s ss st sta stack star stare stop su t
""".split())
for test in tests:
    i = bisect.bisect_left(words, test)  # index of the first word >= test
    print test, i
    while i < n and words[i].startswith(test):
        print i, words[i]
        i += 1

Output:

12 words
['a', 'b', 'c', 'stacey', 'stack', 'stacked', 'stackoverflow', 'star', 'stare',
'x', 'y', 'z']

r 3
s 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
ss 3
st 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
sta 3
3 stacey
4 stack
5 stacked
6 stackoverflow
7 star
8 stare
stack 4
4 stack
5 stacked
6 stackoverflow
star 7
7 star
8 stare
stare 8
8 stare
stop 9
su 9
t 9
John Machin
1

dpath can solve this for you easily.

http://github.com/akesterson/dpath-python

$ easy_install dpath

>>> import dpath.util
>>> for (path, value) in dpath.util.search(MY_DICT, "glob/to/start/{}".format(userinput), yielded=True):
...     pass  # do something with the path and value

You can pass an eglob ('path//to//something/[0-9a-z]') for advanced searching.

1

You could join all the keys into one long string with a suitable separator character and use the find method of the string. That is pretty fast.

Perhaps this code is helpful to you. The search method returns a list of dictionary values whose keys contain the substring key.

class DictLookupBySubstr(object):
    def __init__(self, dictionary, separator='\n'):
        self.dic = dictionary
        self.sep = separator
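        # all keys joined into one string, each key terminated by the separator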
        self.txt = separator.join(dictionary.keys())+separator

    def search(self, key):
        res = []
        i = self.txt.find(key)
        while i >= 0:
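            # widen the match to the full key between the enclosing separators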
            left = self.txt.rfind(self.sep, 0, i) + 1
            right = self.txt.find(self.sep, i)
            dic_key = self.txt[left:right]
            res.append(self.dic[dic_key])
            i = self.txt.find(key, right+1)
        return res
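
A brief usage sketch (the sample data is made up):

d = {'logfile': '/var/log/app.log', 'config': '/etc/app.conf'}
lookup = DictLookupBySubstr(d)
print lookup.search('log')    # -> ['/var/log/app.log']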
Janne Karila
0

Perhaps using has_key would solve this too.

http://docs.python.org/release/2.5.2/lib/typesmapping.html
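
For reference, a minimal example of what has_key does (Python 2; the sample dict is made up):

d = {'logfile': 1}
print d.has_key('logfile')   # True: exact key lookup
print d.has_key('log')       # False: 'log' is not itself a key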

neosergio