6

I have a list of strings (word-like tokens), and, while I am parsing a text, I need to check if a word belongs to the group of words in my current list.

However, my input is pretty big (about 600 million lines), and checking if an element belongs to a list is an O(n) operation according to the Python documentation.

My code is something like:

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

As it takes too much time (days, actually), I wanted to improve the part that is taking most of the time. I had a look at the Python collections, and, more precisely, at deque. However, it only gives O(1) access to the head and the tail of a list, not to the middle.

Does anyone have an idea about how to do this in a better way?

Jiehong
    Is there any reason you can't work with a set of words instead? There may be 600 million lines, but there are far fewer English words in use (even including leading and trailing punctuation, if you don't clean it.) Testing membership in a set should be very quick. – DSM Jun 08 '12 at 00:02
  • @DSM: O(1) in fact, assuming relatively few hash collisions :) – Joel Cornett Jun 08 '12 at 00:26
  • You can't check if an item is in a list efficiently. That's not what lists are for. You need to choose your data types (particularly collections) to be suitable for what you're going to do with them, because no data type is good at everything. – Ben Jun 08 '12 at 01:54

4 Answers

19

You might consider a trie or a DAWG or a database. There are several Python implementations of each.
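To give a feel for the trie approach without pulling in a library, here is a minimal sketch using nested dicts (the `_END` sentinel is just an illustrative convention, not part of any particular package):

# Minimal dict-of-dicts trie sketch for exact word membership.
_END = object()  # illustrative sentinel marking where a complete word ends

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True  # a complete word stops at this node
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return _END in node

trie = make_trie(['cat', 'car', 'dog'])
print in_trie(trie, 'car')  # True
print in_trie(trie, 'ca')   # False -- 'ca' is only a prefix

Lookup cost is O(length of the word), independent of how many words are stored; a real trie or DAWG library will be far more memory-efficient than nested dicts.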

Here are some relative timings of a set vs. a list for you to consider:

import timeit
import random

with open('/usr/share/dict/words', 'r') as di:  # UNIX word list, ~250k unique words
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set)  # note: sorting would not speed up `in`; that would need bisect

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use a list for the source, a set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print "list:", timeit.Timer(list_f).timeit(1), "secs"
print "set:", timeit.Timer(set_f).timeit(1), "secs"
print "mixed:", timeit.Timer(mix_f).timeit(1), "secs"

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

I.e., matching a set of 10,000 words against a set of 250,000 words is 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than an unsorted list alone.

For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.

Conclusion: Use better data structures for 600 million lines of text!

the wolf
  • This sounds great. My first code needed about 500 days of computation, and about 50 days after a clever refactoring. Now it only needs something like 1 hour! Even if my set is 200,000 entries long, that's impressive. – Jiehong Jun 08 '12 at 08:53
  • @user1443418: The key delaying factor is the Python operator `in` against a list. If you mix these two data structures and use a list for data access (i.e., `for word in test_list`) and a set for membership testing (i.e., `if word in all_words_set`), it is even faster. Sets are way faster for membership testing; lists are faster for linear access. `Know your tools, Luke.` – the wolf Jun 08 '12 at 09:40
  • It's what I've used after I saw your answer! Thanks again. – Jiehong Jun 08 '12 at 10:25
  • @Jiehong: Feel free to accept the answer if it helped you out. – the wolf Jun 08 '12 at 10:26
1

I'm not clear on why you chose a list in the first place, but here are some alternatives:

Using a set() is likely a good idea. It is very fast, though unordered, and sometimes that's exactly what's needed.

If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/

If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
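To give a flavor of the Bloom filter idea (this is a toy sketch, not the library linked above; the `num_bits` and `num_hashes` parameters are made up for illustration):

import hashlib

class ToyBloomFilter(object):
    """Toy Bloom filter: k salted hashes set k bits per word.
    Lookups can yield false positives, never false negatives."""
    def __init__(self, num_bits=1000000, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word):
        # derive num_hashes bit positions from salted md5 digests
        for salt in range(self.num_hashes):
            h = hashlib.md5(str(salt) + word).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

bf = ToyBloomFilter()
for w in ['cat', 'car', 'dog']:
    bf.add(w)
print 'cat' in bf   # True
print 'fish' in bf  # False (almost certainly)

A hit might be a false positive, so re-check hits against the real word list if exactness matters; a miss is always genuine.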

Depending on what you're doing, a trie might also be very good.

user1277476
0

This uses a list comprehension:

words_in_line = [word for word in line if word in my_list]

which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
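Note that the comprehension alone keeps the O(n) `in my_list` test; if `my_list` fits in memory as a set (a safe assumption for a word list), building the set once fixes the actual bottleneck:

my_set = set(my_list)  # build once: lookups drop from O(n) to O(1)
words_in_line = [word for word in line if word in my_set]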

Levon
  • Nope, that is not the sort of answer we are looking for in this case. This still does 600M O(n) operations (`if word in my_list`); it doesn't address the real problem. – Nikana Reklawyks Oct 28 '17 at 10:04
0

There are two improvements you can make here.

  • Back your word list with a hashtable. This will afford you O(1) performance when you are checking if a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
  • Use a more appropriate structure for your matching-word collection.
    • If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of lists (see the sketch after this list).
    • If you don't need all the matches in memory at once, consider using a generator. A generator is used to iterate over matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
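If you do need every match in memory, the deque option mentioned above looks like this (a minimal sketch; `my_set` and 'input.txt' are stand-ins for your own data):

from collections import deque

my_set = set(['a', 'b', 'c'])    # your word list, as a set
matches = deque()                # O(1) appends without list reallocation pauses
for line in open('input.txt'):   # hypothetical input file
    for word in line.split():
        if word in my_set:
            matches.append(word)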

Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).

from itertools import chain
d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')
# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)
# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)
for matched_word in matching_words_generator:
    # Do something with matched_word
    print matched_word
# We're reading the file during the above loop, so don't close it too early
f.close()

input.txt

a b dog cat
c dog poop
maybe b cat
dog

Output

a
b
c
b
cheeken