
I am working on text summarization, and to build my vocabulary I have trained on a dataset. Now I need the vectors for those vocabulary words from Google's Word2Vec. I've written simple code that takes each word and searches for it in the google-vectors file, which contains around 3 million words. The problem is that this kind of linear search would take weeks to compute. I am using Python. How can I search for these words more efficiently?

found_counter = 0
# iterate over the first 50 vocabulary words, one per line
with open('vocab_training.txt', 'r') as file1:
    for i, line in enumerate(file1):
        if i >= 50:
            break
        word = line.strip().lower()
        # re-scan the ~3 million-line vectors file for every single word
        with open('google-vectors.txt', 'r') as file2:
            for line2 in file2:
                if word == line2.split()[0].lower():
                    found_counter += 1
print(found_counter)
  • I would take the smaller file and try to load it into a dictionary (or list), then just loop through the larger file and find matches in the dictionary or list. If memory is a problem (and it might be), then `divide and conquer` is next: instead of loading all 150K words at once, just load them in chunks and proceed. (A minimal sketch of this single-pass approach follows these comments.) – sal Aug 13 '17 at 10:30
  • Or you can use an in-memory database instead of working on the raw data. Database systems have special mechanisms for searching data, such as indexes. – Take_Care_ Aug 13 '17 at 10:37
  • 3 million words should take on the order of 150MB if you load it into a Python set. 150k words might take around 5MB. These are small data sets and there's no problem reading everything into RAM. – Paul Hankin Aug 13 '17 at 10:38
  • If you can make use of regex, you can improve the performance – vjnan369 Aug 13 '17 at 10:41
  • I'm voting to close this question as off-topic because questions asking for improvements to working code should be asked on Code Review, not Stack Overflow. – TylerH Feb 28 '18 at 15:09
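
A minimal sketch of the single-pass approach sal suggests above, assuming the vocabulary file has one word per line and each line of the vectors file begins with the word followed by its vector values (file names taken from the question):

# Load the smaller file (~150K vocabulary words) into a set,
# then make a single pass over the ~3 million-line vectors file.
with open('vocab_training.txt', 'r') as f:
    vocab = set(line.strip().lower() for line in f)

found_counter = 0
with open('google-vectors.txt', 'r') as f:
    for line in f:
        parts = line.split()
        if parts and parts[0].lower() in vocab:
            found_counter += 1
print(found_counter)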

1 Answer


Option: load the 3 million words into memory in a hash table and check for membership; in Python you would keep them in a set:

with open('google-vectors.txt', 'r') as f:
  # keep only the leading word of each line, not the vector values
  words = set(l.split()[0].lower() for l in f)

...
  if line.strip().lower() in words:
    ...

Other options:

  1. Keep a sorted list with log(n) lookup via binary search (a hash beats that)
  2. If there isn't enough memory to keep the set in memory, initialize a cuckoo filter, bloom filter, or other "approximate membership query" structure with the contents of the word set. Test for membership in the filter first - if you get a hit, that means you may have a real match, and THEN you can go to a slower query method. You can get a low enough false-positive rate for this to be a good option.
  3. If the data is too big to keep in memory, keep it on disk or elsewhere in a way that's easy to query. Some examples built into Python are dbm, shelve, and sqlite3. If using e.g. sqlite3, make sure to index the data. You can even run a local networked key-value store like Redis and still get much better performance than re-iterating over the list. (A minimal sqlite3 sketch follows this list.)
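
A minimal sketch of the sqlite3 route from option 3, assuming the file names from the question; the database file, table, and column names here are made up for illustration:

import sqlite3

# One-time setup: load the ~3 million words into an indexed SQLite table on disk.
conn = sqlite3.connect('google_words.db')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
with open('google-vectors.txt', 'r') as f:
    conn.executemany(
        'INSERT OR IGNORE INTO words VALUES (?)',
        ((line.split()[0].lower(),) for line in f)
    )
conn.commit()

# Each lookup now uses the primary-key index instead of a linear scan.
found_counter = 0
with open('vocab_training.txt', 'r') as f:
    for line in f:
        word = line.strip().lower()
        if conn.execute('SELECT 1 FROM words WHERE word = ?', (word,)).fetchone():
            found_counter += 1
print(found_counter)
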
orip