I have a list of 10,000,000 strings, each the name of an item: 3 to 5 words, up to 80 characters.

Then I have a list of 5,000 strings to match on. Meaning, for each of the 5,000 potential match rules, I need to count how many of the 10,000,000 strings it matches.

Thus far I have iterated pairwise using something like the below:

def contains_word_fast2(check):
    # Count items that contain `check` as a whole, space-delimited word
    counter = 0
    for item in my_list_of_10mm_items:
        if (" " + check + " " in item
                or item.startswith(check + " ")
                or item.endswith(" " + check)):
            counter += 1
    return counter

I have it working just fine on a subset of the 10,000,000 strings, but I do not know how I can scale the algorithm. Any suggestions using probabilistic data structures? I understand that a MinHash or BloomFilter may have some potential, but I cannot wrap my head around how it would apply to this problem.
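For context, here is a non-probabilistic sketch of one common way to make this kind of query cheap: build a frequency table over the items once, then answer each of the 5,000 queries with a dictionary lookup instead of a full scan. This assumes the match rules are single words (the variable names `items`, `word_doc_counts`, and `contains_word_indexed` are illustrative, not from the question):

```python
from collections import Counter

# Tiny stand-in for the 10,000,000-item list
items = [
    "red widget large",
    "blue widget small",
    "red gadget small",
]

# One pass over all items: for each distinct word, count how many
# items contain it. set() ensures an item that repeats a word
# is only counted once.
word_doc_counts = Counter()
for item in items:
    word_doc_counts.update(set(item.split()))

def contains_word_indexed(check):
    # O(1) per query instead of scanning every item
    return word_doc_counts.get(check, 0)
```

With this precomputation, the 5,000 queries no longer multiply against the 10,000,000 items; the one-time pass over the items dominates. Multi-word rules would need a different index (e.g. word n-grams), so this is only a sketch for the single-word case.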

jrjames83
  • Could you be more specific about the exact problem that comes with scaling? Are you concerned about memory or evaluation time? Then regarding the problem itself: Is it possible that the item list contains duplicates? What are the matching criteria? From your `if` clause it looks like you should `strip` the `item` first and then compare for equality. In general *regular expressions* (for python the `re` package) have a high performance for string processing. – a_guest Dec 15 '16 at 17:58
  • @a_guest - evaluation time is main concern. I did clean the data prior. I actually found the opposite. Using a var in item is about 100x faster in this case than checking the return value of an re.search. – jrjames83 Dec 15 '16 at 18:05
  • Possible duplicate of [Modern, high performance bloom filter in Python?](http://stackoverflow.com/questions/311202/modern-high-performance-bloom-filter-in-python) – mVChr Dec 15 '16 at 18:43
  • @jrjames83 You should use `re.match` and pre-compile the regular expression via `re.compile`. However for mere string equality I don't think there will be a benefit. In any case you should prepare your data. From your comparison it seems like you should `strip` each item. Also in your case it might be useful to sort the items by their first letter in a dictionary and then check only those for which the first letter matches. I.e. if `check` denotes your reference item, then you only check `items[check[0]]`. You could nest this deeper by taking the 2nd letter into account as well (and so on). – a_guest Dec 16 '16 at 09:10
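The first-letter bucketing suggested in the last comment could be adapted to the whole-word matching criterion roughly like this (a hedged sketch, not the commenter's code; since a match can be any word in the item, each item is filed under the first letter of every word it contains):

```python
from collections import defaultdict

# Tiny stand-in for the real item list
items = ["apple pie", "apricot jam", "banana bread"]

# File each item under the first letter of each of its words, so a
# query only scans the bucket for its own first letter.
buckets = defaultdict(list)
for item in items:
    for letter in {word[0] for word in item.split() if word}:
        buckets[letter].append(item)

def count_matches(check):
    # Scan only items containing at least one word starting with check[0]
    candidates = buckets.get(check[0], [])
    return sum(
        " " + check + " " in item
        or item.startswith(check + " ")
        or item.endswith(" " + check)
        for item in candidates
    )
```

This keeps the original substring test but shrinks the candidate set by roughly the alphabet size; nesting on the second letter, as the comment suggests, would shrink it further at the cost of more buckets.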

0 Answers