I have a list of 10,000,000 strings, each the name of an item: 3 to 5 words, up to 80 characters.

Then I have a list of 5,000 strings to match on. Meaning, for each of the 5,000 potential match rules, I need to count how many of the 10,000,000 strings it matches.

Thus far I have iterated pairwise using something like the below:

def contains_word_fast2(check):
    # Count items that contain `check` as a whole, space-delimited word
    counter = 0
    for item in my_list_of_10mm_items:
        if (" " + check + " " in item
                or item.startswith(check + " ")
                or item.endswith(" " + check)):
            counter += 1
    return counter

I have it working just fine on a subset of the 10,000,000 strings, but I do not know how I can scale the algorithm. Any suggestions using probabilistic data structures? I understand that a MinHash or BloomFilter may have some potential, but I cannot wrap my head around how it would apply to this problem.
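For context, here is a non-probabilistic sketch of one common way to make this kind of query cheap: build a frequency table over the items once, then answer each of the 5,000 queries with a dictionary lookup instead of a full scan. This assumes the match rules are single words (the variable names `items`, `word_doc_counts`, and `contains_word_indexed` are illustrative, not from the question):

```python
from collections import Counter

# Tiny stand-in for the 10,000,000-item list
items = [
    "red widget large",
    "blue widget small",
    "red gadget small",
]

# One pass over all items: for each distinct word, count how many
# items contain it. set() ensures an item that repeats a word
# is only counted once.
word_doc_counts = Counter()
for item in items:
    word_doc_counts.update(set(item.split()))

def contains_word_indexed(check):
    # O(1) per query instead of scanning every item
    return word_doc_counts.get(check, 0)
```

With this precomputation, the 5,000 queries no longer multiply against the 10,000,000 items; the one-time pass over the items dominates. Multi-word rules would need a different index (e.g. word n-grams), so this is only a sketch for the single-word case.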

jrjames83
  • Could you be more specific about the exact problem that comes with scaling? Are you concerned about memory or evaluation time? Then regarding the problem itself: Is it possible that the item list contains duplicates? What are the matching criteria? From your `if` clause it looks like you should `strip` the `item` first and then compare for equality. In general *regular expressions* (for python the `re` package) have a high performance for string processing. – a_guest Dec 15 '16 at 17:58
  • @a_guest - evaluation time is main concern. I did clean the data prior. I actually found the opposite. Using a var in item is about 100x faster in this case than checking the return value of an re.search. – jrjames83 Dec 15 '16 at 18:05
  • Possible duplicate of [Modern, high performance bloom filter in Python?](http://stackoverflow.com/questions/311202/modern-high-performance-bloom-filter-in-python) – mVChr Dec 15 '16 at 18:43
  • @jrjames83 You should use `re.match` and pre-compile the regular expression via `re.compile`. However for mere string equality I don't think there will be a benefit. In any case you should prepare your data. From your comparison it seems like you should `strip` each item. Also in your case it might be useful to sort the items by their first letter in a dictionary and then check only those for which the first letter matches. I.e. if `check` denotes your reference item, then you only check `items[check[0]]`. You could nest this deeper by taking the 2nd letter into account as well (and so on). – a_guest Dec 16 '16 at 09:10
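The first-letter bucketing suggested in the last comment could be adapted to the whole-word matching criterion roughly like this (a hedged sketch, not the commenter's code; since a match can be any word in the item, each item is filed under the first letter of every word it contains):

```python
from collections import defaultdict

# Tiny stand-in for the real item list
items = ["apple pie", "apricot jam", "banana bread"]

# File each item under the first letter of each of its words, so a
# query only scans the bucket for its own first letter.
buckets = defaultdict(list)
for item in items:
    for letter in {word[0] for word in item.split() if word}:
        buckets[letter].append(item)

def count_matches(check):
    # Scan only items containing at least one word starting with check[0]
    candidates = buckets.get(check[0], [])
    return sum(
        " " + check + " " in item
        or item.startswith(check + " ")
        or item.endswith(" " + check)
        for item in candidates
    )
```

This keeps the original substring test but shrinks the candidate set by roughly the alphabet size; nesting on the second letter, as the comment suggests, would shrink it further at the cost of more buckets.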

0 Answers