I have a list of 10,000,000 strings, each of which is the name of an item: 3 to 5 words, up to 80 characters long.
Then I have a list of 5,000 strings to match against. That is, for each of the 5,000 potential match rules, I need to count how many of the 10,000,000 strings it matches.
Thus far I have iterated pairwise using something like the code below:
def contains_word_fast2(check):
    """Count how many items contain `check` as a whole word."""
    counter = 0
    try:
        for item in my_list_of_10mm_items:
            # match `check` as a whole word: in the middle, at the start,
            # or at the end of the item string
            if " " + check + " " in item \
                    or item.startswith(check + " ") \
                    or item.endswith(" " + check):
                counter += 1
    except Exception:
        # bail out on bad input (e.g. a non-string check)
        return 0
    return counter
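For context, the driving loop simply calls this function once per rule, so every rule triggers a full pass over all 10,000,000 items. A minimal sketch of that outer loop (the name match_rules is illustrative; it is just my list of 5,000 strings):

# One full scan of the 10,000,000 items per rule: 5,000 scans in total.
# match_rules is an illustrative name for my list of 5,000 match strings.
match_counts = {rule: contains_word_fast2(rule) for rule in match_rules}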
I have it working just fine on a subset of the 10,000,000 strings, but I do not know how to scale the algorithm to the full data set: that is 5,000 × 10,000,000 = 50,000,000,000 pairwise checks. Any suggestions using probabilistic data structures? I understand that MinHash or a Bloom filter may have some potential, but I cannot wrap my head around how either would apply to this problem.