I have two tables containing 2 million records each. One has the item names and the other has item descriptions along with other attributes. I have to match each item in table 1 with each description in table 2 to find the maximum-similarity match. So basically, for each of the 2 million items, I have to scan the other table to find the best match. That makes 2 million * 2 million comparisons! How do I go about doing that efficiently in Python? As it stands now, it would take years to compute.
Right now the approach I am following is a regex search: I split each item name into a list of words and then check whether each word is contained in the description. If it is, I increase the match count by 1, and from that count I calculate a similarity score.
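Roughly, the current approach looks like this (a simplified sketch; the `items`/`descriptions` variables and the matched-fraction normalisation are just placeholders for what I actually do):

```python
import re

def similarity(item_name, description):
    # Split the item name into individual words
    words = re.findall(r"\w+", item_name.lower())
    desc = description.lower()
    # Count how many of the item's words appear in the description
    matches = sum(1 for w in words if w in desc)
    # Similarity as the fraction of matched words
    return matches / len(words) if words else 0.0

# Brute-force scan: for every item, find the best-matching description.
# This is the 2 million x 2 million loop that is far too slow.
best_match = {}
for item in items:  # ~2M item names from table 1
    best_match[item] = max(descriptions, key=lambda d: similarity(item, d))
```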
So my questions are:
How can I make these computations faster? Should I use multithreading, split the data, or something along those lines?
Is there another similarity algorithm that would work better here? Please note that I am comparing a short name against a full description, so cosine similarity etc. don't seem to work because the two sides have very different numbers of words.