
I need to find matching names between two lists. One list has 400 names and the second has 90,000 names. I get the desired result, but the process takes more than 35 minutes. The bottleneck is the two nested for loops: with N names in one list and M in the other, that is O(N*M) comparisons, up to 400 × 90,000 = 36 million calls to fuzz.ratio. I have already removed the duplicates in both lists. Can you help improve it? I checked many other questions but couldn't get any of them working for my case. If you think I missed an existing post, please point me to it and I will try my best to understand and replicate it.

Below is my code

from fuzzywuzzy import fuzz

# Read the 400 names to look up, one per line.
with open('names.txt', 'r') as infile:
    name_list = [line.strip() for line in infile]

print(name_list)

# Read the 90,000 master names, one per line.
with open('names2.txt', 'r') as infile2:
    name_list2 = [line.strip() for line in infile2]

print(name_list2)

# For each name, keep the first master name that scores above 90.
response = {}
for name_to_find in name_list:
    for name_master in name_list2:
        if fuzz.ratio(name_to_find, name_master) > 90:
            response[name_to_find] = name_master
            break

for key, value in response.items():
    print("Key is ->" + key + "  Value is -> " + value)
ashwin3086
  • Check whether there are any duplicates in name_list and name_list2, remove them before looping, and see if that helps. – Underoos Mar 03 '19 at 18:07
  • Do you want the starting character of the name to be the same as in the master list? If so, you can limit the inner loop based on that prefix: create a dictionary keyed on the starting character and run fuzz.ratio only on that limited set. – Pari Rajaram Mar 03 '19 at 18:51
  • @SukumarRdjf Thanks for the suggestion. I have already removed duplicates in both lists. I will add that to the original question as well. – ashwin3086 Mar 03 '19 at 18:51
  • @PariRajaram Can you please elaborate on what you mean by "starting char"? Just an example: a name in list 1 could be "Ash Jones" and one in the second list could be "Ashley Jones". If the match score is more than 90, it should return that. I am still figuring out the right score, but essentially I would like it to work in that fashion. – ashwin3086 Mar 03 '19 at 18:55
  • You could create a dictionary (dict) that uses the first one or two characters of full_name as the key and the list of matching full names as the value. Then, for every name, you only have to iterate over the smaller set of names in dict[full_name[:2]] and run fuzzywuzzy on those. – Pari Rajaram Mar 03 '19 at 22:56
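
A minimal sketch of the prefix-dictionary idea from the comments above, under stated assumptions: `ratio` is a standard-library stand-in for `fuzz.ratio` so the block runs without fuzzywuzzy installed, and `build_prefix_index`, `match_by_prefix` and `prefix_len` are hypothetical names chosen for illustration. Lowercasing the prefix is also an assumption. Note that this is a heuristic: two names that differ in their first characters are never compared, so some matches above the threshold can be missed.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): 0-100 similarity from the standard library.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_prefix_index(masters, prefix_len=2):
    # Group master names by their first prefix_len characters (case-insensitive).
    index = defaultdict(list)
    for master in masters:
        index[master[:prefix_len].lower()].append(master)
    return index


def match_by_prefix(names, index, prefix_len=2, threshold=90):
    # Compare each name only against master names sharing its prefix.
    response = {}
    for name in names:
        for candidate in index.get(name[:prefix_len].lower(), []):
            if ratio(name, candidate) > threshold:
                response[name] = candidate
                break
    return response


if __name__ == "__main__":
    masters = ["Ashley Jonas", "Jon Anderson", "Zoe Park"]
    index = build_prefix_index(masters)
    # "Ron Anderson" is close to "Jon Anderson" but lives in a different bucket,
    # so it is not found: that is the price of this shortcut.
    print(match_by_prefix(["Ashley Jones", "Ron Anderson"], index))
```

The index is built once over the large list, so each lookup touches only one bucket instead of all 90,000 names.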

2 Answers


Without knowing the algorithm behind fuzz, I doubt there's much we can do to reduce the asymptotic runtime. There might be some tricks to prune obviously bad pairs, but probably not much beyond that. The other answer assumes you are doing an exact match and will not work for fuzzy string matching.

What you can try is to batch your calls and hope fuzzywuzzy has optimized some logic for batches in its process module. Something like:

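from fuzzywuzzy import process

# names400 and names90000 stand for the two lists (name_list and name_list2 in the question).
response = {}
for name in names400:
    # Score name against the whole master list in one call, then keep scores above 90.
    matches = filter(lambda x: x[1] > 90, process.extract(name, names90000, limit=90000))
    for match_name, score in matches:
        response[match_name] = name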

Also note that the GitHub page for fuzzywuzzy mentions that installing python-Levenshtein can speed up computations by 4-10x.

Dillon Davis
  • @ashwin3086 Yeah, after digging through their codebase, it's all Python, so there's no speedup like you get when you use built-in (C code) module functions. – Dillon Davis Mar 05 '19 at 03:22
  • Thanks for suggesting python-Levenshtein. It helped to a LARGE extent and brought the run time down to 4 minutes. The method above (using process) wasn't much more useful than what I had written, though. – ashwin3086 Mar 05 '19 at 03:24
  • Here are the stats for the FOR LOOP. Start Time is 1551756251.471381 End Time is 1551756415.6144588 --- 164.1430778503418 seconds --- – ashwin3086 Mar 05 '19 at 03:25

The most obvious approach is to use a hash table. Pseudocode:

  1. Identify smaller list
  2. Create hash table based on smaller list:

    hash1 ={name: 1 for name in name_list}

  3. Iterate through the second list and check if name keys exist in the first list:

    l = [name for name in name_list2 if name in hash1]

That's it. You get a list of the names that exist in both lists.
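
A hedged sketch of how the hash-table lookup above could still help when the real goal is fuzzy matching: an identical name always scores 100, so exact hits can be resolved with a set first and only the leftovers need the slow fuzzy pass. The function name `split_exact_matches` is hypothetical, and this is one possible way to combine the two ideas, not something fuzzywuzzy provides.

```python
def split_exact_matches(names, masters):
    # Build the hash-based lookup once over the large list.
    master_set = set(masters)
    # Names found verbatim are matched immediately.
    exact = {name: name for name in names if name in master_set}
    # Everything else still needs a fuzzy comparison.
    remaining = [name for name in names if name not in master_set]
    return exact, remaining


if __name__ == "__main__":
    exact, remaining = split_exact_matches(
        ["Ann Lee", "Bob Ray"], ["Bob Ray", "Cy Young"]
    )
    print(exact)
    print(remaining)
```

How much time this saves depends entirely on how many of the 400 names appear verbatim in the master list.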

Jarek.D
  • Thanks for the suggestion. Let me try that and update the results. By the way, I am not looking for an exact match or an existence check in the other list. I guess I can change the if condition to use fuzzywuzzy's token_sort_ratio. I'll try to modify that for my use case. – ashwin3086 Mar 03 '19 at 19:14
  • This doesn't seem to work, Jarek, as I am not looking for a direct match. I have to do a fuzzy match to find closeness. – ashwin3086 Mar 05 '19 at 02:28
  • Sorry, I think I fixated on the standard solution for matching two lists. Yes, it wouldn't work for your case. That's actually a hard problem: https://en.wikipedia.org/wiki/Approximate_string_matching. One possible optimization could be a pre-bucketing step: use hashing to divide all strings into groups that are extremely dissimilar, with a simpler procedure such as bucketing on string length ranges, so that 10-char strings wouldn't need to be compared to 5-char strings, etc. Just an idea. – Jarek.D Mar 05 '19 at 10:42
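
A minimal sketch of the length-bucketing idea from the last comment, under stated assumptions: `ratio` is a standard-library stand-in for `fuzz.ratio`, and `length_bound` and `match_names` are hypothetical names. The pruning rests on one observation about this kind of score, which is computed as 2 × (matching characters) / (combined length): at most min(len_a, len_b) characters can match, so two strings of very different lengths cannot exceed the threshold and need not be compared at all.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): 0-100 similarity from the standard library.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def length_bound(len_a, len_b):
    # Best possible score for two strings with these lengths.
    total = len_a + len_b
    return 100.0 if total == 0 else 200.0 * min(len_a, len_b) / total


def match_names(names, masters, threshold=90):
    # Bucket the master list by string length once.
    by_length = defaultdict(list)
    for master in masters:
        by_length[len(master)].append(master)

    response = {}
    for name in names:
        for length, bucket in by_length.items():
            # Skip whole buckets whose lengths rule out a score above the threshold.
            if length_bound(len(name), length) <= threshold:
                continue
            hit = next((m for m in bucket if ratio(name, m) > threshold), None)
            if hit is not None:
                response[name] = hit
                break
    return response


if __name__ == "__main__":
    masters = ["Ashley Jonas", "Al", "Robert Smithson", "Bob Smyth"]
    print(match_names(["Ashley Jones", "Bob Smith"], masters))
```

With a threshold of 90 the bound is tight: a 9-character name such as "Ash Jones" can score at most about 86 against a 12-character name such as "Ashley Jones", so that pair is skipped without calling the scorer.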