
Here is my code. It takes 17 hours to complete. Could you please suggest alternative code to reduce the computation time?

# test algorithm 1 - fuzzy
from fuzzywuzzy import fuzz

matched_pair = []
# compare every name in dataset1 against every name in dataset2
for x in dataset1['full_name_eng']:
    for y in dataset2['name']:
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))
            print((x, y))

I tried different approaches, but none of them worked.

dataset1 has 10k rows and dataset2 has 1M rows. fuzz.token_sort_ratio(x, y) is a function that takes two strings as parameters and outputs an integer: the similarity of the two strings.
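For context, token_sort_ratio splits both strings into words, sorts the words alphabetically, and compares the results, so it is insensitive to word order. A quick illustration (the example strings are made up):

from fuzzywuzzy import fuzz

# identical words in a different order -> perfect score
print(fuzz.token_sort_ratio("guram keretchashvili", "keretchashvili guram"))  # 100
# unrelated names -> low score
print(fuzz.token_sort_ratio("guram keretchashvili", "john smith"))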

  • Can you provide more details? What is dataset1? How big is it? Can you post sample data? What is fuzz? – Shiva Apr 29 '20 at 07:28
  • Split your list and process it in parallel. – xSparfuchs Apr 29 '20 at 07:34
  • Please see [ask], [help/on-topic]. – AMC Apr 29 '20 at 07:42
  • Please see, I have edited the question and added some details. – Guram Keretchashvili Apr 29 '20 at 07:47
  • You can have a look at locality sensitive hashing (LSH) for faster similar string search. [Here is an article explaining it](https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134) – A Co Apr 29 '20 at 07:47
  • It seems that `dataset1` and `dataset2` are Pandas Data Frames. In this case, I suggest that you cross-join the two data frames, and then add a new column to the result containing `fuzz.token_sort_ratio` for the rows using `df.apply`. Eventually, you can draw out the rows with the required condition using Pandas selection mechanisms. This will surely speed up your code. – Mohammed Farahmand Apr 29 '20 at 07:50
  • @ACo thank you for the information, but the problem is not the fuzzy algorithm itself. – Guram Keretchashvili Apr 29 '20 at 07:54
  • @MohammedFarahmand as mentioned, the datasets are too large, so if I cross-join them there will not be enough memory. – Guram Keretchashvili Apr 29 '20 at 07:55
  • @GuramKeretchashvili Have a look at https://stackoverflow.com/questions/42847396/fuzzy-wuzzy-string-matching-on-2-large-data-sets-based-on-a-condition-python. I think it's pretty similar to your problem. – Mohammed Farahmand Apr 29 '20 at 08:03
  • @GuramKeretchashvili LSH allows you to match your data in O(N) time, because it does not require to compare all elements one by one. The nested loop you are using now has a time complexity of O(N^2). That's why it is faster, regardless of the distance metric used. – A Co Apr 29 '20 at 08:03
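To make the LSH suggestion from the last comment concrete: instead of comparing every pair, you can index dataset2 once and compare each dataset1 name only against a handful of hash-collision candidates. Below is a minimal sketch, assuming the third-party datasketch library and treating each name as a set of character trigrams; the threshold and n-gram size are illustrative, not tuned:

from datasketch import MinHash, MinHashLSH
from fuzzywuzzy import fuzz

def ngrams(s, n=3):
    # set of character trigrams of the lowercased string
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def minhash(s, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for g in ngrams(s):
        m.update(g.encode('utf8'))
    return m

# index dataset2 once (O(N))
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for i, y in enumerate(dataset2['name']):
    lsh.insert(str(i), minhash(y))

# query with each dataset1 name; run the expensive ratio only on candidates
matched_pair = []
for x in dataset1['full_name_eng']:
    for key in lsh.query(minhash(x)):
        y = dataset2['name'].iloc[int(key)]
        if fuzz.token_sort_ratio(x, y) > 85:
            matched_pair.append((x, y))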

1 Answer


Since the dataframes are not really used here, I will simply work with the following two lists:

import string
import random

# generate two lists of random lowercase strings, 13-20 characters each
random.seed(18)
dataset1 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]
dataset2 = [''.join(random.choice(string.ascii_lowercase + ' ') for _ in range(random.randint(13, 20))) for s in range(1000)]

Running the code you provided (with fuzzywuzzy) on these two lists serves as the baseline. As a first change you could use RapidFuzz (I am the author), which does basically the same as FuzzyWuzzy but is quite a bit faster; on my test lists it was about 7 times as fast as your code. Another issue is that fuzz.token_sort_ratio always lowercases the strings and removes e.g. punctuation. While this makes sense for string matching, you are doing it repeatedly for each string in the list, which adds up when working with bigger lists. Using RapidFuzz and preprocessing each string only once is about 14 times as fast on these lists.

from rapidfuzz import fuzz, utils

# preprocess (lowercase, strip non-alphanumeric characters) each string exactly once
dataset2_processed = [utils.default_process(x) for x in dataset2]
dataset1_processed = [utils.default_process(x) for x in dataset1]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    for word2, word2_processed in zip(dataset2, dataset2_processed):
        # processor=None skips the built-in preprocessing; score_cutoff=85
        # lets RapidFuzz exit early and return 0 for scores below the cutoff
        if fuzz.token_sort_ratio(word1_processed, word2_processed, processor=None, score_cutoff=85):
            matched_pair.append((word1, word2))
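Since the inner loop above is still pure Python, a further variation (an untested sketch, assuming a recent rapidfuzz version where process.extract returns (match, score, index) triples) is to hand the whole inner loop to RapidFuzz's process module, which runs it in C++:

from rapidfuzz import process, fuzz, utils

dataset1_processed = [utils.default_process(x) for x in dataset1]
dataset2_processed = [utils.default_process(x) for x in dataset2]

matched_pair = []
for word1, word1_processed in zip(dataset1, dataset1_processed):
    # returns every (match, score, index) with score >= 85; limit=None keeps all matches
    for _, _, idx in process.extract(word1_processed, dataset2_processed,
                                     scorer=fuzz.token_sort_ratio,
                                     processor=None, score_cutoff=85,
                                     limit=None):
        matched_pair.append((word1, dataset2[idx]))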
– maxbachmann