
I am working on a string matching problem and use the fuzzywuzzy module to get similarity scores.

My target data is around 67K rows and my reference data is almost 4M rows. I created a loop, and one iteration takes roughly 19 minutes. Is there any way to make my loop run faster?

%%timeit
from fuzzywuzzy import process

df11['NEW'] = ""
for i in range(0, 4):
    # each extractOne call scans the entire ~4M-row df['Description 2']
    df11.loc[i, 'NEW'] = process.extractOne(df11['Desc 1'][i], df['Description 2'])

df11.head()
  • This takes as long as it takes. If you want it to be faster, you'll have to be innovative – zvone Nov 11 '19 at 08:32
  • Thank you for the update. I've posted this question here because I wanted to find an innovative way, as I currently don't know of any. – adhvait kansara Nov 11 '19 at 08:35
  • By innovative I think 'zvone' means re-write it in C! That's what I did to speed up a test of RSA. It still took a week! – Paula Thomas Nov 11 '19 at 08:35
  • This is the place to find solutions to known problems. Innovative is the opposite of that ;) I can give you a hint: don't use the slow library, implement a faster one. Or, don't use big sets of data, use smaller ones – zvone Nov 11 '19 at 08:38
  • You should ask this in [codereview](https://codereview.stackexchange.com/) – Guy Nov 11 '19 at 08:39
  • @zvone - I've tried it on smaller data and it worked fine, but the problem is that I need it to run on my entire dataset. You told me to be innovative, and I've posted my problem here to find a solution. :) – adhvait kansara Nov 11 '19 at 08:49
  • Did you look into the [multiprocessing](https://docs.python.org/3.8/library/multiprocessing.html) package yet? I'm not sure, but it might help you. – jofrev Nov 11 '19 at 09:05
  • when you say "67K" do you mean that you have 67 thousand items in your list, or that it's 67 kilobytes of data? same with the "4M" – Sam Mason Nov 11 '19 at 09:36
  • @SamMason 67 thousand records to be matched against 4 million records. – adhvait kansara Nov 11 '19 at 09:52
  • my question obviously wasn't clear, let's try again! what are "data"? are these rows, the total size of your input file, or something else? – Sam Mason Nov 11 '19 at 10:06
  • This sounds like a good place to use Spark... Also, does your "processing time" include loading your data, which to me is massive for any regular PC/laptop setup? You might want to profile to see where your processing spends its time before deciding what you can do to improve speed. – Jason Chia Nov 11 '19 at 10:24
  • @SamMason both are row counts, 67K and 4M, not the size of my input file. Also, I didn't have any problem loading the datasets; both work fine. Both contain strings, e.g. "My address is xyz" and "My add is xy". – adhvait kansara Nov 11 '19 at 11:39

1 Answer


assuming:

  1. that the target/choice strings are all relatively long (e.g. >20 characters) and not all very similar (e.g. differing by just one or two characters)
  2. the edit distance between the query and "best" target is relatively small (e.g. <10% characters modified)

then I'd probably use trigrams to index the strings and then skip target lines that don't share enough trigrams with the query (a rough sketch of this is below)
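
Something like the sketch below is what I have in mind. It's a minimal illustration rather than my actual code, and it assumes your data looks like it does in the question: `df11['Desc 1']` holds the ~67k query strings and `df['Description 2']` holds the ~4M choices. The `min_shared` threshold and the trigram length are both knobs you'd need to tune:

```python
# Minimal trigram pre-filter sketch: index the choice strings by their
# 3-character substrings, then only run the expensive fuzzy score against
# choices that share enough trigrams with the query.
from collections import defaultdict

from fuzzywuzzy import fuzz


def trigrams(s):
    """Return the set of 3-character substrings of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(choices):
    """Map each trigram to the indices of the choice strings containing it."""
    index = defaultdict(set)
    for idx, text in enumerate(choices):
        for gram in trigrams(text):
            index[gram].add(idx)
    return index


def best_match(query, choices, index, min_shared=2):
    """Score only choices sharing at least `min_shared` trigrams with `query`."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for idx in index.get(gram, ()):
            counts[idx] += 1
    candidates = [idx for idx, n in counts.items() if n >= min_shared]
    if not candidates:
        return None  # nothing shares enough trigrams with the query
    return max(
        ((choices[idx], fuzz.ratio(query, choices[idx])) for idx in candidates),
        key=lambda pair: pair[1],
    )
```

Building the index is a one-off cost (the 15 seconds mentioned below for the newsgroup data); after that, each query only touches the choices it shares trigrams with, e.g.:

```python
choices = df['Description 2'].tolist()
index = build_index(choices)
df11['NEW'] = [best_match(q, choices, index) for q in df11['Desc 1']]
```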

I've been having a play with the "20 newsgroups" dataset, and on my laptop it takes:

  • 45 seconds to run fuzzywuzzy.extractOne using these lines as the choices/target
  • 0.3 seconds to find the nearest string using trigrams

this was after taking:

  1. 6 seconds to load 477948 lines of text from 18828 emails
  2. 15 seconds to turn the lines into a dictionary of 317324 trigrams

my code is pretty hacky, but I could tidy it up. That would probably reduce the total runtime to a day or so for all 67k of your query strings, and maybe just a few hours if you did this in parallel with multiprocessing (see the sketch below)
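
For the multiprocessing part, here's a rough sketch of how I'd parallelise the per-query lookups. Everything here is illustrative: the CSV file names are hypothetical, and `process.extractOne` is standing in for whatever per-query lookup you end up using (e.g. the trigram-filtered `best_match` above).

```python
# Rough multiprocessing sketch: each worker process gets its own copy of the
# choice strings and handles a chunk of the query strings via Pool.map.
from multiprocessing import Pool

import pandas as pd
from fuzzywuzzy import process

_choices = None  # set once per worker process by init_worker


def init_worker(choices):
    """Store the choice strings in a module-level global for this worker."""
    global _choices
    _choices = choices


def match_one(query):
    """Find the best-scoring choice for a single query string."""
    return process.extractOne(query, _choices)


if __name__ == '__main__':
    # hypothetical file names; load df and df11 however you currently do
    df = pd.read_csv('reference.csv')   # ~4M rows, column 'Description 2'
    df11 = pd.read_csv('targets.csv')   # ~67k rows, column 'Desc 1'

    choices = df['Description 2'].tolist()
    queries = df11['Desc 1'].tolist()

    # the choices list is pickled once per worker at startup, which is slow
    # for 4M strings but only paid once
    with Pool(initializer=init_worker, initargs=(choices,)) as pool:
        df11['NEW'] = pool.map(match_one, queries)
```

The same pattern works with the trigram filter: build the index in `init_worker` and call `best_match` instead of `process.extractOne`.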

Sam Mason