-3

Following up on my previous question, I have implemented a clustering algorithm for a huge number of strings using Python & Levenshtein distance..But it is taking a very long time to complete clustering. Any suggestions please?

<> iterate thro the list in a for loop for each item in list run through the list again, to find similarity percentage if similarity > threshold, move to cluster end for loop

G Narayan
  • 7
  • 3

1 Answers1

0

First, use a profiler to see where most of the time is spent. I suspect it's in the actual Levenshtein calculation, but it's good to be sure. Iff it is:

  1. Implement the Levenshtein function with Cython. This will give you a massive speed-up already.
  2. Calculate the pairs in multiple threads. E.g. If you have 1000 strings, you have 1000000 pairs, so you can have each of 8 threads do 125000 of the pairs.
Claudiu
  • 224,032
  • 165
  • 485
  • 680