Implementing hierarchical clustering in Python using Levenshtein distance

Question

Following up on my previous question, I have implemented a clustering algorithm for a huge number of strings using Python & Levenshtein distance..But it is taking a very long time to complete clustering. Any suggestions please?

<> iterate thro the list in a for loop for each item in list run through the list again, to find similarity percentage if similarity > threshold, move to cluster end for loop

Rewrite the hot parts in Cython. – Has QUIT--Anony-Mousse Apr 18 '16 at 05:51 — Has QUIT--Anony-Mousse, Apr 18 '16 at 05:51

score 0 · Accepted Answer · answered Apr 17 '16 at 14:47

0

First, use a profiler to see where most of the time is spent. I suspect it's in the actual Levenshtein calculation, but it's good to be sure. Iff it is:

Implement the Levenshtein function with Cython. This will give you a massive speed-up already.
Calculate the pairs in multiple threads. E.g. If you have 1000 strings, you have 1000000 pairs, so you can have each of 8 threads do 125000 of the pairs.

answered Apr 17 '16 at 14:47

Claudiu

224,032
165
485
680

thanks a lot.. let me check on this.. – G Narayan Apr 18 '16 at 11:52
yes Levenshtein function is taking 90% of the time.. – G Narayan Apr 22 '16 at 14:27

Implementing hierarchical clustering in Python using Levenshtein distance

1 Answers1