
I'm trying to cluster similar patterns in a list using the Affinity Propagation clustering method. `self_pat` is a list containing 80K patterns that need to be clustered. I'm using the following code:

import numpy as np
from sklearn.cluster import AffinityPropagation

self_pat = np.asarray(self_pat)  # So that indexing with a list will work
lev_similarity = -1 * np.array([[calculate_levenshtein_distance(w1, w2) for w1 in self_pat] for w2 in self_pat])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

for cluster_id in np.unique(affprop.labels_):
    exemplar = self_pat[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(self_pat[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

The calculate_levenshtein_distance function is as follows:

from difflib import ndiff

def calculate_levenshtein_distance(str_1, str_2):
    """
    The Levenshtein distance is a string metric for measuring the difference between two sequences.
    It is calculated as the minimum number of single-character edits necessary to transform one string into the other.
    """
    distance = 0
    buffer_removed = buffer_added = 0
    for x in ndiff(str_1, str_2):
        code = x[0]
        # Code '?' is ignored, as it does not translate to any modification
        if code == ' ':
            distance += max(buffer_removed, buffer_added)
            buffer_removed = buffer_added = 0
        elif code == '-':
            buffer_removed += 1
        elif code == '+':
            buffer_added += 1
    distance += max(buffer_removed, buffer_added)
    return distance
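
For example, on the same "hello"/"jello" pair used in the timing comments below, the function returns the expected edit distance of 1:

print(calculate_levenshtein_distance("hello", "jello"))  # prints 1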

The above program uses three loops to build the similarity matrix, and thus the clustering takes a long time. Is there any way I can reduce the complexity of the program?

Aafaz
  • Is there a reason you're not using [Levenshtein](https://pypi.org/project/python-Levenshtein/)? It would be much faster than pure Python. – DarrylG Mar 30 '21 at 16:31
  • I tried installing the Levenshtein package using the `pip3 install python-levenshtein` command, but it gives an error: `could not find a version that satisfies the requirement python-levenshtein`. – Aafaz Mar 30 '21 at 16:56
  • @Aafaz--I remember having a similar problem over a year ago (i.e. incompatibility of different libraries including Levenshtein, Numpy, Pandas, etc.). My solution was a new virtual environment where I installed fresh packages (i.e. latest Python, Pandas, Numpy, etc.). On a simple test (comparing "hello" vs "jello") the library package was over 200 times faster than the calculate_levenshtein_distance code. – DarrylG Mar 30 '21 at 17:32
  • @Aafaz--80K patterns implies lev_similarity has (80K x 80K) = 6.4e9 elements. `calculate_levenshtein_distance` takes 55 us on simple word pairs (my machine), which comes to 98 hours for 6.4e9 pairs. Using the Levenshtein module this reduces to ~0.239 us per pair, or 0.42 hours for 6.4e9 pairs. Thus, it is worthwhile to use the Levenshtein module. – DarrylG Mar 30 '21 at 23:25
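
For reference, here is a minimal sketch of the drop-in replacement DarrylG suggests, assuming the python-Levenshtein package installs cleanly in your environment (it exposes a C-backed `Levenshtein.distance` function that computes the same metric):

import Levenshtein  # pip3 install python-Levenshtein

print(Levenshtein.distance("hello", "jello"))  # 1

# Swapped directly into the similarity-matrix construction from the question:
lev_similarity = -1 * np.array([[Levenshtein.distance(w1, w2) for w1 in self_pat] for w2 in self_pat])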

1 Answer


For smaller datasets, the time to completion is usually fine; for very large datasets, the time it takes to complete a job becomes intolerable. Clustering doesn't scale well, as you are finding out: with a precomputed affinity matrix, both memory and run time grow quadratically with the number of patterns (80K patterns means an 80K x 80K matrix of 6.4e9 entries). Maybe you can just take a random sample from your full dataset.

# Fraction of rows
# here you get 25% of the rows
df.sample(frac=0.25)
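
Since `self_pat` in the question is a NumPy array rather than a DataFrame, here is a sketch of the equivalent sampling step; the 25% fraction and the seed are arbitrary example values:

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility
n_sample = int(0.25 * len(self_pat))  # keep 25% of the patterns
sample_idx = rng.choice(len(self_pat), size=n_sample, replace=False)
self_pat_sample = self_pat[sample_idx]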
ASH