I am trying to build a list of the edit distances between the words in a set of documents, ranging from 10k to 42k words. If my understanding of edit distance is correct, I would end up with a distance for each word compared to every other word, so in a corpus of 10k words each word would have 9,999 distances associated with it. Is there any way to optimize this, or is there a completely different approach? The run time is unreasonably long. I am using Jaro-Winkler (JW) distance and I have the following code:

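# Assumed import: the call below matches the pyjarowinkler package
from pyjarowinkler import distance

ru_distances = []
# ru_final is a list of tokenized words
# Compare each word with the word that immediately follows it
for i, a in enumerate(ru_final[:-1]):
    b = ru_final[i + 1]
    dist = distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)
    ru_distances.append((a, b, dist))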
dmoses
  • This is an [XY problem](https://xyproblem.info/). What is the purpose of the list of edit distances? – Alexander Dec 06 '22 at 23:26
  • The end goal is to cluster words with the lowest edit distance together. – dmoses Dec 06 '22 at 23:35
  • Your current code doesn't even do what you describe. All it does is compare each word to the word immediately after it and then move on to the next word. That has an O(n) runtime, where n is the length of ru_final. What you describe sounds more like an O(n*n) runtime. – Alexander Dec 06 '22 at 23:51
  • I agree with Alexander: the design is likely wrong because the definition of the task is not clear. First question: are you sure you're interested in clustering words with similar *spelling* (not similar *meaning*)? Assuming yes, you certainly need a double loop, as Alexander said. Finally, for the efficiency issue, you could look at the [blocking](https://en.wikipedia.org/wiki/Record_linkage) technique. – Erwan Dec 07 '22 at 11:24
  • Yeah, I recognized that what I wrote was not doing what I intended. I couldn't figure out a way to do something like what you mentioned, which is O(n*n), within a reasonable run time. Also, I have another program that compares the meanings of words; I am going to end up combining them. – dmoses Dec 07 '22 at 22:24
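
To make the double loop and the blocking idea from the comments concrete, here is a minimal sketch, not a definitive implementation. For scale: 10,000 distinct words give 10,000 × 9,999 / 2 = 49,995,000 unordered pairs, so deduplicating the tokens and scoring each pair only once already helps, and blocking cuts the count further by only comparing words that share a key. Everything named here is an assumption for illustration: `blocked_pairwise`, `block_key`, and the two-character prefix key are hypothetical, and `stand_in_similarity` uses the standard library's `difflib.SequenceMatcher` as a placeholder rather than Jaro-Winkler. You would pass your own scoring function as `similarity`.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_similarity(a, b):
    """Stdlib placeholder score in [0, 1]; this is NOT Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def block_key(word, prefix_len=2):
    """Hypothetical blocking key: the first `prefix_len` characters."""
    return word[:prefix_len]


def blocked_pairwise(words, similarity=stand_in_similarity, prefix_len=2):
    """Score each unordered pair of distinct words that share a blocking key."""
    blocks = defaultdict(list)
    # Deduplicate first: repeated tokens would only repeat the same scores.
    for word in sorted(set(words)):
        blocks[block_key(word, prefix_len)].append(word)

    scores = []
    for block in blocks.values():
        # combinations() yields each pair once, so (a, b) is never
        # recomputed as (b, a) and a word is never compared with itself.
        for a, b in combinations(block, 2):
            scores.append((a, b, similarity(a, b)))
    return scores
```

Calling `blocked_pairwise(ru_final)` returns `(a, b, score)` tuples in the same shape as `ru_distances`. The trade-off is that prefix blocking never compares words whose first characters differ, so some similar pairs will be missed. Jaro-Winkler already rewards a shared prefix, which may make that loss smaller, but that is worth checking on your own data.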
