I am trying to create a list of the edit distances between each pair of words in a set of documents ranging from 10k to 42k words. If my idea of edit distance is correct, I would end up with a distance for each word compared to every single other word, so for a corpus of 10k words each word would have 9,999 distances associated with it (roughly n*(n-1)/2, or about 50 million, unique pairs in total). Is there any way to optimize this, or is there a completely different approach, as the run time is unreasonably long? I am using Jaro-Winkler (JW) distance and I have the following code:
from pyjarowinkler import distance  # package providing get_jaro_distance

ru_distances = []
# ru_final is a list of tokenized words
# note: this loop only compares each word to the word immediately after it,
# so it yields len(ru_final) - 1 distances rather than the all-pairs result
for i, a in enumerate(ru_final[:-1]):
    b = ru_final[i + 1]
    dist = distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)
    ru_distances.append((a, b, dist))
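For comparison, here is a minimal sketch of what I think the all-pairs version I described above would look like (itertools.combinations generates each unordered pair exactly once, which works here because JW distance is symmetric; ru_all_pairs is just a name I made up for the result):

from itertools import combinations

from pyjarowinkler import distance

# one distance per unordered pair of words: n * (n - 1) / 2
# comparisons for n words, i.e. ~50 million for a 10k-word corpus
ru_all_pairs = []
for a, b in combinations(ru_final, 2):
    dist = distance.get_jaro_distance(a, b, winkler=True, scaling=0.1)
    ru_all_pairs.append((a, b, dist))

Even with duplicate pairs removed, this is still quadratic in the number of words, which is why I am asking whether there is a fundamentally better approach.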