I built a recursive function to calculate the edit distance between two strings. I need to run it over thousands of distinct sentences to construct several JSON files for an app I'm updating. The edit distance function gives good results, but I think it could stand to be simplified:
from typing import List, Tuple

def rec_edit_dist(of_str_a: str, of_str_b: str) -> int:
    # Base case: if either string is empty, the distance is the
    # length of whatever remains of the other.
    if of_str_a == str() or of_str_b == str():
        return len(of_str_a) if of_str_b == str() else len(of_str_b)
    if len(of_str_a) < len(of_str_b):
        short_str, long_str = of_str_a, of_str_b
    else:
        short_str, long_str = of_str_b, of_str_a
    del of_str_a, of_str_b
    # Slide a growing window over the short string, tracking the
    # longest substring that also appears in the long string.
    s, e = 0, 1
    best_match: str = str()
    for e in range(1, len(short_str) + 1):
        substr = short_str[s:e]
        if substr in long_str:
            if len(best_match) < len(substr):
                best_match = substr
        else:
            s += 1
    if best_match == str():
        return len(long_str)
    # Split out the common substring and recurse on the pieces on
    # either side of it.
    short_split = short_str.split(best_match, 1)
    long_split = long_str.split(best_match, 1)
    return sum(rec_edit_dist(a, b) for a, b in zip(short_split, long_split))
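A couple of quick sanity checks (for these particular pairs, the result happens to match the true Levenshtein distance):

# Sanity checks: both pairs happen to match the true
# Levenshtein distance here.
assert rec_edit_dist("kitten", "sitting") == 3
assert rec_edit_dist("flaw", "lawn") == 2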
def filt_by_best_dists(this_str: str,
                       against_strs: List[str]) -> List[Tuple[str, int]]:
    ts_len = len(this_str)
    # Check candidates closest in length first, so the loop can stop
    # once the length difference alone exceeds the best distance seen.
    against_strs.sort(key=lambda i: abs(len(i) - ts_len))
    best_dist = 1000
    keeps: List[Tuple[str, int]] = list()
    for that_str in against_strs:
        if abs(len(that_str) - ts_len) > best_dist:
            break
        that_ed = rec_edit_dist(of_str_a=this_str, of_str_b=that_str)
        if that_ed > 0 and that_ed < best_dist:
            best_dist = that_ed
            keeps += [(that_str, that_ed)]
    return keeps
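A small example of how I call it, with made-up candidates; note that earlier, worse matches stay in the returned list alongside the later, better ones:

candidates = ["cart", "cat", "dog", "catalog"]
print(filt_by_best_dists("cat", candidates))
# [('dog', 3), ('cart', 1)]
# "cat" itself is skipped (distance 0), and "catalog" is never
# checked: its length difference (4) exceeds the best distance (1).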
Essentially, filt_by_best_dists stops checking candidates whose length difference alone already exceeds the best distance, tightening that cutoff as better options get found; rec_edit_dist just weeds out the longest common substrings (which contribute 0 to the edit distance) until one side of a split is blank, at which point it returns the length of whatever remains on the other side.
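To make that recursion concrete, here is the trace for one of the pairs above:

rec_edit_dist("flaw", "lawn")
# longest common substring: "law"
# splitting it out: "lawn" -> ["", "n"], "flaw" -> ["f", ""]
# rec_edit_dist("", "f") -> 1   (base case: one string blank)
# rec_edit_dist("n", "") -> 1
# total: 2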
Are there any other tricks I could employ to get these two functions running as fast as possible? I plan to run this on 600 threads to cover all of the information I need, and the against_strs parameter can contain thousands of strings.
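For context, a minimal sketch of the driver I have in mind; load_sentences and the output filename are placeholders, not my real data:

import json
from concurrent.futures import ThreadPoolExecutor

def best_matches_for(sentence: str,
                     corpus: List[str]) -> Tuple[str, List[Tuple[str, int]]]:
    # filt_by_best_dists sorts against_strs in place, so each task
    # gets its own copy rather than sharing one list across threads.
    return sentence, filt_by_best_dists(sentence, list(corpus))

sentences = load_sentences()  # placeholder: thousands of distinct sentences
with ThreadPoolExecutor(max_workers=600) as pool:
    results = dict(pool.map(lambda s: best_matches_for(s, sentences), sentences))

with open("matches.json", "w") as out_file:
    json.dump(results, out_file)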