
I built a recursive function to calculate the edit distance between two strings. I need to run it over thousands of distinct sentences in order to construct several JSON files for an app I'm updating. The edit distance function is giving good results, but I think it could stand to be simplified:

from typing import List, Tuple


def rec_edit_dist(of_str_a: str, of_str_b: str) -> int:
    # Base case: if either string is empty, the distance is the length
    # of the other (max() covers both orderings, including both empty).
    if not of_str_a or not of_str_b:
        return max(len(of_str_a), len(of_str_b))
    if len(of_str_a) < len(of_str_b):
        short_str, long_str = of_str_a, of_str_b
    else:
        short_str, long_str = of_str_b, of_str_a
    # Slide a window over short_str, growing it while the window's
    # contents still occur somewhere in long_str.
    s = 0
    best_match = ""
    for e in range(1, len(short_str) + 1):
        substr = short_str[s:e]
        if substr in long_str:
            if len(best_match) < len(substr):
                best_match = substr
        else:
            s += 1
    # No common substring at all: every character must change.
    if not best_match:
        return len(long_str)
    # Split both strings around the common substring and recurse on the
    # left and right remainders.
    short_split = short_str.split(best_match, 1)
    long_split = long_str.split(best_match, 1)
    return sum(rec_edit_dist(a, b) for a, b in zip(short_split, long_split))


def filt_by_best_dists(this_str: str,
                       against_strs: List[str]) -> List[Tuple[str, int]]:
    ts_len = len(this_str)
    # Sort by length difference so the loop below can stop early; note
    # this sorts the caller's list in place.
    against_strs.sort(key=lambda i: abs(len(i) - ts_len))
    best_dist = 1000  # sentinel: assume no distance exceeds this
    keeps: List[Tuple[str, int]] = []
    for that_str in against_strs:
        # The length difference is a lower bound on the edit distance,
        # so nothing past this point can beat best_dist.
        if abs(len(that_str) - ts_len) > best_dist:
            break
        that_ed = rec_edit_dist(of_str_a=this_str, of_str_b=that_str)
        if 0 < that_ed < best_dist:
            best_dist = that_ed
            keeps.append((that_str, that_ed))
    return keeps

Essentially, filt_by_best_dists skips entries whose lengths differ too much from this_str as better options get found, and rec_edit_dist just weeds out the longest common substrings (each contributing 0 to the edit distance) until either split string is blank, at which point it returns the length of the longer one.
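To make that splitting scheme concrete, here is the function exercised on the classic kitten/sitting pair (the body is copied from the question so the example is self-contained; for this particular pair the result happens to match the true Levenshtein distance):

```python
# rec_edit_dist copied verbatim from the question, cleaned only for idiom.
def rec_edit_dist(of_str_a: str, of_str_b: str) -> int:
    if not of_str_a or not of_str_b:
        return max(len(of_str_a), len(of_str_b))
    if len(of_str_a) < len(of_str_b):
        short_str, long_str = of_str_a, of_str_b
    else:
        short_str, long_str = of_str_b, of_str_a
    s = 0
    best_match = ""
    for e in range(1, len(short_str) + 1):
        substr = short_str[s:e]
        if substr in long_str:
            if len(best_match) < len(substr):
                best_match = substr
        else:
            s += 1
    if not best_match:
        return len(long_str)
    short_split = short_str.split(best_match, 1)
    long_split = long_str.split(best_match, 1)
    return sum(rec_edit_dist(a, b) for a, b in zip(short_split, long_split))

# "itt" is the longest shared substring, leaving ("k", "en") vs ("s", "ing"),
# which recurse down to a total of 3.
print(rec_edit_dist("kitten", "sitting"))  # 3
print(rec_edit_dist("abc", "abc"))         # 0
```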

Are there any other tricks I could employ to get these two functions running as fast as possible? I plan to run this on 600 threads to cover all of the information I need, and the against_strs parameter can contain thousands of strings.
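For context, the fan-out I have in mind looks roughly like this sketch (the sentence lists, worker count, and toy distance function below are placeholders standing in for the real data and for filt_by_best_dists):

```python
# Hypothetical fan-out: map a per-sentence filter over many sentences
# with a thread pool. toy_filt is a placeholder, NOT the real function.
from concurrent.futures import ThreadPoolExecutor

def toy_filt(this_str, against_strs):
    # Stand-in "distance": absolute length difference per candidate.
    return [(s, abs(len(s) - len(this_str))) for s in against_strs]

sentences = ["alpha", "beta", "gamma"]          # placeholder inputs
candidates = ["alphas", "betas", "gamma rays"]  # placeholder pool

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: toy_filt(s, candidates), sentences))

print(len(results))  # one result list per sentence
```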

Joshua Harwood
  • Have you considered compiling the code using `numba`? I have used numba mostly for numbers though, not sure if there are some caveats for using numba and strings. I can without a doubt recommend using `line_profiler` to see what exactly is your bottleneck and focus on those lines. – dankal444 May 10 '22 at 09:20
  • Also, providing [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) would help others help you – dankal444 May 10 '22 at 09:24

0 Answers