1

How can I check whether a substring occurs in a string within a given edit distance tolerance? For example:

str = 'Python is a multi-paradigm, dynamically typed, multipurpose programming language, designed to be quick (to learn, to use, and to understand), and to enforce a clean and uniform syntax.'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1

substr_in_str(str, substr1, edit_distance_tolerance)
>> True

substr_in_str(str, substr2, edit_distance_tolerance)
>> False

substr_in_str(str, substr3, edit_distance_tolerance)
>> True

What I tried: I split the string into words, removed the special characters, and compared the words one by one, but the performance (in terms of both speed and accuracy) is not very good.

R.yan

2 Answers

0

The answer is not as simple as you might think: it takes a fair amount of mathematics, and the standard re (regex) library can't solve this problem. I think the TRE library has solved it to a large extent; see https://github.com/laurikari/tre/
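To give an idea of what approximate substring matching involves, here is a minimal pure-Python sketch of the classic dynamic-programming approach. It is not TRE's API or implementation, and the name `fuzzy_find` is made up for this example. Note that under this definition 'thon' matches with zero edits because it literally occurs inside 'Python', which differs from the False the question expects.

```python
def fuzzy_find(text, pattern, max_edits):
    """Return True if some substring of text is within max_edits
    (Levenshtein) edits of pattern. Hypothetical helper, sketch only."""
    # prev[j] = fewest edits to turn pattern[:j] into a substring of
    # text that ends at the current position. Before any text is read,
    # only the empty substring is available, so the cost is j.
    prev = list(range(len(pattern) + 1))
    if prev[-1] <= max_edits:
        return True
    for ch in text:
        # A match may start anywhere, so the empty pattern always costs 0.
        curr = [0]
        for j, pc in enumerate(pattern, start=1):
            cost = 0 if pc == ch else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            prev[j] + 1,         # extra character in text
                            curr[j - 1] + 1))    # pattern character missing from text
        if curr[-1] <= max_edits:
            return True
        prev = curr
    return False


text = 'Python is a multi-paradigm'
print(fuzzy_find(text, 'cython', 1))  # True: drop the 'c' and 'ython' is found
print(fuzzy_find(text, 'cython', 0))  # False: no exact occurrence
```

The running time is proportional to len(text) * len(pattern), and only two rows of the table are kept in memory.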

0

Here is a recursive solution that I came up with, hope it's correct:

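def substr_in_str_word(string, substr, edit_distance_tolerance):
    # Out of edits: this branch cannot succeed.
    if edit_distance_tolerance < 0:
        return False

    # All of substr has been consumed; the rest of the word is ignored.
    if len(substr) == 0:
        return True

    # Word exhausted: each leftover character of substr costs one edit.
    if len(string) == 0:
        return len(substr) <= edit_distance_tolerance

    # Only the first characters are compared; the recursion handles the rest.
    if string[0] == substr[0]:
        return substr_in_str_word(string[1:], substr[1:], edit_distance_tolerance)

    # Mismatch: spend one edit on a substitution, on skipping a character
    # of the word, or on skipping a character of substr.
    return (substr_in_str_word(string[1:], substr[1:], edit_distance_tolerance - 1) or
            substr_in_str_word(string[1:], substr, edit_distance_tolerance - 1) or
            substr_in_str_word(string, substr[1:], edit_distance_tolerance - 1))


def substr_in_str(string, substr, edit_distance_tolerance):
    # Check each space-separated word on its own.
    for word in string.split(' '):
        if substr_in_str_word(word, substr, edit_distance_tolerance):
            return True
    return False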

Testing:

str = 'Python is a multi-paradigm'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'

edit_distance_tolerance = 1

print(substr_in_str(str, substr1, edit_distance_tolerance))
print(substr_in_str(str, substr2, edit_distance_tolerance))
print(substr_in_str(str, substr3, edit_distance_tolerance))

Output:

True
False
True
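
The recursion branches three ways on every mismatch, so it can get slow on long words. As a rough alternative sketch, assuming the intended check is "some word is within the tolerance of the search term" (which is what the expected outputs in the question suggest), the plain Levenshtein distance can be computed iteratively for each word. The names `levenshtein` and `word_within_distance` are made up for this sketch, and tokenizing with `re.findall(r"\w+", ...)` is an assumption about how punctuation should be handled.

```python
import re


def levenshtein(a, b):
    """Edit distance between a and b, keeping two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                            prev[j] + 1,               # delete ca
                            curr[j - 1] + 1))          # insert cb
        prev = curr
    return prev[-1]


def word_within_distance(text, term, tolerance):
    # Assumption: words are runs of letters, digits and underscores.
    return any(levenshtein(word, term) <= tolerance
               for word in re.findall(r"\w+", text))


text = 'Python is a multi-paradigm'
print(word_within_distance(text, 'ython', 1))   # True
print(word_within_distance(text, 'thon', 1))    # False
print(word_within_distance(text, 'cython', 1))  # True
```

Unlike the recursive version, this compares against whole words, so a short term no longer matches just because it resembles the start of a longer word.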
0x90