Quick search for a start location of a given string

Question

I would like to match a given string match_text against a longer string text and find the start location in text of the closest match to match_text (you can assume there is only one such location). My current code loops over every start position in text and computes the Levenshtein distance for the window starting there. However, the text is sometimes really long (up to 90k characters), and I'm not sure whether there is a faster way to do this string search. Here is the current version of my snippet:

import numpy as np
import Levenshtein as lev  # pip install python-Levenshtein

def find_start_position(text, match_text):
    match_len = len(match_text)
    lev_distances = []
    # + 1 so that the last full-length window of text is compared as well
    for i in range(len(text) - match_len + 1):
        lev_distances.append(lev.distance(match_text, text[i: i + match_len]))
    # index of the window with the smallest edit distance
    pos = np.argmin(lev_distances)
    return pos

# example
find_start_position('I think this is really cool.', 'this iz')
>> 8

I would appreciate it if someone knows of, or has, a quick string search.

It's definitely possible to do this by word as well. In the end, I just need the start character. — titipata, Jul 19 '19 at 17:40
Do you need a specific Python version, or just the latest? — harry hartmann, Jul 19 '19 at 20:45
Also, your snippet 'find_start_position' doesn't deliver what its name promises, a position. Instead it only tells you something like 'you have to invest XYZ costs' to convert (transform) your match_text into some substring of text (or vice versa). — harry hartmann, Jul 19 '19 at 21:16

harry hartmann · Answer 1 · 2019-07-20T17:14:51.040

Be aware that whitespace in the patterns is normalized. Is this what you are looking for?

import Levenshtein as lev # pip install python-Levenshtein
import sys
# author hry@solaris-it.com

def splitTextInWords(text):
    retVal = text.split() 
    return retVal

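def getBestFit(allLevenshteinValues):
    # [cost, matched sequence, 1-based word number]
    bestFit = [sys.maxsize, '', 0]
    for k, value in allLevenshteinValues.items():
        if value[0] < bestFit[0]:
            # build a new list instead of appending to the stored value
            bestFit = [value[0], value[1], k + 1]
    return bestFit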

def catchAllCosts(text, matchText):
    textAsWordList   = splitTextInWords(text)
    matchTextPattern = ' '.join(splitTextInWords(matchText))
    maxIndx = len(textAsWordList)
    allLevenshteinValues = {}
    for i in range(0, maxIndx):
        extCnt = 0
        textPattern = textAsWordList[i]
        # append following words until the candidate is at least as long
        # as the pattern, or the text runs out of words
        while (len(textPattern) < len(matchTextPattern)
               and i + extCnt + 1 < maxIndx):
            extCnt += 1
            textPattern = ' '.join([textPattern, textAsWordList[i + extCnt]])
        # store [cost, candidate] under the 0-based index of the first word
        allLevenshteinValues[i] = [
            lev.distance(textPattern, matchTextPattern), textPattern]
    return allLevenshteinValues

def main():
    # text: the text you are searching through
    text = '''x AlongLongLongWord and long long long long string 
    is going be  here string I think    really S is cXXXl. 
    x AlongLongLongWord 今x  Go今天今 I think really this would is cxol.x 
    AlongLongLongWord I think this izreally this iz cool.''' 
    # matchText: pattern you want to find the best match for
    matchText = 'this is'

    allLevenshteinValues = catchAllCosts(text, matchText)
    bestFit = getBestFit(allLevenshteinValues)
    costs, sequence, wordNr = bestFit[0], bestFit[1], bestFit[2]
    print("first best match starting by word nr.",
          wordNr, "costs:", costs, "sequence: >>", sequence, "<<")

    matchAnotherPattern = '今天  Go今x天今'
    allLevenshteinValues = catchAllCosts(text, matchAnotherPattern)
    bestFit = getBestFit(allLevenshteinValues)
    costs, sequence, wordNr = bestFit[0], bestFit[1], bestFit[2]
    print("first best match starting by word nr.",
          wordNr, "costs:", costs, "sequence: >>", sequence, "<<")


if __name__ == '__main__':
    main()
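
The question asks for a start character, while the code above reports a 1-based word number. Here is a minimal sketch of one way to convert between the two, assuming that for ordinary text the regex \S+ finds the same words that str.split() produces. The helper names word_start_offsets and word_number_to_char_position are hypothetical and not part of the code above.

```python
import re


def word_start_offsets(text):
    """Return the 0-based character offset at which each
    whitespace-separated word of text starts."""
    return [m.start() for m in re.finditer(r'\S+', text)]


def word_number_to_char_position(text, word_nr):
    """Map a 1-based word number (as printed by main() above)
    to the 0-based character offset of that word in text."""
    offsets = word_start_offsets(text)
    return offsets[word_nr - 1]


# 'this' is the third word of the example sentence from the question
print(word_number_to_char_position('I think this is really cool.', 3))  # → 8
```

Because the offsets are taken from the original text, they are unaffected by the whitespace normalization applied to the patterns.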
