If you are writing code to fuzzily match two strings, e.g. "coca-cola" vs. "koca-cola", there are some standard approaches:
- comparing the Levenshtein edit distance (http://en.wikipedia.org/wiki/Levenshtein_distance) of the two strings (see the sketch after this list)
- computing a phonetic hash of each string (e.g. Double Metaphone) and comparing the hashes.
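For concreteness, here is a minimal Levenshtein sketch in Python; the function name is my own, and any library implementation would do the same job:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("coca-cola", "koca-cola"))  # 1
```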
However, I can't find information about a standard and efficient way of doing this for substrings. For example, given the input "tell me about coca-kola" (the 'haystack'), you want to pick out the company "Coca-Cola" (the 'needle').
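To make that concrete, the brute-force version of the substring problem would look roughly like the sketch below. It reuses the `levenshtein` function above, and the needle list is just a stand-in for the real database:

```python
NEEDLES = ["Coca-Cola", "Pepsi", "Fanta"]  # in practice, millions of company names

def naive_find(haystack: str, max_distance: int = 1):
    """Compare every haystack word against every needle: O(words x needles) edit-distance calls."""
    matches = []
    for word in haystack.lower().split():
        for needle in NEEDLES:
            if levenshtein(word, needle.lower()) <= max_distance:
                matches.append((word, needle))
    return matches

print(naive_find("tell me about coca-kola"))  # [('coca-kola', 'Coca-Cola')]
```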
You can't use a modified Levenshtein algorithm like this, because you may have millions of needles (companies in your DB), and running an edit-distance comparison against each one would be too resource-intensive. Potentially you could calculate a phonetic hash of each word in the haystack and compare it with the hash of each needle, but phonetic representations also have lots of limitations. Is there a well-established standard for handling this problem that doesn't rely on phonetics?
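For illustration, the per-word phonetic lookup I have in mind would be something like the following. Here `toy_phonetic_key` is a deliberately crude stand-in for Double Metaphone (not the real algorithm), and it shows exactly the kind of limitations mentioned above, e.g. multi-word or badly misspelled needles slip through:

```python
import re
from collections import defaultdict

def toy_phonetic_key(word: str) -> str:
    """Toy phonetic key (a crude stand-in for Double Metaphone):
    normalise a few consonants, drop vowels after the first letter, collapse repeats."""
    w = re.sub(r"[^a-z]", "", word.lower())
    w = w.replace("ph", "f").replace("c", "k").replace("q", "k").replace("z", "s")
    if not w:
        return ""
    key = w[0] + re.sub(r"[aeiouy]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)  # collapse repeated letters

# Build the index once over all needles (company names).
PHONETIC_INDEX = defaultdict(list)
for company in ["Coca-Cola", "Pepsi", "Fanta"]:  # millions in practice
    PHONETIC_INDEX[toy_phonetic_key(company)].append(company)

def phonetic_lookup(haystack: str):
    """Hash each haystack word and look it up in the needle index."""
    hits = []
    for word in haystack.split():
        hits.extend(PHONETIC_INDEX.get(toy_phonetic_key(word), []))
    return hits

print(phonetic_lookup("tell me about coca-kola"))  # ['Coca-Cola']
```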
I am looking for a simple, easy-to-understand algorithm that scales well. Similar questions have already been posted, with answers suggesting e.g. the Bitap algorithm, but like Levenshtein that doesn't appear to scale to millions of needles.