Robust non-phonetic non-intensive fuzzy substring match

Question

If you are writing code to fuzzily match two strings, e.g. "coca-cola" vs. "koca-cola", there are some standard ways of doing it:

- comparing the Levenshtein edit distance (http://en.wikipedia.org/wiki/Levenshtein_distance)
- computing a phonetic hash of each string (e.g. Double Metaphone) and comparing the hashes.
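
For reference, this is roughly what I mean by the whole-string Levenshtein comparison. It is only a minimal sketch of the textbook dynamic-programming version; the function name `levenshtein` is my own and not taken from any library.

```python
def levenshtein(a, b):
    """Edit distance between two whole strings (insert/delete/substitute, cost 1 each)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein("coca-cola", "koca-cola"))  # 1
```

This works fine when both strings are known up front, which is exactly what the substring case does not give you.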

However, I can't find information about a standard and efficient way of doing this for substrings. E.g. for the input "tell me about coca-kola" (the 'haystack'), you want to pick up the company "Coca-Cola" (the 'needle').

You can't use a modified Levenshtein algorithm because you may have millions of needles (companies in your DB), and running it against every one of them would be too resource-intensive. You could potentially calculate a phonetic hash of each word in the haystack and compare it with each needle, but phonetic representations also have many limitations. Is there a well-established standard for handling this problem that doesn't use phonetics?
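
To make the cost concrete, here is a sketch of the kind of "modified Levenshtein" I have in mind: the usual table, except that the first row is all zeros so a match may start anywhere in the haystack. The names `fuzzy_substring_distance` and `matching_needles`, the `max_distance` threshold, and the lower-casing step are my own assumptions for illustration only.

```python
def fuzzy_substring_distance(needle, haystack):
    """Smallest edit distance between needle and any substring of haystack."""
    # Row 0 is all zeros: starting the match at any haystack position is free.
    prev = [0] * (len(haystack) + 1)
    for i, nc in enumerate(needle, start=1):
        curr = [i]  # matching needle[:i] against nothing costs i edits
        for j, hc in enumerate(haystack, start=1):
            cost = 0 if nc == hc else 1
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + cost))
        prev = curr
    # The match may also end anywhere, so take the best cell in the last row.
    return min(prev)


def matching_needles(haystack, needles, max_distance=1):
    """Brute force: one full table per needle, O(len(needle) * len(haystack)) each."""
    h = haystack.lower()
    return [n for n in needles
            if fuzzy_substring_distance(n.lower(), h) <= max_distance]


print(matching_needles("tell me about coca-kola", ["Coca-Cola", "Pepsi", "Fanta"]))
# ['Coca-Cola']
```

This is correct for a handful of needles, but the loop in `matching_needles` runs once per company, which is the part that does not scale to millions of them.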

I am looking for a simple, easy-to-understand algorithm that scales well. Similar questions have already been posted, with answers suggesting e.g. the Bitap algorithm, but like Levenshtein this doesn't appear to scale to millions of needles.

Welcome to Stack Overflow! Are you looking for [Named-Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition)? — arturomp, Oct 09 '13 at 16:06
Thank you, yes now I have tagged it with Named Entity Recognition as I think it also falls under that category - however the question is relating to one specific problem, which is recognising named entities when the entities could be a fuzzy match, and I can't find a good efficient technique in this case. All the information I can find is about recognising entities by context, or by content when there is an exact match, or fuzzy *string* matches instead of fuzzy *substring* matches. — Tom, Oct 10 '13 at 08:30

0 Answers