If you are writing code to fuzzily match two strings, e.g. "coca-cola" vs. "koca-cola", there are some standard ways of doing it:

  1. comparing the Levenshtein edit distance between the two strings (http://en.wikipedia.org/wiki/Levenshtein_distance)
  2. computing a phonetic hash of each string (e.g. Double Metaphone) and comparing the hashes.
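
To make option 1 concrete, here is a minimal sketch of the edit distance in Python. It is my own illustration of the textbook dynamic-programming approach, not code from any particular library, and the function name `levenshtein` is just a label chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("coca-cola", "koca-cola"))  # 1
```

Each call costs time proportional to the product of the two string lengths, which is fine for a single pair of strings.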

However, I can't find information about a standard, efficient way of doing this for substrings. For example, given the input "tell me about coca-kola" (the 'haystack'), you want to pick up the company "Coca-Cola" (the 'needle').

Running a modified Levenshtein algorithm against every needle isn't practical, because you may have millions of needles (companies in your DB) and that would be too resource-intensive. Potentially you could calculate a phonetic hash of each word in the haystack and compare it with each needle, but the phonetic representation also has lots of limitations. Is there a well-established standard for handling this problem that doesn't use phonetics?
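
A rough sketch of the phonetic-hash idea, to show what I mean. Everything here is an assumption made for illustration: `crude_phonetic_key` is a made-up, much cruder stand-in for Double Metaphone, the needle list is invented, and the lookup only handles needles that appear as a single whitespace-separated word in the haystack.

```python
from collections import defaultdict


def crude_phonetic_key(word: str) -> str:
    """Hypothetical stand-in for a real phonetic hash such as Double Metaphone."""
    letters = [ch for ch in word.lower() if ch.isalpha()]  # drops '-', digits, etc.
    if not letters:
        return ""
    mapped = ["k" if ch in "cq" else ch for ch in letters]  # treat c/q like k
    head, tail = mapped[0], mapped[1:]
    # Keep the first letter, then only the consonants that follow it.
    return head + "".join(ch for ch in tail if ch not in "aeiou")


def build_index(needles):
    """Map each phonetic key to the needles that share it (built once, up front)."""
    index = defaultdict(list)
    for needle in needles:
        index[crude_phonetic_key(needle)].append(needle)
    return index


def find_candidates(haystack, index):
    """Look up each haystack word's key in the index instead of scanning all needles."""
    hits = []
    for word in haystack.split():
        key = crude_phonetic_key(word)
        if not key:
            continue  # skip tokens with no letters
        for needle in index.get(key, []):
            hits.append((word, needle))
    return hits


index = build_index(["Coca-Cola", "Pepsi", "Microsoft"])
print(find_candidates("tell me about coca-kola", index))
# [('coca-kola', 'Coca-Cola')]
```

The dictionary lookup keeps the per-word cost independent of the number of needles, but the result is only as good as the key function, which is exactly the limitation I would like to avoid.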

I am looking for a simple, easy-to-understand algorithm that scales well. Similar questions have already been posted, with answers suggesting e.g. the Bitap algorithm, but like Levenshtein this doesn't appear to scale to a large number of needles.

Tom
  • Welcome to Stack Overflow! Are you looking for [Named-Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition)? – arturomp Oct 09 '13 at 16:06
  • Thank you. Yes, I have now tagged it with Named Entity Recognition, as I think it also falls under that category. However, the question relates to one specific problem: recognising named entities when the entity may only be a fuzzy match. I can't find a good, efficient technique for this case. All the information I can find is about recognising entities by context, or by content when there is an exact match, or about fuzzy *string* matches instead of fuzzy *substring* matches. – Tom Oct 10 '13 at 08:30

0 Answers