6

Algorithms for edit distance give a measure of the distance between two strings.

Question: which of these measures would be most relevant for detecting two different person names that are actually the same (different only because of a misspelling)? The trick is that it should minimize false positives. Example:

Obaama Obama => should probably be merged

Obama Ibama => should not be merged.

This is just an oversimplified example. Are there programmers and computer scientists who have worked out this issue in more detail?

hippietrail
seinecle
  • Are you looking for an NLP method, or would an information-retrieval solution fit? Can you describe where you are going to use it, and how common the words you are looking for are? Do you have a collection of documents you can use to "learn" patterns? – amit Aug 12 '12 at 07:51
  • You need extra data (e.g. some stats on misspellings) in order to "minimize false positives" because in order to minimize the error you need to know what an error is and is not. – Alexey Frunze Aug 12 '12 at 08:17
  • Soundex was developed for this sort of usage. It's very specific to English, though (but so is your problem). – tripleee Aug 12 '12 at 09:51

2 Answers

5

I can suggest an information-retrieval technique for doing so, but it requires a large collection of documents in order to work properly.

Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.

Once you get a name (Obaama, for example), retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w that occurs in the documents of D1 [1], search for Obaama AND w (using your IR system). Let the resulting set be D2.

The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it will most likely be close to 1 for w=Obama [2].
You can manually label a set of examples and use them to find the threshold value above which words should be treated as matches.
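The scoring step above can be sketched as follows. This is only an illustration under stated assumptions: a tiny in-memory inverted index stands in for a real IR system such as Lucene, tokenization is a plain whitespace split, and the function names (`build_index`, `cooccurrence_scores`) and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def cooccurrence_scores(index, docs, name):
    """Return {w: |D2|/|D1|} for every word w found in the documents of D1."""
    name = name.lower()
    d1 = index.get(name, set())  # D1: documents that contain the name
    if not d1:
        return {}
    # Candidate words are all words that appear in some document of D1.
    candidates = set()
    for doc_id in d1:
        candidates.update(docs[doc_id].lower().split())
    candidates.discard(name)
    # D2 for a word w is the set of documents containing both name and w.
    return {w: len(index[w] & d1) / len(d1) for w in candidates}


# Hypothetical toy collection, only for illustration.
docs = [
    "president obaama met barack obama supporters",
    "obaama obama speech",
    "obama visits berlin",
    "ibama protects forests in brazil",
]
index = build_index(docs)
scores = cooccurrence_scores(index, docs, "Obaama")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In this toy collection Ibama never shares a document with Obaama, so it is not even a candidate, while Obama gets the maximum score. A real collection would need far more documents for the ratio to be meaningful.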

Using a standard lexicographical similarity technique, you can choose to filter out words that are definitely not spelling mistakes (like Barack).
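As a rough sketch of such a filter, assuming Levenshtein edit distance as the similarity measure (the answer does not name a specific one) and an arbitrary cutoff of 2 edits; `lexically_close` and `max_distance` are names chosen for this example:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lexically_close(name, candidates, max_distance=2):
    """Keep only candidates within max_distance edits of name (cutoff is an assumption)."""
    return [w for w in candidates if levenshtein(name, w) <= max_distance]


print(lexically_close("obaama", ["obama", "barack", "speech"]))
```

Note that this filter alone cannot separate the two pairs from the question: both Obaama/Obama and Obama/Ibama are one edit apart, which is why it is combined with the co-occurrence score.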

Another commonly used solution requires a query log: look for correlations between searched words. If obaama correlates with obama in the query log, they are connected.


1: You can improve performance by applying the second filter first, so that you only check candidates that are "similar enough" lexicographically.

2: Usually a normalization is also applied, because more frequent words are more likely to appear in the same documents as any given word, whether or not they are related.
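The note above does not say which normalization to use. One possible choice, shown here purely as an assumption, is a Jaccard-style score that divides by the size of the union of the two document sets, so that a very frequent word no longer scores 1 just because it appears everywhere. The document-id sets below are made up for illustration.

```python
def raw_score(d_name, d_word):
    """Unnormalized score |D2|/|D1|, where D2 is the intersection of both sets."""
    return len(d_name & d_word) / len(d_name) if d_name else 0.0


def jaccard_score(d_name, d_word):
    """Intersection over union: penalizes words that occur in many other documents."""
    union = d_name | d_word
    return len(d_name & d_word) / len(union) if union else 0.0


# Hypothetical document-id sets.
d_obaama = {0, 1}
d_obama = {0, 1, 2}
d_the = set(range(100))  # a very frequent word that appears in every document

print(raw_score(d_obaama, d_the), jaccard_score(d_obaama, d_the))
print(raw_score(d_obaama, d_obama), jaccard_score(d_obaama, d_obama))
```

A side effect of this particular choice is that it also lowers the score when the correct spelling is much more frequent than the misspelling, so other normalizations may suit a real collection better.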

amit
2

You can check out NerSim (demo), which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Kenston Choi