
I'm searching for a "good" / easy metric to recognize similar places / user input, so that I can avoid creating duplicates.

Levenshtein distance works well for typos like

bakery

bekerry

(Levenshtein distance: 2)

But "fails" for swapped words

St Ursula School

School St. Ursula

(Levenshtein distance: 17)

and additions

Serious Business

Serious Business Incorporated
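
For reference, here is a minimal sketch of how such distances could be computed in plain Python. The function name `levenshtein` is my own for illustration and not taken from any library; it is the textbook dynamic-programming edit distance with unit costs for insertions, deletions and substitutions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


examples = [
    ("bakery", "bekerry"),
    ("St Ursula School", "School St. Ursula"),
    ("Serious Business", "Serious Business Incorporated"),
]
for left, right in examples:
    print(f"{left!r} vs {right!r}: {levenshtein(left, right)}")
```

The typo pair stays small, while the swapped and extended names come out large even though a human would treat them as the same place.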

Tobias
Strikes me that you are trying to work out what the place names mean. You probably need a simple parser to read the names. In real life, "small street, SE1" and "small street, E1" are often confused. I wouldn't expect an automated process to be perfect – Vorsprung Feb 03 '16 at 16:26

1 Answer


I think using the raw distance metric on its own will be hard. You probably want to use some NLP methods (e.g. NLTK) to do NER (named entity recognition), and then compare the results instead of the raw strings.
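
As a rough illustration of the "process first, then compare" idea, here is a much simpler stand-in. It is not NER and does not use NLTK: it only lowercases the names, splits them into word tokens, and compares the token sets. The helper names `tokens` and `token_overlap` are hypothetical, and the overlap coefficient is just one possible choice of score.

```python
import re


def tokens(name: str) -> frozenset:
    """Lowercase the name and keep its alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", name.lower()))


def token_overlap(a: str, b: str) -> float:
    """Shared tokens divided by the size of the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


print(token_overlap("St Ursula School", "School St. Ursula"))  # 1.0
print(token_overlap("Serious Business", "Serious Business Incorporated"))  # 1.0
print(token_overlap("bakery", "bekerry"))  # 0.0
```

Word order and extra words no longer matter here, but typos are missed completely, so a score like this would probably need to be combined with a character-level distance on the individual tokens.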

Gang Su