I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo") they have a rather small Levenstein distance.
*I am looking for a string distance algorithm that takes into account the position of the character to minimize the effect of prefixes and suffixes by weighting letters in the middle of the word higher than letters at the ends of the word. *
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.