I am working on a project that has a large number of names and addresses in its database: names such as "John K Smith" and "Joe Smith", and addresses such as "20 Theroad avenue" or "1345 Myplace st."

In this project, once user X enters the website, they enter a name and address along with other details; the entered name and address are checked against what already exists in the database. If they are similar enough to what is stored for user X, access is granted.

Instead of exact string matching, I need approximate string matching to make login more convenient. (I know this is a security concern, but there is also a username/password, which is matched exactly.)

I am looking for a string matching algorithm that is suitable for names and addresses and that also takes into account acronyms, short forms, and similar phrases, such as 'ave' vs 'avenue', 'mr' vs 'mr.', or 'street' vs 'avenue'.

So far I have looked at edit distance, Jaro-Winkler, n-gram (q-gram), cosine similarity, and phonetic approaches.
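To illustrate why these measures alone do not seem enough for short forms, here is a minimal sketch of two of them in plain Python. The function names `edit_distance`, `qgrams`, and `qgram_jaccard` are my own, not from any library, and the q-gram similarity shown is just one common variant (Jaccard over character bigrams).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:  # both strings shorter than q
        return 1.0
    return len(ga & gb) / len(union)
```

For example, `edit_distance("ave", "avenue")` is 3, which is large relative to a six-character word, so an abbreviation looks like a poor match unless it is expanded first.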

I thought a hybrid approach with a custom normalization function (one that does string replacement for short forms and similar terms) might be the way to go, but I am not certain yet.
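A rough sketch of the kind of hybrid I have in mind, in Python: normalize first, then score with a generic similarity measure. Everything here is an assumption for illustration. The `ABBREVIATIONS` table is a tiny hypothetical sample, the 0.85 threshold is arbitrary, and `difflib.SequenceMatcher` from the standard library simply stands in for whichever measure turns out to be most suitable.

```python
import re
from difflib import SequenceMatcher

# Hypothetical replacement table; a real one would be much larger
# and would need separate entries per language (Spanish, French).
ABBREVIATIONS = {
    "ave": "avenue",
    "st": "street",
    "rd": "road",
    "mr": "mister",
}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known short forms."""
    # \w is Unicode-aware for str in Python 3, so accented letters are kept.
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio in [0, 1], computed on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Accept the pair if the similarity reaches the (arbitrary) threshold."""
    return similarity(a, b) >= threshold
```

With this, "20 Theroad avenue" and "20 Theroad Ave." normalize to the same string, so they match exactly, while "John K Smith" and "Joe Smith" still differ after normalization and depend on where the threshold is set.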

This project eventually should work with other languages (Spanish and French), which may mean more custom text replacements.

Any help in finding the most suitable algorithm(s) for matching names and addresses with high accuracy (and a minimal number of false positives) is appreciated.

neutral_sphere
  • In R, you should consider a recent package "stringDist". It has implementation of several algorithms for approximate string matching. – raj_k Nov 02 '14 at 20:21
  • Package name is stringdist (not stringDist). Sorry. – raj_k Nov 02 '14 at 22:20