I am building an address matching module in R, where I would like to match a list of input addresses, `inAddress`, against a database of all addresses, `dbAddress`.
Let's say each address contains a street number, street name, postal code, and city to be matched. There are certain matching rules I would like to apply, for example:
- The postal code should be an exact match.
- The street number should be an exact match; if none is found, fall back to fuzzy matching.
- The street name should in general be a fuzzy match, perhaps giving priority to an exact match on the first word when the full name is not found (so that the search stays within similar results such as Washington Avenue, Washington Street, Washington Rd., etc.).
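To make the street name rule concrete, here is a minimal sketch of how I imagine scoring candidates, using only base R's `adist()` (generalized Levenshtein distance). The helper name `street_name_score`, the lower-casing, and the normalization by the longer string length are my own assumptions for illustration, not an established method.

```r
# Hypothetical helper: score one input street name against candidate names.
street_name_score <- function(in_name, db_names) {
  in_name  <- tolower(trimws(in_name))
  db_names <- tolower(trimws(db_names))

  # Exact match on the first word only (e.g. "washington")
  first_word <- function(x) sub(" .*$", "", x)
  same_first <- first_word(db_names) == first_word(in_name)

  # Edit distance divided by the longer string length; 0 means identical
  d <- as.vector(adist(in_name, db_names)) /
    pmax(nchar(in_name), nchar(db_names))

  data.frame(db_name = db_names, same_first = same_first, dist = d,
             stringsAsFactors = FALSE)
}

res <- street_name_score(
  "Washington Ave",
  c("Washington Avenue", "Washington Street", "Wellington Avenue")
)
# Candidates sharing the first word, closest name first
res[res$same_first, ][order(res$dist[res$same_first]), ]
```

The `same_first` flag could be used either as a hard filter or just as a tie-breaker; I have not decided which is better.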
Do you have any advice on the strategy and how to build it effectively? Here are several of my thoughts so far:
- Put the two lists of addresses into data tables, perhaps with indexes to aid performance?
- Search first on postal code with a hard `if`, and keep only the addresses with a postal code match.
- Then cascade the result to a fuzzy match on street name. Perhaps normalize the name first so that only the keyword remains (stem and remove avenue, street, of, etc.). But I'm afraid that loses information: W Avenue is different from W Street.
- Cascade the result again to a fuzzy match on street number.
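The cascade I have in mind would look roughly like the sketch below, in base R only. The toy data, the column names (`in_id`, `db_id`, `postal`, `number`, `street`), and the tie-breaking order are assumptions I made up for illustration; a real version would also need a maximum distance cutoff to reject poor matches.

```r
# Toy data; all column names are assumptions for illustration
inAddress <- data.frame(
  in_id  = 1:2,
  postal = c("10001", "20002"),
  number = c("12", "7"),
  street = c("Washington Ave", "Main Stret"),
  stringsAsFactors = FALSE
)
dbAddress <- data.frame(
  db_id  = 1:4,
  postal = c("10001", "10001", "20002", "30003"),
  number = c("12", "12", "7", "7"),
  street = c("Washington Avenue", "Washington Street",
             "Main Street", "Main Street"),
  stringsAsFactors = FALSE
)

# Step 1: exact postal code match, done as a join over all inputs at once
cand <- merge(inAddress, dbAddress, by = "postal",
              suffixes = c("_in", "_db"))

# Step 2: edit distance between street names, one value per candidate pair
cand$name_dist <- mapply(
  function(a, b) as.vector(adist(tolower(a), tolower(b))),
  cand$street_in, cand$street_db,
  USE.NAMES = FALSE
)

# Step 3: flag exact street number matches
cand$number_exact <- cand$number_in == cand$number_db

# Keep one candidate per input: exact number first, then closest name
cand <- cand[order(cand$in_id, !cand$number_exact, cand$name_dist), ]
best <- cand[!duplicated(cand$in_id),
             c("in_id", "db_id", "name_dist", "number_exact")]
best
```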
I am concerned this will hurt performance badly. Also, is there a way to speed up matching many addresses at the same time? Perhaps join on postal code first to avoid a full search each time? Parallelism?
Any advice would be welcome. Thank you.
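For the performance side, one idea I am considering is to split both tables by postal code, compute a distance matrix within each block, and process the blocks in parallel with `mclapply()` from the `parallel` package that ships with R. The sketch below is only a guess at how that could look; the toy data, the `match_block` helper, and the choice of two cores are assumptions, and `mclapply()` cannot fork on Windows, so it falls back to one core there.

```r
library(parallel)  # ships with R

# Toy data; column names are assumptions for illustration
inAddress <- data.frame(
  in_id  = 1:3,
  postal = c("10001", "20002", "10001"),
  street = c("Washington Ave", "Main Stret", "Washington Str"),
  stringsAsFactors = FALSE
)
dbAddress <- data.frame(
  db_id  = 1:3,
  postal = c("10001", "10001", "20002"),
  street = c("Washington Avenue", "Washington Street", "Main Street"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: best street-name match within one postal code block
match_block <- function(in_block, db_block) {
  # Rows are inputs, columns are database candidates
  d <- adist(tolower(in_block$street), tolower(db_block$street))
  j <- apply(d, 1, which.min)
  data.frame(
    in_id     = in_block$in_id,
    db_id     = db_block$db_id[j],
    name_dist = d[cbind(seq_along(j), j)]
  )
}

# Blocking: only compare addresses that share a postal code
in_blocks <- split(inAddress, inAddress$postal)
db_blocks <- split(dbAddress, dbAddress$postal)
shared    <- intersect(names(in_blocks), names(db_blocks))

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(
  shared,
  function(p) match_block(in_blocks[[p]], db_blocks[[p]]),
  mc.cores = n_cores
)
matches <- do.call(rbind, res)
matches <- matches[order(matches$in_id), ]
matches
```

I do not know whether the forking overhead pays off for small blocks, so this would need benchmarking on realistic data.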