I am building an address-matching module in R. I would like to match a list of input addresses, inAddress, against a database of all addresses, dbAddress.

Let's say each address contains a street number, street name, postal code and city. There are certain matching rules I would like to apply, for example:

  • postal code should be an exact match

  • street number should be an exact match; if none is found, fall back to fuzzy matching

  • street name should generally be a fuzzy match, perhaps giving priority to an exact match on the first word when the full name is not found (so that the search covers similar results such as Washington avenue, Washington street, Washington Rd. etc.)

Do you have any advice on the strategy and how to build it effectively? Here are several of my thoughts so far:

  • Put the two lists of addresses in data tables. Perhaps add indexes to aid performance?
  • Filter first on postal code with a hard (exact) condition, keeping only the rows whose postal code matches
  • Then cascade the result to a fuzzy match on street name. Perhaps normalize the name first so that only the keyword remains (stem it and remove avenue, street, of, etc.). But I'm afraid that loses information: W Avenue is different from W Street.
  • Cascade the result again to a fuzzy match on street number.

I am concerned this will hurt performance badly. Also, is there a way to speed up matching many addresses at the same time? Perhaps join on postal code first to avoid a full search each time? Parallelism?
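To make the cascade concrete, here is a rough sketch of what I have in mind, in base R only. The column names (number, street, postal, id, in_id) and the toy rows are made up for illustration, and the street score is a plain normalised Levenshtein similarity from adist(); the real module may well need something else.

```r
# Toy data; every column name and value here is an assumption for illustration.
dbAddress <- data.frame(
  id     = 1:4,
  number = c("12", "12", "7", "12"),
  street = c("washington avenue", "washington street",
             "main street", "washington avenue"),
  postal = c("1000", "1000", "1000", "2000"),
  stringsAsFactors = FALSE
)
inAddress <- data.frame(
  in_id  = 1:2,
  number = c("12", "7"),
  street = c("washingtn avenue", "main stret"),
  postal = c("1000", "1000"),
  stringsAsFactors = FALSE
)

# Step 1: exact join on postal code, so each input address is only
# compared with database rows that share its postal code.
cand <- merge(inAddress, dbAddress, by = "postal", suffixes = c(".in", ".db"))

# Step 2: fuzzy street score per candidate pair
# (1 - Levenshtein distance / length of the longer string).
d <- mapply(function(a, b) adist(a, b)[1, 1],
            cand$street.in, cand$street.db, USE.NAMES = FALSE)
cand$street_sim <- 1 - d / pmax(nchar(cand$street.in), nchar(cand$street.db))

# Step 3: rank exact street-number matches first, then by street score,
# and keep the top candidate for each input address.
cand$number_ok <- cand$number.in == cand$number.db
cand <- cand[order(cand$in_id, -cand$number_ok, -cand$street_sim), ]
best <- cand[!duplicated(cand$in_id), c("in_id", "id", "street_sim")]
print(best)
```

Because the join is done once for the whole input list, the fuzzy comparison only runs on same-postcode pairs rather than on the full database for every input.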

Any advice would be welcome. Thank you

Kenny
  • Think about which values are more likely to have data entry errors and which are not. Primary number is on the far left and is the first thing that is entered and almost never changes. Count on that one highly. Street names are frequently misspelled. Counting on the full street name is less reliable - first couple of letters, sure. City + State and zipcode are synonyms. A zipcode can represent various city+state combinations (and the zipcode is subject to change by the USPS). I would recommend a simple search of primary number, partial street, city+state. – Jeffrey Oct 27 '17 at 20:04
  • @Jeffrey : what would be your search order ? Number does not seem to limit our search result at first step, zipcode however does the job. Partial street will miss Ave, Street, boulevard, etc. How would you clean/match the Street name ? – Kenny Oct 30 '17 at 11:52
  • Can you contact me at support@smartystreets.com - I'd be happy to discuss this further but it would be helpful to get on a phone call, I think. I would recommend house number and zipcode first, followed by one, two, or three characters of the street name. Where are you obtaining a master address list to index? – Jeffrey Oct 30 '17 at 17:43

1 Answer

Levenshtein distance is a must for catching simple spelling mistakes. Finding the right tolerance is important: a similarity threshold below 0.8 would return too many false positives.
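As a minimal sketch of that threshold, base R's adist() returns the Levenshtein distance, which can be rescaled to a similarity between 0 and 1. The helper name lev_sim and the sample street names are made up for illustration, and dividing by the longer string's length is only one of several possible normalisations.

```r
# lev_sim() is a hypothetical helper, not part of any package.
# It returns 1 for identical strings and approaches 0 as they diverge.
lev_sim <- function(query, candidates) {
  query <- tolower(query)
  candidates <- tolower(candidates)
  d <- as.vector(adist(query, candidates))         # one distance per candidate
  longest <- pmax(nchar(query), nchar(candidates))
  ifelse(longest == 0, 1, 1 - d / longest)
}

candidates <- c("Washington Avenue", "Washington Street", "Wellington Avenue")
sims <- lev_sim("Washingtn Avenue", candidates)

# Keep only candidates at or above the 0.8 similarity threshold.
matches <- candidates[sims >= 0.8]
print(matches)
```

Here only "Washington Avenue" passes: it is one edit away from the query, while the other two candidates fall below 0.8.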

I’d also recommend using a dictionary of short words and their common misspellings that you can correct, such as raod to road or stret to street.

You may also want to handle abbreviations. Ave and Avenue start with the same characters, whereas Rd drops characters from the middle of Road, so the two cases need different matching rules. Once again, a dictionary could help.
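A dictionary like this could be sketched as a named character vector applied token by token. The entries and the function name normalise_street below are assumptions for illustration only; a real table would be much longer and would have to deal with ambiguous cases (for example "st" can also mean "saint").

```r
# Hypothetical lookup table: misspelling or abbreviation -> canonical word.
street_dict <- c(
  raod  = "road",   rd = "road",
  stret = "street", st = "street",
  ave   = "avenue", av = "avenue"
)

normalise_street <- function(x, dict = street_dict) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)                  # "Rd." becomes "rd"
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(tokens, function(tok) {
    hit <- tok %in% names(dict)                    # tokens found in the dictionary
    tok[hit] <- unname(dict[tok[hit]])             # swap in the canonical word
    paste(tok, collapse = " ")
  }, character(1))
}

cleaned <- normalise_street(c("Washington Rd.", "Washington Ave", "Main Stret"))
print(cleaned)
```

Running this before the fuzzy comparison means the edit distance is spent on the street name itself rather than on the street type.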

This article contains 12 tests to find addresses using fuzzy matching that could be useful for improving your algorithm. Many of these examples Google can’t even match!

The examples include:

  1. Spelling Mistakes
  2. Missing Space
  3. Incorrect Type (Street vs Road)
  4. Bordering / Nearby Suburb
  5. Abbreviations
  6. Synonyms: Floor vs Level
  7. Unit, Flat or Apartment vs Letter
  8. Number vs Letter
  9. Extra Words (e.g. Front Door, Department Name)
  10. Swapped Letters
  11. Sounds Like
  12. Tokenisation (Different Input Order)
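For the last item, one simple approach (my own sketch, not necessarily what the article does) is to compare addresses on a key built from their sorted tokens, so that word order no longer matters. The function name token_key and the sample addresses are made up for illustration.

```r
# token_key() is a hypothetical helper: lower-case, split on spaces and
# commas, sort the tokens, and glue them back together.
token_key <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:],]+")
  vapply(tokens,
         function(tok) paste(sort(tok[nzchar(tok)]), collapse = " "),
         character(1))
}

a <- "Auckland, 12 Queen Street"
b <- "12 Queen Street Auckland"

# Both inputs contain the same tokens, so their keys are equal.
same_tokens <- token_key(a) == token_key(b)
print(same_tokens)
```

The sorted keys can then be fed to a Levenshtein comparison, so that spelling mistakes and reordered input are tolerated together.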

After looking at several commercial address autocomplete widgets, this one (https://www.addy.co.nz/address-finder-fuzzy-matching) is by far the smartest for New Zealand addresses. Perhaps you can get inspiration and come up with an even better algorithm!

Strydom