I need to compare two unstructured addresses and be able to identify if they are the same (or similar enough).

Scenario

  • Address is supplied by the end user in plain text.
  • Nothing helps the user write the address in a more recognizable form (no autocomplete, just an empty textbox).
  • "#102 Nice-Looking Street, Gotham City, NY" should match "Nice Loking St., Gotham City, New York, apt 102".
  • Using a third-party service is not an option.
  • Search is not the problem. I already have the two strings. What I need is to check whether they represent the same address despite their differences in structure.

What I have found

I know we can use fuzzy string matching for this kind of comparison, with some tolerance for misspelling, but...

  • Some keywords are equivalent (for instance "Street" and "St.", "#102" and "apt 102", or "NY" and "New York"), and such differences should not lower the match score.
  • Some words can appear in a different order (like the apartment number in the example above).
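To make these two constraints concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. The substitution tables (`PHRASES`, `TOKENS`), the rule that "#" means "apt", and the function names are all my own assumptions for illustration; a real table would be far larger. Tokens are compared without regard to order, and each token is scored against its best fuzzy match on the other side.

```python
import re
from difflib import SequenceMatcher

# Hypothetical substitution tables; real ones would be much larger.
PHRASES = {"new york": "ny"}      # multi-word forms mapped to a canonical token
TOKENS = {"st": "street"}         # single-token abbreviations


def normalize(address):
    """Lowercase, expand known equivalents, and return a list of tokens."""
    text = address.lower().replace("#", " apt ")   # assumption: "#102" means "apt 102"
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # punctuation becomes whitespace
    for phrase, canonical in PHRASES.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", canonical, text)
    return [TOKENS.get(token, token) for token in text.split()]


def address_similarity(left, right):
    """Order-insensitive score in [0, 1]; 1.0 means every token matched exactly."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return 0.0

    def directed(src, dst):
        # Average, over src tokens, of the best fuzzy match found in dst.
        best = (max(SequenceMatcher(None, s, d).ratio() for d in dst) for s in src)
        return sum(best) / len(src)

    # Take the weaker direction so extra tokens on either side cost something.
    return min(directed(a, b), directed(b, a))


if __name__ == "__main__":
    print(address_similarity(
        "#102 Nice-Looking Street, Gotham City, NY",
        "Nice Loking St., Gotham City, New York, apt 102",
    ))
```

With these tables the two example addresses score close to 1.0, because only the "Looking"/"Loking" typo costs anything. The obvious weakness is that every token has the same weight, so "apt 102" versus "apt 103" is barely penalized.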

I do not want to reinvent the wheel. This seems like a common problem in many contexts, and I expect there is an existing algorithm (perhaps with slight modifications) that fits this scenario.

Thanks in advance

Minduca
    Well, you could pass both addresses to Google or another mapping API, get back the co-ordinates of where it thinks each address is and then do some maths to find out how far apart they are, but this is using a 3rd party API. Other than that, the fact that they are addresses is almost irrelevant - it's just a fuzzy string-matching problem, simplified slightly by using common substitutions as you mentioned, which you could store in a database of some sort (Street/St, Washington/D.C./DC, New York/NY etc) – Steve Ives May 27 '16 at 07:24

1 Answer

I've helped build some open source tools to do this.

Basically, the approach is to try to split an address into its constituent parts and then intelligently compare those parts.

Both parts of the problem are hard.

The first part is often called address parsing. Here's what we use: https://github.com/datamade/usaddress

The second part has many, many names, but let's call it fuzzy matching. Here's the library we made for that: https://github.com/datamade/dedupe

We also provided some facilities for using them together: http://dedupe.readthedocs.io/en/latest/Variable-definition.html#address-type
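To make the split-then-compare idea concrete, here is a toy sketch that uses only the Python standard library. It is not how usaddress or dedupe work internally: the regular expression, the two-field split, the function names, and the 0.75 threshold are all assumptions made up for illustration. The point is only that once the unit number is pulled out, it can be compared exactly while the rest of the address is compared fuzzily.

```python
import re
from difflib import SequenceMatcher

# Toy stand-in for a real address parser; this pattern is an illustrative assumption.
UNIT_RE = re.compile(r"(?:#|\bapt\.?\s*|\bunit\s*)(\d+)", re.IGNORECASE)


def split_address(address):
    """Split an address into a unit number and the remaining free text."""
    match = UNIT_RE.search(address)
    unit = match.group(1) if match else ""
    rest = UNIT_RE.sub(" ", address)                       # drop the unit from the text
    rest = re.sub(r"[^a-z0-9]+", " ", rest.lower()).strip()  # keep letters and digits only
    return {"unit": unit, "rest": rest}


def same_address(left, right, threshold=0.75):
    """Unit numbers must match exactly; the remaining text may match fuzzily."""
    a, b = split_address(left), split_address(right)
    if a["unit"] != b["unit"]:
        return False
    score = SequenceMatcher(None, a["rest"], b["rest"]).ratio()
    return score >= threshold   # threshold is an arbitrary, untuned assumption
```

With this split, "#102 Nice-Looking Street, Gotham City, NY" and "Nice Loking St., Gotham City, New York, apt 102" are treated as the same address, while changing "#102" to "#103" makes the comparison fail outright, something a whole-string fuzzy score would barely notice. A real parser separates many more fields (street name, street type, city, state, and so on), and a real matcher weighs each field rather than relying on one fixed threshold.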

fgregg