2

I have a list of 500,000 accounts each with an address.

What I need to do is clean up the addresses so I can find out which accounts falls in the same address. I understand there are going to be some errors but I would need to do the best I can.

Is there a service I can use where it can validate the address or it would format the address to a standard format?

for example

RD->ROAD
ST->Street
STE->SUITE
.....
....

I don't know all the combinations. Is there a script that I can run that will update the addresses?

I know I can use MySQL REPLACE() function to replace RD with ROAD but what if the steet name itself contact the words 'RD' that will cause an issue? if I was to do a replace it would have to be a word replace not a standard string replace.

I could write a PHP script that will update the addresses (something like below) but I would need to know all possible combinations.

What is the best approach for this problem? How can I find out all/most the combinations that I would have to check for?

<?php

$arr = explode(" ", $row['address']);
$clean = array();
foreach($arr AS $key=>$val){

if($val == 'RD')
$new = 'ROAD';
else if ($val == 'STE')
$new = 'SUITE';
else 
$new = $val;

$clean[] = $new;
}
?>
Jaylen
  • 39,043
  • 40
  • 128
  • 221

1 Answers1

4

This is actually really hard (trust me, I know -- I've been writing scripts like this for years at SmartyStreets). It's even harder when the entire address is a single string. What you need is an address verification service that supports freeform parsing. (SmartyStreets does.)

Google Maps API's Terms of Service forbids using it for bulk amounts of data or storing the results for reasons other than temporary caching. Further, it does not verify addresses -- an address can be invalid and still return a geocode with that API.

You can do your own research to find a service that meets your needs. I'm biased, but I believe SmartyStreets will meet your needs most economically. The API is guaranteed by an SLA to be up, it's easy to use (just make a POST request with up to 100 addresses), and you can store the data as long as you need (though you should probably check for data updates every 90-120 days, depending on how rigorously correct you need your addresses to be).

Services that are CASS-Certified can also standardize addresses into the proper format, making it trivial to de-duplicate your data (i.e. merge the same addresses that are represented differently).

Matt
  • 22,721
  • 17
  • 71
  • 112