
I am busy writing a simple algorithm to fuzzy-match addresses from two datasets. I calculate the Levenshtein distance between two addresses and then add the exact match, or the match with the shortest distance, to a matched array.

However, this is very slow, as in the worst case it has to compare each old address to each new address.

My current solution is as follows:

$matches = [];
foreach ($classifications as $classification)
{
    $classification = $stringMatchingService->standardize($classification, $stringMatchingService->isClassification());
    $shortest = -1;
    $closest = '';
    foreach ($lines as $line)
    {
        $line = $stringMatchingService->standardize($line, $stringMatchingService->isRTT());
        // Only compare entries that share a postcode
        if ($classification[CLASSIFICATION_POSTCODE] != $line[RTT_POSTCODE]) {
            continue;
        }

        $lev = levenshtein($classification[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);
        // Exact match: record it and move on to the next classification
        if ($lev == 0) {
            $matches[$classification[CLASSIFICATION_SUBURB]] = $line[RTT_SUBURB];
            continue 2;
        }

        // Track the closest (lowest-distance) match seen so far
        if ($lev <= $shortest || $shortest < 0) {
            $closest = $line[RTT_SUBURB];
            $shortest = $lev;
        }
    }

    // No exact match found: fall back to the closest candidate, if any
    if ($shortest >= 0) {
        $matches[$classification[CLASSIFICATION_SUBURB]] = $closest;
    }
}
print_r(count($matches));

Note that the standardize function simply attempts to standardize the addresses by removing irrelevant information and padding postcodes.

I am wondering how to speed this up, as at the moment it is very expensive, or alternatively whether there is a better approach to take.

Any help is appreciated,

Thanks!

EDIT: The size of $classifications is 12000 rows and the size of $lines is 17000 rows. The standardize function is as follows:

public function standardize($line, $dataSet)
{
    switch ($dataSet) {
        case self::CLASSIFICATIONS:
            // Skip rows with a missing suburb or postcode. (Note: `continue`
            // inside a switch acts like `break` in PHP, so an explicit
            // return is used here instead.)
            if (!isset($line[9], $line[10]) || empty($line[9]) || empty($line[10])) {
                return null;
            }
            $suburb = $line[9];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);

            $postCode = $line[10];
            $postCode = str_pad($postCode, 4, '0', STR_PAD_LEFT);
            $line[9] = $suburb;
            $line[10] = $postCode;
            return $line;
        case self::RTT:
            if (!isset($line[1], $line[0]) || empty($line[1]) || empty($line[0])) {
                return null;
            }
            $suburb = $line[1];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);

            $postCode = $line[0];
            $postCode = str_pad($postCode, 4, '0', STR_PAD_LEFT);
            $line[1] = $suburb;
            $line[0] = $postCode;
            return $line;
    }
}

It just aims to access the data appropriately, remove certain keywords, and pad the postcodes if they are not in the format XXXX.

liamjnorman

2 Answers


The problem here is that for each line in $classifications, you check every line in $lines: 12000 * 17000 = 204,000,000 comparisons in the worst case...

I don't know the structure of your arrays, but you could imagine using array_filter:

$matches = array_filter($classifications, function ($entry) use ($lines) {

    foreach ($lines as $line)
    {
        $lev = levenshtein($entry[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);

        // If this counts as a match (here: exact), keep the entry
        if ($lev == 0) {
            return true;
        }
    }

    return false;
});

$matches will be an array of matched lines.
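
The callback above still scans all of $lines for every classification, so the total work stays at 12000 * 17000 comparisons. One way to shrink the inner scan, sketched below under the assumption that both datasets are already standardized and use the postcode constants from the question (CLASSIFICATION_POSTCODE, RTT_POSTCODE), is to group $lines by postcode once and then only compare suburbs that share a postcode:

// Build a postcode lookup table once: a single pass over the 17000 lines.
$linesByPostcode = [];
foreach ($lines as $line) {
    $linesByPostcode[$line[RTT_POSTCODE]][] = $line;
}

foreach ($classifications as $classification) {
    // Each classification now only scans candidates sharing its postcode,
    // instead of all 17000 lines.
    $postcode = $classification[CLASSIFICATION_POSTCODE];
    $candidates = isset($linesByPostcode[$postcode]) ? $linesByPostcode[$postcode] : [];
    foreach ($candidates as $line) {
        $lev = levenshtein($classification[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);
        // ... same exact-match / closest-match logic as in the question
    }
}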

It depends on your data structure, but a better way may be to use array_merge coupled with array_unique, as in the sketch below.
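
A minimal sketch of that combination, using illustrative arrays of matched suburbs ($classificationMatches and $lineMatches are hypothetical names, not variables from the question):

// Illustrative inputs: suburbs matched from each dataset.
$classificationMatches = ['SANDTON', 'SOWETO', 'SANDTON'];
$lineMatches = ['SOWETO', 'MIDRAND'];

// Merge the two result sets, then drop the duplicates.
$allMatches = array_unique(array_merge($classificationMatches, $lineMatches));
// Values: SANDTON, SOWETO, MIDRAND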

ceadreak

What tolerance did you use for the Levenshtein distance algorithm? In my experience, a similarity threshold below 0.8 returns far too many false matches. I ended up using manual corrections for short words, such as raod = road; otherwise a single wrong character in a four-letter word makes it only a 75% match (see the sketch after this list). I found an article with 12 tests for finding addresses using fuzzy matching that could be useful for improving your algorithm. The examples include:

  1. Spelling Mistakes
  2. Missing Space
  3. Incorrect Type (Street vs Road)
  4. Bordering / Nearby Suburb
  5. Abbreviations
  6. Synonyms: Floor vs Level
  7. Unit, Flat or Apartment vs Letter
  8. Number vs Letter
  9. Extra Words (e.g. Front Door, Department Name)
  10. Swapped Letters
  11. Sounds Like
  12. Tokenisation (Different Input Order)
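
A minimal sketch of the manual-correction idea mentioned above, with an illustrative correction map and a hypothetical similarityPercent() helper built on levenshtein():

// Illustrative map of known short-word misspellings.
$corrections = ['RAOD' => 'ROAD', 'STERET' => 'STREET'];

// Hypothetical helper: similarity as a percentage of the longer string.
function similarityPercent($a, $b) {
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0;
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

$word = strtr(strtoupper('raod'), $corrections); // 'ROAD'

Note that plain levenshtein() counts the swapped letters in RAOD as two edits, so similarityPercent('RAOD', 'ROAD') is only 50.0; after the correction map is applied the score is 100.0.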
Strydom