
I am busy writing a simple algorithm to fuzzy-match addresses from two datasets. I calculate the Levenshtein distance between two addresses and then add the exact match, or the match with the shortest distance, to a matched array.

However, this is very slow, as in the worst case it has to compare each old address to each new address.

My current solution is as follows:

$matches = [];
foreach ($classifications as $classification)
{
    $classification = $stringMatchingService->standardize($classification, $stringMatchingService->isClassification());
    $shortest = -1;
    $closest = '';
    foreach ($lines as $line)
    {
        $line = $stringMatchingService->standardize($line, $stringMatchingService->isRTT());
        // Only compare entries that share a postcode
        if ($classification[CLASSIFICATION_POSTCODE] != $line[RTT_POSTCODE]) {
            continue;
        }

        $lev = levenshtein($classification[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);
        // Exact match: record it and move on to the next classification
        if ($lev == 0) {
            $matches[$classification[CLASSIFICATION_SUBURB]] = $line[RTT_SUBURB];
            continue 2;
        }

        // Track the closest (lowest-distance) match seen so far
        if ($lev <= $shortest || $shortest < 0) {
            $closest = $line[RTT_SUBURB];
            $shortest = $lev;
        }
    }

    // No exact match found: fall back to the closest candidate, if any
    if ($shortest >= 0) {
        $matches[$classification[CLASSIFICATION_SUBURB]] = $closest;
    }
}
print_r(count($matches));

Note that the standardize function simply attempts to standardize the addresses by removing irrelevant information and padding postcodes.

I am wondering how to speed this up, as at the moment it is very expensive, or alternatively whether there is a better approach to take.

Any help is appreciated,

Thanks!

EDIT: The size of $classifications is 12000 rows and the size of $lines is 17000 rows. The standardize function is as follows:

public function standardize($line, $dataSet)
{
    switch ($dataSet) {
        case self::CLASSIFICATIONS:
            // Skip rows with a missing suburb or postcode. (Note: `continue`
            // inside a switch acts like `break` in PHP, so an explicit
            // return is used here instead.)
            if (!isset($line[9], $line[10]) || empty($line[9]) || empty($line[10])) {
                return null;
            }
            $suburb = $line[9];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);

            $postCode = $line[10];
            $postCode = str_pad($postCode, 4, '0', STR_PAD_LEFT);
            $line[9] = $suburb;
            $line[10] = $postCode;
            return $line;
        case self::RTT:
            if (!isset($line[1], $line[0]) || empty($line[1]) || empty($line[0])) {
                return null;
            }
            $suburb = $line[1];
            $suburb = strtoupper($suburb);
            $suburb = str_replace('EXT', '', $suburb);
            $suburb = str_replace('UIT', '', $suburb);
            $suburb = preg_replace('/[0-9]+/', '', $suburb);

            $postCode = $line[0];
            $postCode = str_pad($postCode, 4, '0', STR_PAD_LEFT);
            $line[1] = $suburb;
            $line[0] = $postCode;
            return $line;
    }
}

It just aims to access the data appropriately, remove certain keywords, and pad the postcodes if they are not in the format XXXX.

liamjnorman

2 Answers


The problem here is that for each line in $classifications, you check every line in $lines: 12000 * 17000 = 204,000,000 comparisons in the worst case...

I don't know the structure of your arrays, but you could imagine using array_filter:

$matches = array_filter($classifications, function ($entry) use ($lines) {

    foreach ($lines as $line)
    {
        $lev = levenshtein($entry[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);

        // If this counts as a match (here: exact), keep the entry
        if ($lev == 0) {
            return true;
        }
    }

    return false;
});

$matches will be an array of matched lines.
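
The callback above still scans all of $lines for every classification, so the total work stays at 12000 * 17000 comparisons. One way to shrink the inner scan, sketched below under the assumption that both datasets are already standardized and use the postcode constants from the question (CLASSIFICATION_POSTCODE, RTT_POSTCODE), is to group $lines by postcode once and then only compare suburbs that share a postcode:

// Build a postcode lookup table once: a single pass over the 17000 lines.
$linesByPostcode = [];
foreach ($lines as $line) {
    $linesByPostcode[$line[RTT_POSTCODE]][] = $line;
}

foreach ($classifications as $classification) {
    // Each classification now only scans candidates sharing its postcode,
    // instead of all 17000 lines.
    $postcode = $classification[CLASSIFICATION_POSTCODE];
    $candidates = isset($linesByPostcode[$postcode]) ? $linesByPostcode[$postcode] : [];
    foreach ($candidates as $line) {
        $lev = levenshtein($classification[CLASSIFICATION_SUBURB], $line[RTT_SUBURB]);
        // ... same exact-match / closest-match logic as in the question
    }
}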

It depends on your data structure, but a better way may be to use array_merge coupled with array_unique, as in the sketch below.
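
A minimal sketch of that combination, using illustrative arrays of matched suburbs ($classificationMatches and $lineMatches are hypothetical names, not variables from the question):

// Illustrative inputs: suburbs matched from each dataset.
$classificationMatches = ['SANDTON', 'SOWETO', 'SANDTON'];
$lineMatches = ['SOWETO', 'MIDRAND'];

// Merge the two result sets, then drop the duplicates.
$allMatches = array_unique(array_merge($classificationMatches, $lineMatches));
// Values: SANDTON, SOWETO, MIDRAND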

ceadreak

What tolerance did you use for the Levenshtein distance algorithm? In my experience, a similarity threshold below 0.8 returns far too many false matches. I ended up using manual corrections for short words, such as raod = road; otherwise a single wrong character in a four-letter word makes it only a 75% match (see the sketch after this list). I found an article with 12 tests for finding addresses using fuzzy matching that could be useful for improving your algorithm. The examples include:

  1. Spelling Mistakes
  2. Missing Space
  3. Incorrect Type (Street vs Road)
  4. Bordering / Nearby Suburb
  5. Abbreviations
  6. Synonyms: Floor vs Level
  7. Unit, Flat or Apartment vs Letter
  8. Number vs Letter
  9. Extra Words (e.g. Front Door, Department Name)
  10. Swapped Letters
  11. Sounds Like
  12. Tokenisation (Different Input Order)
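
A minimal sketch of the manual-correction idea mentioned above, with an illustrative correction map and a hypothetical similarityPercent() helper built on levenshtein():

// Illustrative map of known short-word misspellings.
$corrections = ['RAOD' => 'ROAD', 'STERET' => 'STREET'];

// Hypothetical helper: similarity as a percentage of the longer string.
function similarityPercent($a, $b) {
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0;
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

$word = strtr(strtoupper('raod'), $corrections); // 'ROAD'

Note that plain levenshtein() counts the swapped letters in RAOD as two edits, so similarityPercent('RAOD', 'ROAD') is only 50.0; after the correction map is applied the score is 100.0.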
Strydom