Match items from two sets of data by highest % of similarities

Question

Task: I have two columns with product names. I need to find the most similar cell from Column B for Cell A1, then for A2, A3 and so on.

Input:

Col A | Col B
-------------
Red   | Blackwell
Black | Purple      
White | Whitewater     
Green | Reddit

Output:

Red = Reddit / 66% similar

Black = Blackwell / 71% similar

White = Whitewater / 66% similar

Green = Reddit / 30% similar

I think Levenshtein distance can help with sorting, but I don't know how to apply it.

Thanks in advance, any piece of information helps.

I've tried example one from this page: http://php.net/manual/en/function.levenshtein.php. It finds the closest string in an array for a single input. But I don't know how to automate it so that, instead of a single input item, it compares two arrays and finds the matches. — Igor Gorbenko, Feb 16 '18 at 11:55
Please post your best coding attempt in your question via an Edit. — mickmackusa, Feb 16 '18 at 11:57

score 3 · Accepted Answer · answered Feb 16 '18 at 13:13

Using nested loops

<?php

// Arrays of words
$colA = ['Red', 'Black', 'White', 'Green'];
$colB = ['Blackwell', 'Purple', 'Whitewater', 'Reddit'];

// loop through words to find the closest
foreach ($colA as $a) {

    // Current max number of matches
    $maxMatches = -1;
    $bestMatch = '';

    foreach ($colB as $b) {

        // Calculate the number of matches
        $matches = similar_text($a, $b, $percent);

        if ($matches > $maxMatches) {

            // Found a better match, update
            $maxMatches = $matches;
            $bestMatch = $b;
            $matchPercentage = $percent;

        }

    }

    echo "$a = $bestMatch / " . 
        number_format($matchPercentage, 2) . 
        "% similar\n";
}

The outer loop iterates over the elements of the first array. For each one it initializes the best match found so far and the number of matching characters for that match.

The inner loop iterates over the array of candidates looking for the best match. For each candidate it measures the similarity (you could use levenshtein here instead of similar_text, but the latter is convenient because it calculates the percentage for you). If the current candidate has more matching characters than the current best match, the best-match variables are updated.

For each word in the outer loop we echo the best match found and the percentage. Format as desired.
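Since the question asks for the highest percentage, here is a minimal sketch of the same nested-loop idea that compares the percentage rather than the raw count of matching characters. The helper name closestByPercent() is made up for illustration; only similar_text() and number_format() are PHP built-ins. On the sample data both criteria pick the same candidates, but on strings of very different lengths they can disagree.

```php
<?php

/**
 * Return the candidate with the highest similar_text() percentage.
 * closestByPercent() is a hypothetical helper name, not a PHP built-in.
 */
function closestByPercent(string $needle, array $candidates): array
{
    $best = ['match' => null, 'percent' => -1.0];

    foreach ($candidates as $candidate) {
        // similar_text() writes the similarity percentage into $percent
        similar_text($needle, $candidate, $percent);

        if ($percent > $best['percent']) {
            // Found a higher percentage, update the best match
            $best = ['match' => $candidate, 'percent' => $percent];
        }
    }

    return $best;
}

$colA = ['Red', 'Black', 'White', 'Green'];
$colB = ['Blackwell', 'Purple', 'Whitewater', 'Reddit'];

foreach ($colA as $a) {
    $best = closestByPercent($a, $colB);
    echo "$a = {$best['match']} / " . number_format($best['percent'], 2) . "% similar\n";
}
```

Wrapping the inner loop in a function also avoids reading a best-match variable left over from a previous iteration.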

I didn't know about the `percent` parameter. I'm going to update my answer. +1 for showing me something new. — mickmackusa, Feb 16 '18 at 13:38

mickmackusa · Answer 2 · 2018-02-16T13:44:01.223

I am not sure how you are deriving these desired percentages, so I'll just use the values that the PHP functions churn out and you can decide if you want to perform any calculations on them.

levenshtein() simply doesn't deliver the desired matches that you have requested in your question. I think you would be wiser to use similar_text().

Code:

$arrayA=['Red','Black','White','Green'];
$arrayB=['Blackwell','Purple','Whitewater','Reddit'];

// similar text
foreach($arrayA as $a){
    $temp=array_combine($arrayB,array_map(function($v)use($a){similar_text($v,$a,$percent); return $percent;},$arrayB));  // generate assoc array of assessments
    arsort($temp);  // sort descending 
    $result[]="$a is most similar to ".key($temp)." (sim-score:".number_format(current($temp))."%)";  // access first key and value
}
var_export($result);

echo "\n--\n";
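// levenshtein doesn't offer the desired matching
foreach($arrayA as $a){
    $temp=array_combine($arrayB,array_map(function($v)use($a){return levenshtein($v,$a);},$arrayB));  // generate assoc array of assessments
    // note: levenshtein() returns an edit distance, so after a descending sort the first element
    // is the candidate with the largest distance; asort() would put the smallest distance first
    arsort($temp);  // sort descending 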
    $result2[]="$a is most similar to ".key($temp)." (lev-score:".current($temp).")";  // access first key and value
}
var_export($result2);

Output:

array (
  0 => 'Red is most similar to Reddit (sim-score:67%)',
  1 => 'Black is most similar to Blackwell (sim-score:71%)',
  2 => 'White is most similar to Whitewater (sim-score:67%)',
  3 => 'Green is most similar to Purple (sim-score:36%)',
)
--
array (
  0 => 'Red is most similar to Whitewater (lev-score:9)',
  1 => 'Black is most similar to Whitewater (lev-score:9)',
  2 => 'White is most similar to Blackwell (lev-score:8)',
  3 => 'Green is most similar to Blackwell (lev-score:8)',
)
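Because levenshtein() returns an edit distance, a lower score means a closer match. As a hedged sketch (not part of the answer above), the same array_combine()/array_map() approach can rank candidates in ascending order with asort(). The percentage shown is one common normalisation, 1 - distance / length of the longer string. That formula is an assumption for illustration and does not come from the question.

```php
<?php

$arrayA = ['Red', 'Black', 'White', 'Green'];
$arrayB = ['Blackwell', 'Purple', 'Whitewater', 'Reddit'];

$closest = [];
foreach ($arrayA as $a) {
    // Map each candidate to its edit distance from $a
    $distances = array_combine(
        $arrayB,
        array_map(function ($v) use ($a) { return levenshtein($v, $a); }, $arrayB)
    );

    asort($distances); // ascending: smallest distance (closest match) first

    $match    = key($distances);
    $distance = current($distances);

    // Assumed normalisation: share of the longer string that did not need editing
    $percent = (1 - $distance / max(strlen($a), strlen($match))) * 100;

    $closest[$a] = ['match' => $match, 'distance' => $distance, 'percent' => $percent];
    echo "$a is closest to $match (lev-score:$distance, ~" . number_format($percent) . "%)\n";
}
```

Several candidates can share the same distance (for "White" three of the four sample candidates tie), and which one is listed first then depends on how the sort orders equal values, so a tie-breaking rule may be needed.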

Thanks, I did try similar_text(), but the experiments I ran showed better results with levenshtein(). I used it on items that are most similar, like "Apple iPhone X 128 GB" / "Iphone X (128 GB)". But the way you worked on matching arrays is what I was looking for. Thanks! — Igor Gorbenko, Feb 16 '18 at 14:16
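For product names like the ones in this comment, differences in case and punctuation inflate the edit distance. A minimal sketch, assuming it is acceptable to lowercase the names and collapse non-alphanumeric characters before comparing: normalizeName() and closestNormalized() are hypothetical helpers, and the extra candidate names are made-up sample data.

```php
<?php

// Hypothetical helper: lowercase, then collapse every run of characters
// that is not a letter or digit into a single space.
function normalizeName(string $name): string
{
    return trim(preg_replace('/[^a-z0-9]+/', ' ', strtolower($name)));
}

// Hypothetical helper: return the candidate whose normalized form has the
// smallest Levenshtein distance to the normalized needle (null if none).
function closestNormalized(string $needle, array $candidates): ?string
{
    $best = null;
    $bestDistance = PHP_INT_MAX;
    $needleNorm = normalizeName($needle);

    foreach ($candidates as $candidate) {
        $distance = levenshtein($needleNorm, normalizeName($candidate));

        if ($distance < $bestDistance) {
            // Smaller distance means a closer match
            $bestDistance = $distance;
            $best = $candidate;
        }
    }

    return $best;
}

// Made-up sample data for illustration
$candidates = ['Samsung Galaxy S9', 'Iphone X (128 GB)', 'Sony Xperia XZ2'];
echo closestNormalized('Apple iPhone X 128 GB', $candidates), "\n";
```

The original, un-normalized candidate is returned so the result can still be looked up in the source data.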
