3

Trying to compare two columns in GoogleSheets with this formula in Column C:

=if(A1=B1,"","Mismatch")

Works fine, but I'm getting a lot of false positives:

A. B C
MARY JO Mary Jo
JAY, TIM TIM JAY Mismatch
Sam Ron Sam Ron Mismatch
Jack *Ma Jack MA Mismatch

Any ideas how to work this?

player0
  • 124,011
  • 12
  • 67
  • 124
clucko87
  • 91
  • 6
  • 1
    Do you mean false negatives? are you expecting all of them to be matches? or are you expecting that only 2nd row is a mismatch? Please include a more descriptive list that will show the expected results of comparing the 2 columns. – NightEye Feb 04 '22 at 17:33

4 Answers4

5

This uses a score based approach to determine a match. You can determine what is/isn't a match based on that score:

enter image description here

Score Formula = getMatchScore(A1,B1)
Match Formula = if(C1<.7,"mismatch",)
function getMatchScore(strA, strB, ignoreCase=true) {
  strA = String(strA);
  strB = String(strB)
  const toLowerCase = ignoreCase ? str => str.toLowerCase() : str => str;
  const splitWords = str => str.split(/\b/);
  let [maxLenStr, minLenStr] = strA.length > strB.length ? [strA, strB] : [strB, strA]; 
  
  maxLenStr = toLowerCase(maxLenStr);
  minLenStr = toLowerCase(minLenStr);

  const maxLength = maxLenStr.length;
  const minLength = minLenStr.length;
  const lenScore = minLength / maxLength;

  const orderScore = Array.from(maxLenStr).reduce(
    (oldItem, nItem, index) => nItem === minLenStr[index] ? oldItem + 1 : oldItem, 0
  ) / maxLength;

  const maxKeyWords = splitWords(maxLenStr);
  const minKeyWords = splitWords(minLenStr);

  const keywordScore = minKeyWords.reduce(({ score, searchWord }, nItem) => {
    const newSearchWord = searchWord?.replace(new RegExp(nItem, ignoreCase ? 'i' : ''), '');
    score += searchWord.length != newSearchWord.length ? 1: 0;

    return { score, searchWord: newSearchWord };
  }, { score: 0, searchWord: maxLenStr }).score / minKeyWords.length;

  const sortedMaxLenStr = Array.from(maxKeyWords.sort().join(''));
  const sortedMinLenStr = Array.from(minKeyWords.sort().join(''));

  const charScore = sortedMaxLenStr.reduce((oldItem, nItem, index) => { 
    const surroundingChars = [sortedMinLenStr[index-1], sortedMinLenStr[index], sortedMinLenStr[index+1]]
    .filter(char => char != undefined);
    
    return surroundingChars.includes(nItem)? oldItem + 1 : oldItem
  }, 0) / maxLength;

  const score = (lenScore * .15) + (orderScore * .25) + (charScore * .25) + (keywordScore * .35);

  return score;
}
first last
  • 396
  • 2
  • 5
3

Implementing fuzzy matching via Google Sheets formula would be difficult. I would recommend using a custom formula for this one or a full blown script (both via Google Apps Script) if you want to populate all rows at once.

Custom Formula:

function fuzzyMatch(string1, string2) {
  string1 = string1.toLowerCase()
  string2 = string2.toLowerCase();
  var n = -1;

  for(i = 0; char = string2[i]; i++)
    if (!~(n = string1.indexOf(char, n + 1))) 
      return 'Mismatch';
};

What this does is compare if the 2nd string's characters order is found in the same order as the first string. See sample data below for the case where it will return mismatch.

Output:

output

Note:

  • Last row is a mismatch as 2nd string have r in it that isn't found at the first string thus correct order is not met.
  • If this didn't meet your test cases, add a more definitive list that will show the expected output of the formula/function so this can be adjusted, or see player0's answer which solely uses Google Sheets formula and is less stricter with the conditions.

Reference:

NightEye
  • 10,634
  • 2
  • 5
  • 24
3

try:

=ARRAYFORMULA(IFERROR(IF(LEN(
 REGEXREPLACE(REGEXREPLACE(LOWER(A1:A), "[^a-z ]", ), 
 LOWER("["&B1:B&"]"), ))>0, "mismatch", )))

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124
0

The main limitation of traditional fuzzy matching is that it doesn’t take into consideration similarities outside of the strings. Topic clustering requires semantic understanding. Goodlookup is a smart function for spreadsheet users that gets very close to semantic understanding. It’s a pre-trained model that has the intuition of GPT-3 and the join capabilities of fuzzy matching. Use it like vlookup or index match to speed up your topic clustering work in google sheets.

https://www.goodlookup.com/

Justin
  • 1