algorithm to find most matching word from two strings

Question

I have 2 strings and my goal is to match as many words as possible.

The problem is that the 2 strings are similar but not equal (for example, one of the strings may have a missing word or a misspelled word).

example:

var str1 = "I like this soup because it is very tasty, like the one that my grandma used to make";
var str2 = "I really lie this soup, it is very tasty, like the one that my grandma use to make";

In this case 'str1' is the correct string, so I have to match as many words as possible from 'str2', which contains an unnecessary "really" and has "like" misspelled as "lie".

Now, an easy solution could be to check every word like this

var split1 = str1.split(/[\s,]+/);
var split2 = str2.split(/[\s,]+/);
var i , j = 0;
var found;
for(i = 0 ; i < split1.length ; i++){
   found=false;
   for( ; j < split2.length && !found; j++){
      if(split1[i]==split2[j]){
         found=true;
         //do something here
      }
   }
}

but there is actually a big problem: the second "like" from str2 could be matched with the first "like" from str1.

Lastly, the goal of the algorithm is to match as many words as possible, and if a word has no match, to skip it and carry on with the rest.

I'd recommend looking into fuzzy matching for this. If the difference between the strings is arbitrary, it's unlikely you'll be able to solve this without. Python has some good fuzzy matching libraries - look into fuzzywuzzy https://pypi.org/project/fuzzywuzzy/ — bm13563, Aug 27 '20 at 08:43
Another option (although per Nina's ask, am not sure of the desired result) is patiencediffplus, which also respects the order of the words when performing a comparison. See https://stackoverflow.com/questions/57102484/find-difference-between-two-strings-in-javascript/57103581#57103581 . Note that you'll have to prep the sentences by removing the punctuation from the sentences, likely standardizing the case (eg, set all to lowercase), and then splitting on spaces prior to calling this function. — Trentium, Aug 27 '20 at 17:47
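
The preparation steps mentioned in the comment above (strip punctuation, lowercase, split on spaces) could look roughly like this. It is only a sketch: toWords is a hypothetical helper name, and treating everything except word characters, apostrophes and whitespace as punctuation is an assumption.

```javascript
// Hypothetical helper: normalize a sentence into an array of lowercase words.
function toWords(sentence) {
    return sentence
        .toLowerCase()               // standardize the case
        .replace(/[^\w\s']/g, '')    // drop punctuation (assumed: keep apostrophes)
        .split(/\s+/)                // split on runs of whitespace
        .filter(Boolean);            // discard empty strings from leading/trailing spaces
}

const preparedWords = toWords('I really lie this soup, it is very tasty');
console.log(preparedWords);
```

The resulting arrays can then be passed to whichever comparison function is used.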

Ashen Gunaratne · Answer 1 · 2020-08-27T09:25:23.953

If I grasped your requirement comprehensively, the object freqC should satisfy your need.

const str1 = 'I like this soup because it is very tasty, like the one that my grandma used to make';
const str2 = 'I really lie this soup, it is very tasty, like the one that my grandma use to make';

// Split on whitespace and commas so that 'soup,' and 'soup' count as the same word.
const words = (str) => str.toLowerCase().split(/[\s,]+/);

// Word frequencies of str1.
const freq1 = words(str1).reduce((accumulator, key) =>
  Object.assign(accumulator, { [key]: (accumulator[key] || 0) + 1 }), {});

// Combined frequencies of both strings; seed with a copy so that freq1 is not mutated.
const freqC = words(str2).reduce((accumulator, key) =>
  Object.assign(accumulator, { [key]: (accumulator[key] || 0) + 1 }), { ...freq1 });

console.info(freqC);

Please update the post with expected results if the above solution doesn't suit your requirement.

score 0 · Answer 2 · answered Aug 27 '20 at 14:07

You could apply Levenshtein distance to check whether two words are similar. I will denote it as Ld subsequently. Now, if you know that str1 is correct, then you can do as follows:

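// Returns the largest number of in-order word matches between the two arrays.
// Two words match when their Levenshtein distance Ld is below `limit`.
// Note: this plain recursion is exponential in the worst case; memoizing on
// (cIndex, iIndex) would help for longer inputs.
function getMostMatches(correct, interesting, limit, cIndex, iIndex) {
    if (!cIndex) cIndex = 0;
    if (!iIndex) iIndex = 0;
    var maxScore = 0;
    // Try every remaining correct word as the next matched word.
    for (var c = cIndex; c < correct.length; c++) {
        // Restart the scan of `interesting` for each correct word, so an
        // unmatched correct word no longer exhausts the candidates.
        for (var i = iIndex; i < interesting.length; i++) {
            if (Ld(correct[c], interesting[i]) < limit) {
                var score = 1 + getMostMatches(correct, interesting, limit, c + 1, i + 1);
                if (score > maxScore) maxScore = score;
            }
        }
    }
    return maxScore;
}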

var correct = str1.split(" ");
var interesting = str2.split(" ");
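
The answer does not define Ld. A minimal sketch of the standard dynamic-programming Levenshtein distance, named Ld only to match the call above, might look like this. The threshold of 2 in the usage lines is an assumption, not something the answer specifies.

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn s into t.
function Ld(s, t) {
    // prev[j] = distance between the first i-1 characters of s and the first j of t.
    let prev = [];
    for (let j = 0; j <= t.length; j++) prev[j] = j;

    for (let i = 1; i <= s.length; i++) {
        const curr = [i]; // turning i characters of s into the empty string takes i deletions
        for (let j = 1; j <= t.length; j++) {
            const cost = s[i - 1] === t[j - 1] ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,        // deletion
                curr[j - 1] + 1,    // insertion
                prev[j - 1] + cost  // substitution, or a free match
            );
        }
        prev = curr;
    }
    return prev[t.length];
}

// With an assumed limit of 2, "lie" still counts as a match for "like".
console.log(Ld('like', 'lie') < 2);
console.log(Ld('like', 'really') < 2);
```

A small limit keeps misspellings such as "lie" or "use" while rejecting unrelated words such as "really"; the right value depends on how long the words are.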

score 0 · Answer 3 · answered Aug 28 '20 at 22:05

To expand on my earlier comment, using the patienceDiff / patienceDiffPlus algorithm ( see https://github.com/jonTrent/PatienceDiff ) might be a good fit for your situation, as the patienceDiff algorithm is generally good for highlighting the deltas between two strings that are very similar with only some minor differences. The algorithm in your case can be used as follows, with the first step to remove the commas and split the sentences into arrays of words...

var str1 = "I like this soup because it is very tasty, like the one that my grandma used to make";
var str2 = "I really lie this soup, it is very tasty, like the one that my grandma use to make";

let a = str1.split( ',' ).join( '' ).split( ' ');
let b = str2.split( ',' ).join( '' ).split( ' ');
let pdp = patienceDiffPlus( a, b )

console.log( pdp );

...results in...

Object
  lineCountDeleted: 3
  lineCountInserted: 3
  lineCountMoved: 0
  lines: Array(21)
    0: {line: "I", aIndex: 0, bIndex: 0}
    1: {line: "like", aIndex: 1, bIndex: -1}
    2: {line: "really", aIndex: -1, bIndex: 1}
    3: {line: "lie", aIndex: -1, bIndex: 2}
    4: {line: "this", aIndex: 2, bIndex: 3}
    5: {line: "soup", aIndex: 3, bIndex: 4}
    6: {line: "because", aIndex: 4, bIndex: -1}
    7: {line: "it", aIndex: 5, bIndex: 5}
    8: {line: "is", aIndex: 6, bIndex: 6}
    9: {line: "very", aIndex: 7, bIndex: 7}
    10: {line: "tasty", aIndex: 8, bIndex: 8}
    11: {line: "like", aIndex: 9, bIndex: 9}
    12: {line: "the", aIndex: 10, bIndex: 10}
    13: {line: "one", aIndex: 11, bIndex: 11}
    14: {line: "that", aIndex: 12, bIndex: 12}
    15: {line: "my", aIndex: 13, bIndex: 13}
    16: {line: "grandma", aIndex: 14, bIndex: 14}
    17: {line: "used", aIndex: 15, bIndex: -1}
    18: {line: "use", aIndex: -1, bIndex: 15}
    19: {line: "to", aIndex: 16, bIndex: 16}
    20: {line: "make", aIndex: 17, bIndex: 17}
    length: 21

...where:

If aIndex = -1, the word appears only in the b array (it has no counterpart in a).
If bIndex = -1, the word appears only in the a array (it has no counterpart in b).
If aIndex and bIndex are both non-negative, a match was found at the corresponding indexes of the two arrays.
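
As a rough illustration of reading these indexes, the snippet below separates matched words from unmatched ones. The sampleLines array is just the first five entries of the output above, copied by hand, so the snippet runs without the library. The variable names are made up for this example.

```javascript
// First five entries of the word-level result shown above.
const sampleLines = [
    { line: 'I', aIndex: 0, bIndex: 0 },
    { line: 'like', aIndex: 1, bIndex: -1 },
    { line: 'really', aIndex: -1, bIndex: 1 },
    { line: 'lie', aIndex: -1, bIndex: 2 },
    { line: 'this', aIndex: 2, bIndex: 3 },
];

// Present in both arrays: a real match.
const matched = sampleLines.filter(e => e.aIndex >= 0 && e.bIndex >= 0);
// Present only in a (missing from b).
const onlyInA = sampleLines.filter(e => e.bIndex === -1).map(e => e.line);
// Present only in b (extra or misspelled words).
const onlyInB = sampleLines.filter(e => e.aIndex === -1).map(e => e.line);

console.log(matched.map(e => e.line), onlyInA, onlyInB);
```

The same filters can be applied to pdp.lines from the real result.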

Also note that if you perform a patienceDiff character-by-character, that is, splitting the sentences into arrays of characters...

let a = str1.split( '' );
let b = str2.split( '' );
let pdp = patienceDiff( a, b )

console.log( pdp );

...then the result will be...

0: {line: "I", aIndex: 0, bIndex: 0}
1: {line: " ", aIndex: 1, bIndex: 1}
2: {line: "r", aIndex: -1, bIndex: 2}
3: {line: "e", aIndex: -1, bIndex: 3}
4: {line: "a", aIndex: -1, bIndex: 4}
5: {line: "l", aIndex: -1, bIndex: 5}
6: {line: "l", aIndex: -1, bIndex: 6}
7: {line: "y", aIndex: -1, bIndex: 7}
8: {line: " ", aIndex: -1, bIndex: 8}
9: {line: "l", aIndex: 2, bIndex: 9}
10: {line: "i", aIndex: 3, bIndex: 10}
11: {line: "k", aIndex: 4, bIndex: -1}
12: {line: "e", aIndex: 5, bIndex: 11}
13: {line: " ", aIndex: 6, bIndex: 12}
14: {line: "t", aIndex: 7, bIndex: 13}
15: {line: "h", aIndex: 8, bIndex: 14}
16: {line: "i", aIndex: 9, bIndex: 15}
17: {line: "s", aIndex: 10, bIndex: 16}
18: {line: " ", aIndex: 11, bIndex: 17}
      o
      o
      o
84: {line: " ", aIndex: 76, bIndex: 74}
85: {line: "t", aIndex: 77, bIndex: 75}
86: {line: "o", aIndex: 78, bIndex: 76}
87: {line: " ", aIndex: 79, bIndex: 77}
88: {line: "m", aIndex: 80, bIndex: 78}
89: {line: "a", aIndex: 81, bIndex: 79}
90: {line: "k", aIndex: 82, bIndex: 80}
91: {line: "e", aIndex: 83, bIndex: 81}

...which shows the addition of the word 'really' in the b array, and also that the 'k' is missing in the b array within the word 'like'. Employing the patienceDiff algorithm character-by-character might suit your needs better, depending on how finely you wish to match the words.
