11

I'm using the Levenshtein Distance algorithm in C++ to compare two strings to measure how close they are to each other. However, the plain Levenshtein Distance algorithm does not distinguish word boundaries as delimited by spaces. This results in smaller distance calculations than I want. I'm comparing titles to see how close they are to each other and I wish for the algorithm to not count characters as matching if they come from across multiple words.

For example, if I compare these two strings I get the following result with + designating a match and - designating a non-match:

Al Chertoff Et
Al Church Department of finance Et
+++++------+--++-----++-+------+++
Al Ch      e  rt     of f       Et

I get a get a distance of 20 with the word "Chertoff" matching across the four words "Church Department of finance" whereas, I really want them to be considered further apart from each other by not allowing characters to match from more than one word and getting a distance of 25 with the word "Chertoff" most matching the one word "Department", with three characters matching:

Al Chertoff Et
Al Church Department of finance Et
+++--------+--++---------------+++
Al         e  rt                Et
         Ch     off

How could I adapt the Levenshtein Distance to accomplish this or is there another distance algorithm that would be better suited for this? Perhaps using the Levenshtein distance on each word individually word work and choosing the word with the least distance? However, what if matching one word well deep into the string caused the subsequent words to match poorly because their matches were best earlier in the string? Could this somehow be done with Levenshtein distance adapted to be at a word level?

For example, the shortest distance by this idea for the following more complicated example is 20:

Al Chertoff Deport Et
Al Church Department of finance Et
+++++----++++-++---------------+++
Al Ch     Dep rt                Et
     ertoff  o

Instead of maximizing "Chertoff"'s match and getting the longer distance of 24:

Al Chertoff Deport Et
Al Church Department of finance Et
+++--------+--++-----+---------+++
Al         e  rt     o          Et
         Ch     off
                  Dep rt

My current implementation of the Levenshtein Distance is as follows:

size_t
levenshtein_distance(const std::string& a_compare1,
                     const std::string& a_compare2) {
  const size_t length1 = a_compare1.size();
  const size_t length2 = a_compare2.size();
  std::vector<size_t> curr_col(length2 + 1);
  std::vector<size_t> prev_col(length2 + 1);

  // Prime the previous column for use in the following loop:
  for (size_t idx2 = 0; idx2 < length2 + 1; ++idx2) {
    prev_col[idx2] = idx2;
  }

  for (size_t idx1 = 0; idx1 < length1; ++idx1) {
    curr_col[0] = idx1 + 1;

    for (size_t idx2 = 0; idx2 < length2; ++idx2) {
      const size_t compare = a_compare1[idx1] == a_compare2[idx2] ? 0 : 1;

      curr_col[idx2 + 1] = std::min(std::min(curr_col[idx2] + 1,
                                             prev_col[idx2 + 1] + 1),
                                    prev_col[idx2] + compare);
    }

    curr_col.swap(prev_col);
  }

  return prev_col[length2];
}
WilliamKF
  • 41,123
  • 68
  • 193
  • 295
  • "However, what if matching one word well deep into the string caused the subsequent words to match poorly because their matches were best earlier in the string" Can you explain that statement with an example? Give an example of matching that you like. – ElKamina Mar 14 '13 at 18:17
  • That is shown by the second example `"Al Church Department of finance Et"`. – WilliamKF Mar 14 '13 at 18:18
  • It’s a bit late for me but couldn’t you just split by words and match words separately? You’d in effect have two edit distance calculations: one on the level of words, and one (inside the first one) on the level of letters within two words (only if the words don’t match). /EDIT This sounds suspiciously similar to Stephen’s answer. – Konrad Rudolph Mar 14 '13 at 23:44
  • @KonradRudolph yeah that's basically it, you just have to model the insertion, deletion, and replacement costs for words within sentences properly (since for letters within words they are always 1, and the algorithm is usually written with that assumption) – Stephen Lin Mar 15 '13 at 01:04

2 Answers2

6

I can get pretty close to what you want by making levenshtein_distance a generic algorithm on a sequence container and including a cost function that calculates the distance between two elements:

template<typename T, typename C>
size_t
seq_distance(const T& seq1, const T& seq2, const C& cost,
             const typename T::value_type& empty = typename T::value_type()) {
  const size_t size1 = seq1.size();
  const size_t size2 = seq2.size();

  std::vector<size_t> curr_col(size2 + 1);
  std::vector<size_t> prev_col(size2 + 1);

  // Prime the previous column for use in the following loop:
  prev_col[0] = 0;
  for (size_t idx2 = 0; idx2 < size2; ++idx2) {
    prev_col[idx2 + 1] = prev_col[idx2] + cost(empty, seq2[idx2]);
  }

  for (size_t idx1 = 0; idx1 < size1; ++idx1) {
    curr_col[0] = curr_col[0] + cost(seq1[idx1], empty);

    for (size_t idx2 = 0; idx2 < size2; ++idx2) {
      curr_col[idx2 + 1] = std::min(std::min(
        curr_col[idx2] + cost(empty, seq2[idx2]),
        prev_col[idx2 + 1] + cost(seq1[idx1], empty)),
        prev_col[idx2] + cost(seq1[idx1], seq2[idx2]));
    }

    curr_col.swap(prev_col);
    curr_col[0] = prev_col[0];
  }

  return prev_col[size2];
}

Given the above seq_distance, the edit distance between two sentences such that edits can not be made between word boundaries, can be defined with the following:

size_t
letter_distance(char letter1, char letter2) {
  return letter1 != letter2 ? 1 : 0;
}

size_t
word_distance(const std::string& word1, const std::string& word2) {
  return seq_distance(word1, word2, &letter_distance);
}

size_t
sentence_distance(const std::string& sentence1, const std::string& sentence2) {
  std::vector<std::string> words1;
  std::vector<std::string> words2;
  std::istringstream iss1(sentence1);
  std::istringstream iss2(sentence2);
  std::copy(std::istream_iterator<std::string>(iss1),
            std::istream_iterator<std::string>(),
            std::back_inserter(words1));
  std::copy(std::istream_iterator<std::string>(iss2),
            std::istream_iterator<std::string>(),
            std::back_inserter(words2));
  return seq_distance(words1, words2, &word_distance);
}

Here's the code working on ideone. I've tested a few cases and I'm pretty sure it does the right thing, but you should try it out more to make sure the results are reasonable.

Note that this isn't exactly what you asked for, since it ignores all spaces in the edit distance measurement: I think it shouldn't be too hard to modify it not to do that, but I haven't thought it through completely. In any case, this might be just as good (or even better), depending on your needs, so I'll let you decide if you want to try to tweak it.

Just a minor note, your original code was slightly buggy in that the following two lines:

curr_col.reserve(length2 + 1);
prev_col.reserve(length2 + 1);

reserve capacity in the vectors, but do not actually change the sizes of them, so accessing the array after that was undefined behavior. You should actually resize the vector if you're going to access elements in a range: reserve is usually for situations where you are about to push_back a certain number of elements one-by-one (which increases the size as you go, not all at once) and you want to avoid the cost of multiple internal reallocations (since the internal capacity only increases by a certain factor each time the capacity is exceeded).

EDIT:

This version takes into consideration spaces between words as part of the edit distance, but the results are still not exactly the same as your examples because of the requirement to add multiple spaces in some cases.

Stephen Lin
  • 5,470
  • 26
  • 48
  • Thanks for catching the bug! – WilliamKF Mar 14 '13 at 22:37
  • This is exactly what I was trying myself but I was getting caught up on what to replace the +1 with for `curr_col[idx2]` and `curr_col[idx2 + 1]`. How did you come to the conclusion that `seq2[idx2]` and `seq1[idx1]` were the correct corresponding values to use? – WilliamKF Mar 14 '13 at 22:41
  • it's `cost(empty, seq2[idx2])`, the distance between an empty element and `seq2[idx2]`, in one case, and `cost(seq1[idx1], empty))`, the distance between an empty element and `seq1[idx1]` in the other (basically string length in the case of words)...because these terms represent insertion and deletion of entire elements (i.e. words). I had to experiment to figure out which was which, actually, but there were only two choices. – Stephen Lin Mar 14 '13 at 22:44
  • (`cost(seq1[idx1], seq2[idx2]))` is for replacing one word with another...still unclear the direction but it's symmetric anyway) – Stephen Lin Mar 14 '13 at 22:46
  • For my usage, excluding spaces is beneficial. – WilliamKF Mar 14 '13 at 22:47
  • ok, then use the first version then...but try it out some more just in case. it seems to have the same results when you reverse `sentence1` and `sentence2`, so I think it's working correctly. – Stephen Lin Mar 14 '13 at 22:49
-2

Word boundaries will be crossed if the individual words are not of the same length. If you want to keep the indices being compared within the respective words then you will need to make the words of the same length. For example, here's a Javascript (yes, I know you asked or C++ but this is for illustration - code taken from Wikipedia) distance calculation routine:

var memo = {};

function d(str1, i, len1, str2, j, len2){
    var key = [i,len1,j,len2].join(',');
    if(memo[key] != undefined) return memo[key];

    if(len1 == 0) return len2;
    if(len2 == 0) return len1;
    var cost = 0;
    if(str1[i] != str2[j]) cost = 1;

    var dist = Math.min(
        d(str1, i+1,len1-1, str2,j,len2)+1, 
        d(str1,i,len1,str2,j+1,len2-1)+1,
        d(str1,i+1,len1-1,str2,j+1,len2-1)+cost);
    memo[key] = dist;
    return dist;
}

var str1 = "Al Chertoff Deport$$$$ $$ $$$$$$$ Et";
var str2 = "Al Church$$ Department of finance Et";

console.log(d(str1, 0, str1.length, str2, 0, str2.length));

Notice how I have modified the two input strings to match at individual word level. Running this I got a distance of 19. Similarly if I change the strings to:

var str1 = "Al Chertoff $$$$$$$$$$ $$ $$$$$$$ Et";
var str2 = "Al Church$$ Department of finance Et";

I get a distance of 24.

Satyajit
  • 3,839
  • 1
  • 19
  • 14
  • 2
    But how do you choose whether to put the `$`(s) before or after the final `Et`? (e.g. `"Al Chertoff Et$$$$$$$$ $$ $$$$$$$ $$"`) – WilliamKF Mar 14 '13 at 19:21
  • 1
    @Satyajit, this basically requires knowing the answer before running the algorithm – Stephen Lin Mar 14 '13 at 19:29
  • I wasn't saying that's what the OP should do, I was just making the point that preventing matches across word boundaries needs the words to be of the same length. You could also argue that with same length strings we are reducing this to Hamming distance calculation and not edit distance necessarily. – Satyajit Mar 14 '13 at 19:36
  • @Satyajit it doesn't answer the question, and doesn't even reduce its complexity; there's also simpler ways of preventing matches across word boundaries (just tag each word with a numerical word index and don't let letters with mismatching word indexes compare equal) but if you look at his question, it's clear that he wants this to handle words being added or subtracted in arbitrary order; your solutions technically allows that but only if you know ahead of time what words are supposed to match to begin with, which is not a simpler problem than the original one – Stephen Lin Mar 14 '13 at 19:44
  • @Satyajit (and the tagging algorithm could do the same thing, if you knew ahead of time which words were supposed to line up) – Stephen Lin Mar 14 '13 at 19:45