I'm using the Levenshtein Distance algorithm in C++ to compare two strings to measure how close they are to each other. However, the plain Levenshtein Distance algorithm does not distinguish word boundaries as delimited by spaces. This results in smaller distance calculations than I want. I'm comparing titles to see how close they are to each other and I wish for the algorithm to not count characters as matching if they come from across multiple words.
For example, if I compare these two strings, I get the following result, with + designating a match and - designating a non-match:
Al Chertoff Et
Al Church Department of finance Et
+++++------+--++-----++-+------+++
Al Ch      e  rt     of f       Et
I get a distance of 20, with the word "Chertoff" matching across the four words "Church Department of finance". What I really want is for the two strings to be considered further apart, by not allowing the characters of one word to match across more than one word. That would give a distance of 25, with "Chertoff" best matching the single word "Department", with three characters matching:
Al Chertoff Et
Al Church Department of finance Et
+++--------+--++---------------+++
Al         e  rt                Et
         Ch     off
How could I adapt the Levenshtein Distance to accomplish this, or is there another distance algorithm that would be better suited? Perhaps running the Levenshtein distance on each word individually and choosing the word with the least distance would work? However, what if matching one word well deep into the string caused the subsequent words to match poorly, because their best matches were earlier in the string? Could this somehow be done with the Levenshtein distance adapted to operate at the word level?
For example, the shortest distance by this idea for the following more complicated example is 20:
Al Chertoff Deport Et
Al Church Department of finance Et
+++++----++++-++---------------+++
Al Ch     Dep rt                Et
     ertoff  o
Instead of maximizing the match for "Chertoff" and getting the longer distance of 24:
Al Chertoff Deport Et
Al Church Department of finance Et
+++--------+--++-----+---------+++
Al         e  rt     o          Et
         Ch     off
                  Dep rt
My current implementation of the Levenshtein Distance is as follows:
#include <algorithm>  // std::min
#include <cstddef>    // size_t
#include <string>
#include <vector>

size_t
levenshtein_distance(const std::string& a_compare1,
                     const std::string& a_compare2) {
    const size_t length1 = a_compare1.size();
    const size_t length2 = a_compare2.size();
    // Only two columns of the full distance matrix are kept in memory.
    std::vector<size_t> curr_col(length2 + 1);
    std::vector<size_t> prev_col(length2 + 1);

    // Prime the previous column for use in the following loop:
    for (size_t idx2 = 0; idx2 < length2 + 1; ++idx2) {
        prev_col[idx2] = idx2;
    }

    for (size_t idx1 = 0; idx1 < length1; ++idx1) {
        curr_col[0] = idx1 + 1;
        for (size_t idx2 = 0; idx2 < length2; ++idx2) {
            // Substitution is free when the characters already match.
            const size_t compare = a_compare1[idx1] == a_compare2[idx2] ? 0 : 1;
            curr_col[idx2 + 1] = std::min(std::min(curr_col[idx2] + 1,
                                                   prev_col[idx2 + 1] + 1),
                                          prev_col[idx2] + compare);
        }
        curr_col.swap(prev_col);
    }
    return prev_col[length2];
}
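
To make the word-level idea from the question concrete, here is a minimal sketch of one possible adaptation, not a definitive answer. It runs the same dynamic program over words instead of characters. The cost model is an assumption: inserting or deleting a whole word costs its length, substituting one word for another costs the ordinary character-level distance between the two words, and spaces are not counted at all. The names char_distance, split_words and word_level_distance are hypothetical helpers introduced only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Character-level Levenshtein distance (two-row dynamic programming).
std::size_t char_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 0; i < a.size(); ++i) {
        curr[0] = i + 1;
        for (std::size_t j = 0; j < b.size(); ++j) {
            const std::size_t subst = prev[j] + (a[i] == b[j] ? 0 : 1);
            curr[j + 1] = std::min({curr[j] + 1, prev[j + 1] + 1, subst});
        }
        curr.swap(prev);
    }
    return prev[b.size()];
}

// Hypothetical helper: split a title into words on whitespace.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string word; in >> word;) words.push_back(word);
    return words;
}

// Word-level edit distance under the assumed cost model:
//   insert/delete a word  -> length of that word
//   substitute a word     -> char_distance between the two words
// A word can only ever be aligned with a single word on the other side.
std::size_t word_level_distance(const std::string& s1, const std::string& s2) {
    const std::vector<std::string> w1 = split_words(s1);
    const std::vector<std::string> w2 = split_words(s2);
    std::vector<std::size_t> prev(w2.size() + 1), curr(w2.size() + 1);

    // First row: build the first j words of s2 purely by insertion.
    prev[0] = 0;
    for (std::size_t j = 0; j < w2.size(); ++j) {
        prev[j + 1] = prev[j] + w2[j].size();
    }

    for (std::size_t i = 0; i < w1.size(); ++i) {
        curr[0] = prev[0] + w1[i].size();  // delete every word of s1 so far
        for (std::size_t j = 0; j < w2.size(); ++j) {
            const std::size_t ins = curr[j] + w2[j].size();
            const std::size_t del = prev[j + 1] + w1[i].size();
            const std::size_t sub = prev[j] + char_distance(w1[i], w2[j]);
            curr[j + 1] = std::min({ins, del, sub});
        }
        curr.swap(prev);
    }
    return prev[w2.size()];
}
```

Because the outer dynamic program considers every order-preserving pairing of words, it also addresses the worry about one good match deep in the string hurting later words: the pairing is chosen globally rather than greedily. Under these assumed costs the totals differ from the 25 and 20 quoted above (the two example pairs come out at 23 and 19), since spaces are ignored and each word is charged separately, but "Chertoff" is still paired with "Department" in the first example and with "Church" in the second.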