In R distance between two sentences: Word-level comparison by minimum edit distance

Question

While trying to learn R, I want to implement the algorithm below in R. Consider the two lists below:

List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"

I want to find out how many actions it would take to transform 'list1' into 'list2'. As you can see I need only two actions: 1. Replace "red" with "blue". 2. Replace "car" with "bus".

But how can we find the number of actions automatically? There are several possible actions to transform the sentences: ADD, REMOVE, or REPLACE a word in the list. I will try my best to explain how the algorithm should work:

At the first step: I will create a table like this:

rows: i= 0,1,2,3, columns: j = 0,1,2,3

(example: value[0,0] = 0 , value[0, 1] = 1 ...)

                 crashed    red     car
         0          1        2       3

crashed  1
blue     2
bus      3

Now, I will try to fill the table. Please, note that each cell in the table shows the number of actions we need to do to reformat the sentence (ADD, remove, or replace). Consider the interaction between "crashed" and "crashed" (value[1,1]), obviously we don't need to change it so the value will be '0'. Since they are the same words. Basically, we got the diagonal value = value[0,0]

                 crashed    red     car
         0          1        2       3

crashed  1          0
blue     2
bus      3

Now, consider "crashed" and the second word of the sentence, which is "red". Since they are not the same word, we can calculate the number of changes like this:

min{value[0,1] , value[0,2] and value[1,1]} + 1 
min{ 1, 2, 0} + 1 = 1

Therefore, we need to just remove "red". So, the table will look like this:

                 crashed    red     car
         0          1        2       3

crashed  1          0        1
blue     2  
bus      3

And we will continue like this : "crashed" and "car" will be :

min{value[0,3], value[0,2] and value[1,2]} + 1 
min{3, 2, 1} +1 = 2

and the table will be:

                 crashed    red     car
         0          1        2       3

crashed  1          0        1       2
blue     2  
bus      3

And we will continue to do so. The final result will be:

             crashed    red     car
         0      1        2       3

crashed  1      0        1       2
blue     2      1        1       2
bus      3      2        2       2

As you can see the last number in the table shows the distance between two sentences: value[3,3] = 2

Basically, the algorithm should look like this:

 if (characters_in_header_of_matrix[j] == characters_in_column_of_matrix[i])

 then {
 value[i,j] = value[i-1, j-1]    #words match: take the 'DIAGONAL VALUE'
 }

 else {
 value[i,j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
 }
 endif

To compare the words in the header and in the first column of the matrix, I have used the strcmp() function, which gives a boolean value (TRUE or FALSE) for each pair of words. But I failed at implementing the rest. I'd appreciate your help on this one, thanks.
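The table-filling procedure described above can be sketched in base R as follows. This is a minimal sketch, not a definitive implementation: the function name `word_edit_table` and its arguments `from` and `to` are made up for illustration, and because R matrices are 1-indexed, the cell called value[i, j] in the description is stored at position [i + 1, j + 1].

```r
# Hypothetical helper: builds the full table of word-level edit distances.
# Columns follow `from`, rows follow `to`, as in the tables above.
word_edit_table <- function(from, to) {
  n_from <- length(from)
  n_to <- length(to)
  value <- matrix(0L, nrow = n_to + 1, ncol = n_from + 1,
                  dimnames = list(c("", to), c("", from)))
  # Border: turning an empty list into k words takes k ADD actions
  value[, 1] <- 0:n_to
  value[1, ] <- 0:n_from
  for (i in seq_len(n_to)) {
    for (j in seq_len(n_from)) {
      if (to[i] == from[j]) {
        # Same word: copy the diagonal value, no action needed
        value[i + 1, j + 1] <- value[i, j]
      } else {
        # Different words: cheapest neighbour (above, diagonal, left) plus one action
        value[i + 1, j + 1] <- min(value[i, j + 1], value[i, j], value[i + 1, j]) + 1L
      }
    }
  }
  value
}

tab <- word_edit_table(from = c("crashed", "red", "car"),
                       to   = c("crashed", "blue", "bus"))
tab
tab[nrow(tab), ncol(tab)]  # bottom-right cell: 2 for this example
```

The bottom-right cell is the distance between the two lists; the rest of the table holds the partial distances between their prefixes.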

Thanks for the better description. This is by far easier to understand. I will be posting a response in a couple of minutes. :-) – Oliver, Feb 12 '19 at 14:55
@Oliver I tried to be clear :-P . Being clear is not where I shine. I will check your answer, but at first glance it looks great. Thank you. – Zero, Feb 12 '19 at 16:20
Glad I could help. Hopefully if someone knows a better answer they will provide it. Any questions can be asked as a comment on my answer. :-) – Oliver, Feb 12 '19 at 16:28

Oliver · Accepted Answer · 2019-08-13T18:27:35.493

The question

After some clarification in a previous post, and after the update of the post, my understanding is that Zero is asking: 'how one can iteratively count the number of word differences in two strings'.

I am unaware of any implementation in R, although I would be surprised if it doesn't already exist. I took a bit of time to create a simple implementation, altering the algorithm slightly for simplicity (anyone not interested can scroll down for two implementations: one in pure R, one using a small amount of Rcpp). The general idea of the implementation:

1. Initialize with string_1 and string_2 of length n_1 and n_2 (counted in words).
2. Calculate the cumulative difference between the first min(n_1, n_2) elements.
3. Use this cumulative difference as the diagonal in the matrix.
4. Set the first off-diagonal element to the very first element + 1.
5. Calculate the remaining off-diagonal elements as: diag(i) - diag(i-1) + full_matrix(i-1, j). Here i iterates over diagonals and j iterates over rows/columns (either one works), and we start at the third diagonal, as the first 2x2 matrix is filled in steps 1 to 4.
6. Calculate the remaining abs(n_1 - n_2) rows or columns as full_matrix[, min(n_1, n_2)] + 1:abs(n_1 - n_2), applying the latter over each value in the former, and bind them appropriately to the full_matrix.

The output is a matrix whose row and column names are the words of the corresponding strings, formatted for easier reading.

Implementation in R

library(stringr) #provides str_split(), used below

Dist_between_strings <- function(x, y, 
                                 split = " ", 
                                 split_x = split, split_y = split, 
                                 case_sensitive = TRUE){
  #Safety checks (length first, so the scalar checks below never see a vector)
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.character(x) || !is.character(y) || 
     nchar(x) == 0 || nchar(y) == 0)
    stop("x and y need to be non-empty character strings.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  #Extract variable names of our variables
  # used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  #Expression which when evaluated will name our output
  dimname_expression <- 
    parse(text = paste0("dimnames(output) <- list(",make.names(x_name, unique = TRUE)," = x_names,",
                        make.names(y_name, unique = TRUE)," = y_names)"))
  #split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  #are we case_sensitive? If not, compare the lower-cased words
  if(isTRUE(case_sensitive)){
    x_split <- x_names
    y_split <- y_names
  }else{
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }
  #Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  #If we have one string that has length 1, the output is simplified
  if(n_min == 1){ 
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  #If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  #The loop will fill in the off_diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1 
  if(n_min > 2) #check n_min: 3:n_min would count downwards if n_min were 2
    for(i in 3:n_min){
      for(j in 1:(i - 1)){
        output[i,j] <- output[j,i] <- output[i,i] - output[i - 1, i - 1] + #are the words different?
          output[i - 1, j] #How many words were different before?
      }
    }
  #comparison if the list is not of the same size
  if(nx != ny){
    #Add the remaining words to the side that does not contain this
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[,n_min])
    #merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  #set the dimension names, 
  # I would like the original variable names to be displayed, as such i create an expression and evaluate it
  eval(dimname_expression)
  output
}

Note that the implementation is not vectorized, and as such can only take single string inputs!
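Since the function only accepts single strings, several strings have to be fed to it one pair at a time. The sketch below shows two common patterns for that, pairwise and all combinations. To keep the block runnable on its own it uses `pair_distance`, a hypothetical and much simpler stand-in that only counts position-wise word mismatches; the same `Map`/`mapply` calls should work with any function of two single strings.

```r
# Hypothetical stand-in for a scalar distance function (NOT the function above):
# counts position-wise word mismatches plus the difference in word count.
pair_distance <- function(x, y) {
  xs <- strsplit(x, " ", fixed = TRUE)[[1]]
  ys <- strsplit(y, " ", fixed = TRUE)[[1]]
  n <- min(length(xs), length(ys))
  sum(xs[seq_len(n)] != ys[seq_len(n)]) + abs(length(xs) - length(ys))
}

vector_string1 <- c("hello kitty is green", "kitty cat is gray")
vector_string2 <- c("my lovely kitty is green", "kitty cat is blue")

# Pairwise: first with first, second with second, ...
# Map() returns a list with one result per pair.
pairwise <- Map(pair_distance, vector_string1, vector_string2)

# All combinations: stringsAsFactors = FALSE keeps the columns as character
# vectors (expand.grid() would otherwise turn them into factors).
grid <- expand.grid(String1 = vector_string1, String2 = vector_string2,
                    stringsAsFactors = FALSE)
all_pairs <- mapply(pair_distance, grid$String1, grid$String2, USE.NAMES = FALSE)
```

`Map` is used for the pairwise case because a function returning a matrix per pair cannot be simplified into a vector; `mapply` simplifies here only because the stand-in returns a single number.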

Testing the implementation

To test the implementation, one could use the strings given. As they were said to be contained in lists, we will have to convert them to strings. Note that the function lets one split each string differently, however it assumes space separated strings. So first I'll show how one could achieve a conversion to the correct format:

list_1 <- list("crashed","red","car")
list_2 <- list("crashed","blue","bus")
string_1 <- paste(list_1,collapse = " ")
string_2 <- paste(list_2,collapse = " ")
Dist_between_strings(string_1, string_2)

output

#Strings in the given example
         string_2
string_1  crashed blue bus
  crashed       0    1   2
  red           1    1   2
  car           2    2   2

This is not exactly the table from the question (it lacks the initial row and column of 0, 1, 2, 3), but it yields the same information, as the words are ordered as they were given in the string.

More examples

I stated that it works for other strings as well, so let's try some user-made strings:

#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)

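Running these shows that the function also handles strings of different lengths. Keep in mind that, because of the simplification, words are compared position by position, so the result can be larger than the minimum edit distance when a word is inserted or removed early in one string (for string_3 and string_4 the final cell is 4, whereas removing "not" and "whale" and replacing "blue" with "cat" takes only 3 actions). Note that if either string is of size 1, the comparison is a lot faster.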

Benchmarking the implementation

Now that the implementation is accepted as correct, we would like to know how well it performs (the uninterested reader can scroll past this section to where a faster implementation is given). For a complete benchmark I should test various string sizes, but for this purpose I will only use 2 rather large strings of 1000 and 2500 words. I use the microbenchmark package in R, which contains a microbenchmark function that claims to be accurate down to nanoseconds. The function executes the code 100 times (or a user-defined number of times), returning the mean and quartiles of the run times. Because other parts of R, such as the garbage collector, can inflate individual runs, the median is usually considered a good estimate of the typical run time of the function. The execution and results are shown below:

#Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS,1000,replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS,2500,replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr                   min      lq      mean   median       uq      max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959   100

Profiling

I find these run times very slow. One use case for the implementation could be an initial check of student hand-ins for plagiarism, in which case a low difference count very likely indicates plagiarism. Hand-ins can be very long and there may be hundreds of them, and as such I would like the run to be very fast. To figure out how to improve my implementation I used the profvis package with the corresponding profvis function. To profile the function I exported it to another R script, which I sourced, running the code once prior to profiling to compile the code and avoid profiling noise (important). The code to run the profiling can be seen below, and the most important part of the output is visualized in an image below it.

library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))

Now, despite the colour, I can see a clear problem here. The loop filling the off-diagonal is responsible for by far most of the run time. Loops in R (as in Python and other interpreted languages) are notoriously slow.
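Before moving to compiled code, one option worth considering is to vectorise the inner loop in plain R, since each pass over j only reads row i - 1 and writes row i and column i. The sketch below is an assumption-laden illustration rather than part of the original function: `fill_loop` and `fill_vectorised` are hypothetical helper names, and both expect a square matrix whose diagonal already holds the cumulative differences. How much time this saves would have to be measured.

```r
# Double loop, mirroring the off-diagonal fill in the function above
fill_loop <- function(output) {
  n <- nrow(output)
  if (n < 2) return(output)
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if (n > 2) for (i in 3:n) for (j in 1:(i - 1)) {
    output[i, j] <- output[j, i] <-
      output[i, i] - output[i - 1, i - 1] + output[i - 1, j]
  }
  output
}

# Same fill, but each row/column pair is written in one vectorised assignment
fill_vectorised <- function(output) {
  n <- nrow(output)
  if (n < 2) return(output)
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if (n > 2) for (i in 3:n) {
    cols <- seq_len(i - 1)
    # Did the i-th pair of words differ? (0 or 1)
    step <- output[i, i] - output[i - 1, i - 1]
    new_vals <- step + output[i - 1, cols]
    output[i, cols] <- new_vals
    output[cols, i] <- new_vals
  }
  output
}

# Check on a random mismatch pattern that both fills agree
set.seed(42)
mismatch <- sample(0:1, 50, replace = TRUE)
start <- diag(cumsum(mismatch))
identical(fill_loop(start), fill_vectorised(start))
```

This removes the inner loop only; the outer loop over i remains because each row depends on the previous one.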

Using Rcpp to improve performance

To improve the implementation, we could implement the loop in C++ using the Rcpp package. This is rather simple. The code is not unlike the one we would use in R, if we avoid iterators. In RStudio, a C++ script can be created via File -> New File -> C++ File. The following C++ code is pasted into the corresponding file and sourced using the Source button (or, from the console, with Rcpp::sourceCpp()).

//Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
  long nrow = output.nrow();
  for(long i = 2; i < nrow; i++){ // note the 0-based indexing: i = 2 is the third row, matching 3:n_min in R
    for(long j = 0; j < i; j++){
      output(i, j) = output(i, i) - output(i - 1, i - 1) + //are the words different?
                                  output(i - 1, j);
      output(j, i) = output(i, j);
    }
  }
  return output;
}

The corresponding R function needs to be altered to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the c++ function.

Dist_between_strings_cpp <- function(x, y, 
                                 split = " ", 
                                 split_x = split, split_y = split, 
                                 case_sensitive = TRUE){
  #Safety checks (length first, so the scalar checks below never see a vector)
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.character(x) || !is.character(y) || 
     nchar(x) == 0 || nchar(y) == 0)
    stop("x and y need to be non-empty character strings.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  #Extract variable names of our variables
  # used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  #Expression which when evaluated will name our output
  dimname_expression <- 
    parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE)," = x_names,",
                        make.names(y_name, unique = TRUE)," = y_names)"))
  #split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  #are we case_sensitive? If not, compare the lower-cased words
  if(isTRUE(case_sensitive)){
    x_split <- x_names
    y_split <- y_names
  }else{
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }
  #Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  #If we have one string that has length 1, the output is simplified
  if(n_min == 1){ 
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  #If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  #The loop will fill in the off_diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1 
  if(n_max > 2) 
    output <- Cpp_String_difference_outer_diag(output) #Execute the c++ code
  #comparison if the list is not of the same size
  if(nx != ny){
    #Add the remaining words to the side that does not contain this
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[,n_min])
    #merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  #set the dimension names, 
  # I would like the original variable names to be displayed, as such i create an expression and evaluate it
  eval(dimname_expression)
  output
}

Testing the c++ implementation

To be sure the implementation is correct we check if the same output is obtained with the c++ implementation.

#Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
          Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))
#TRUE

Final benchmarks

Now is this actually faster? To see this we could run another benchmark using the microbenchmark package. The code and results are shown below:

#Final microbenchmarking
microbenchmark::microbenchmark(R = Dist_between_strings(string_7, string_8, case_sensitive = FALSE),
                               Rcpp = Dist_between_strings_cpp(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr       min       lq      mean    median        uq       max neval
# R    721.71899 753.6992 850.21045 787.26555 907.06919 1756.7574   100
# Rcpp  23.90164  32.9145  54.37215  37.28216  47.88256  243.6572   100

The microbenchmark medians show an improvement factor of roughly 21 (= 787 / 37), which is a massive improvement from just moving a single loop to C++!

I know it's been a long time since you've answered this question, but I have a follow-up question: I'm trying to use `Dist_between_strings()` in a `for` loop. The objective is to use the function on a list of strings, but it does not work; however, it works very well on each entity of the list. Do you have any clue about this potential problem? – Zero, Aug 13 '19 at 07:34
I cannot be certain, but it is likely an indexing error. The functions only allow single strings as input. Let's say you wanted to compare 2 vectors of strings, doing a pairwise comparison (first element compared in each vector, second element compared... etc.), you could use `n <- length(vector_string1); output <- vector("list", n); for (i in seq(n)) output[[i]] <- Dist_between_strings_cpp(vector_string1[i], vector_string2[i])`. – Oliver, Aug 13 '19 at 09:59
If by contrast, you wanted every element to be compared to every element in the other vector, you could first use `expand.grid` to create every combination of strings, and then iterate through these combinations: `vector_string_grid <- expand.grid(String1 = vector_string1, String2 = vector_string2, stringsAsFactors = FALSE); n2 <- nrow(vector_string_grid); output2 <- vector("list", n2); for (i in seq(n2)) output2[[i]] <- Dist_between_strings_cpp(vector_string_grid[i, "String1"], vector_string_grid[i, "String2"])` – Oliver, Aug 13 '19 at 10:06
untested example data: `vector_string1 <- c("hello kitty is green","kitty cat is gray","cat holds farts"); vector_string2 <- c("my lovely kitty is green","kitty cat is blue","gray cat farts")` – Oliver, Aug 13 '19 at 10:06
Thanks for your reply, I have tested your suggestion for the pairwise comparison and I have received the same previous error : `Error in parse(text = paste0("dimnames(output) <- list(", x_name, " = x_names,", : :1:44: unexpected '=' 1: dimnames(output) <- list(vector_string1[i] = ^` I guess it does not like receiving vectors ;) – Zero, Aug 13 '19 at 16:33
That error message did the trick, and the fix is surprisingly simple (he said after an hour of searching, and then stumbling over it by accident in `data.frame`'s source code). The `x_name` and `y_name` simply aren't compatible with R's naming convention if they contain subsets. But! R's `make.names` function can fix that. Change the dimension naming part to `dimname_expression <- parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE), " = x_names,", make.names(y_name, unique = TRUE), " = y_names)"))`. I've edited my post to reflect this (might not be updated yet) – Oliver, Aug 13 '19 at 18:24

AkselA · Answer 2 · 2019-02-12T15:22:30.840

There is already an edit-distance function in R we can take advantage of: adist().

As it works on the character level, we'll have to assign a character to each unique word in our sentences, and stitch them together to form pseudo-words we can calculate the distance between.

s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")

ll <- list(s1, s2)

alnum <- c(letters, LETTERS, 0:9)

ll2 <- relist(alnum[factor(unlist(ll))], ll)

ll2 <- sapply(ll2, paste, collapse="")

adist(ll2)
#      [,1] [,2]
# [1,]    0    2
# [2,]    2    0

Main limitation here, as far as I can tell, is the number of unique characters available, which in this case is 62, but can be extended quite easily, depending on your locale. E.g: intToUtf8(c(32:126, 161:300), TRUE).
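The extension hinted at above can be sketched as a small helper that draws one symbol per unique word from a larger code-point range. This is a minimal sketch under assumptions: `word_adist` is a hypothetical name, not a base R function, and it assumes the Unicode Private Use Area (U+E000 to U+F8FF, 6400 code points) is an acceptable symbol pool for the vocabulary at hand.

```r
# Hypothetical helper: word-level edit distance via the character-level adist()
word_adist <- function(a, b) {
  vocab <- unique(c(a, b))
  # Assumption: one Private Use Area code point per unique word (6400 available)
  if (length(vocab) > 6400) stop("too many unique words for this symbol pool")
  symbols <- intToUtf8(0xE000 + seq_along(vocab) - 1, multiple = TRUE)
  names(symbols) <- vocab
  # Stitch the per-word symbols into one pseudo-word per sentence
  encode <- function(words) paste(symbols[words], collapse = "")
  as.vector(adist(encode(a), encode(b)))
}

word_adist(c("crashed", "red", "car"), c("crashed", "blue", "bus"))
```

As with the approach above, this gives only the final distance, not the table of partial distances.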

This does not seem to fit the actual description in the question asked. Zero seems to ask for iterative word distance, and adist counts the number of character changes to return the same string for each pair in the vector. Another implementation might exist, but in my answer below an implementation of the specific problem is also available. – Oliver Feb 12 '19 at 15:21
@AkselA I have already tried the adist() function, as you can see it does not give the value I am looking for. It compares the changes in characters. Anyway, thanks for your help. – Zero Feb 12 '19 at 16:01
@Zero: `adist()` won't give the partial distances, but the `2` in the `adist()` output is the same `2` as found at the end of your comparison matrix. Oliver's answer is far more complete, but `adist()` is still a partial solution, making this a valid answer, I think. – AkselA Feb 12 '19 at 17:02