
I am looking for a fast solution in R for computing the word-level edit distance between two sentences. More specifically, I want to determine the minimal number of word additions, substitutions or deletions needed to transform sentence A into sentence B. For example, if sentence A is "very nice car" and sentence B is "nice red car", the result should be 2 (1 deletion and 1 addition).

I know that there are existing solutions in R for character-level edit distance (e.g., the built-in adist() and stringdist() from the package 'stringdist'), but I found none for word-level distance.

Steven Beaupré
JackONeill

1 Answer


How about taking the intersection of the words in the two sentences:

intersect(strsplit("very nice car", " ")[[1]], strsplit("nice red car", " ")[[1]])

> [1] "nice" "car"

length(intersect(strsplit("very nice car", " ")[[1]], strsplit("nice red car", " ")[[1]]))

> [1] 2

Of course, you can make your own function that even works with a list:

my_function <- function(x, prsep = " ") {
    # Return empty input or a single NA unchanged.
    if (length(x) == 0 || isTRUE(is.na(x))) {
        return(x)
    }
    if (is.list(x)) {
        # Split the first string of every non-empty list element into words.
        for (i in seq_along(x)) {
            if (length(x[[i]]) != 0) {
                x[[i]] <- strsplit(x[[i]], prsep)[[1]]
            }
        }
        return(x)
    }
    # For an atomic vector, only the first element is split and returned.
    strsplit(as.character(x[1]), prsep)[[1]]
}

So you just need

intersect(my_function("very nice car"," "), my_function("nice red car"," "))

JARO
  • Unfortunately, the intersection of two sentences is not the same as the word-level distance. For instance, if `str1 <- "this red car has low consumption"` and `str2 <- "this nice red car has low mileage"`, the intersection of these two sentences contains 5 words, but the word-level distance is 2 (1 substitution and 1 addition). – JackONeill Mar 09 '15 at 09:26
  • That is true. However, the intersection and the function still help. After `inter <- intersect(my_function(str1, " "), my_function(str2, " "))`, the expression `my_function(str1, " ")[!(my_function(str1, " ") %in% inter)]` gives `> [1] "consumption"` and `my_function(str2, " ")[!(my_function(str2, " ") %in% inter)]` gives `> [1] "nice" "mileage"`. Thus you can take the larger of the two counts as a distance: `max(length(my_function(str1, " ")[!(my_function(str1, " ") %in% inter)]), length(my_function(str2, " ")[!(my_function(str2, " ") %in% inter)]))`, which is `> [1] 2`. – JARO Mar 09 '15 at 13:25
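
The intersection-based counts in this thread ignore word order, so they can disagree with a true edit distance when words are rearranged. As a minimal sketch of the quantity the question asks for, the standard Levenshtein dynamic programme can be run over word vectors instead of characters. The function name `word_edit_distance` and its `sep` argument are hypothetical, chosen for illustration only; the sketch assumes words are separated by single spaces and uses base R alone.

```r
# Word-level Levenshtein distance between two sentences (base R only).
# `word_edit_distance` and `sep` are illustrative names, not from the thread.
# Assumption: words are separated by exactly one `sep`; repeated separators
# would produce empty-string tokens.
word_edit_distance <- function(a, b, sep = " ") {
    # Tokenise each sentence into a character vector of words.
    x <- strsplit(a, sep, fixed = TRUE)[[1]]
    y <- strsplit(b, sep, fixed = TRUE)[[1]]
    n <- length(x)
    m <- length(y)

    # d[i + 1, j + 1] is the distance between the first i words of x
    # and the first j words of y.
    d <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
    d[, 1L] <- 0:n  # delete all i words of x
    d[1L, ] <- 0:m  # insert all j words of y

    for (i in seq_len(n)) {
        for (j in seq_len(m)) {
            cost <- if (x[i] == y[j]) 0L else 1L
            d[i + 1L, j + 1L] <- min(
                d[i, j + 1L] + 1L,  # delete x[i]
                d[i + 1L, j] + 1L,  # insert y[j]
                d[i, j] + cost      # keep or substitute
            )
        }
    }
    d[n + 1L, m + 1L]
}

word_edit_distance("very nice car", "nice red car")  # 2
word_edit_distance("this red car has low consumption",
                   "this nice red car has low mileage")  # 2
```

The double loop costs time proportional to the product of the two sentence lengths, which should be acceptable for ordinary sentences; whether it is fast enough for a large corpus would need to be measured.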