Determine (dis)similarity of multi-word strings on a word-by-word basis

Question

I'm working on string distance in multi-word strings, as in this toy data:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)

I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:

library(dplyr)
library(tidyr)
library(stringr)     # provides str_c()
library(stringdist)
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"), 
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N), 
              names_from = Col_N, 
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
     id col1         col2         distance
* <int> <chr>        <chr>        <chr>   
1     1 ab           ab bc        0, 2    
2     2 ab bc        yyyy         4, 4    
3     3 yyyy         yyyy pw hhhh 0, 4, 4 
4     4 yyyy pw hhhh wstjz        5, 5, 5 
5     5 wstjz        NA           NA

While the result seems to be okay, there are three problems with it: a) it produces a number of warnings, b) the code seems quite convoluted, and c) distance is of type character. So I'm wondering whether there's a better way to determine the (dis)similarity of strings word by word.

glagla · Answer 1 · 2021-10-22T09:47:15.767

2

A solution:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors=FALSE
)

comps = function(a.row){
  paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')), 
                   unlist(strsplit(as.character(a.row[2]), ' '))), 
        collapse = ' ')
  
}
df %>%
  mutate(col2 = lead(col1, 1)) %>%
  mutate(distance = apply(., 1, comps))

There should be a way to avoid as.character in the strsplit calls.
I'm not sure whether a data frame can hold a column of vectors, which might explain the warnings and the character type of distance. Here I cast the distances to a string so that all the values for a row stay in the same column.
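
A data frame can hold a list column, so the distances can stay numeric. Below is a minimal sketch on the toy data from the question. The helper name `word_dist` is hypothetical, and `mapply()` is used instead of `apply()` so that the strings are passed through unchanged and no `as.character()` is needed:

```r
library(stringdist)

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# next row's string; the last row has no successor, so it gets NA
df$col2 <- c(df$col1[-1], NA_character_)

# hypothetical helper: split both strings on spaces and compare word by word
word_dist <- function(a, b) {
  stringdist(strsplit(a, " ", fixed = TRUE)[[1]],
             strsplit(b, " ", fixed = TRUE)[[1]])
}

# walk the two columns in parallel; the result is a list of numeric vectors
df$distance <- mapply(word_dist, df$col1, df$col2,
                      SIMPLIFY = FALSE, USE.NAMES = FALSE)
```

Because each element is a numeric vector, a per-row summary such as `sapply(df$distance, max)` works without parsing strings.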

edited Oct 22 '21 at 09:47

answered Oct 22 '21 at 08:52


Cool answer! Thanks. – Chris Ruehlemann Oct 22 '21 at 09:09
Your solution based on the original toy data in the question, recast in `dplyr` syntax: `df %>% mutate(col2 = lead(col1, 1), distance = apply(., 1, comps))` If you wish you can add this to your post. – Chris Ruehlemann Oct 22 '21 at 09:18

score 1 · Answer 2 · answered Oct 22 '21 at 09:31


How about something like this:

mydf <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf


library(dplyr)
library(stringdist)
mydf %>% 
  mutate(col1_lead = lead(col1)) %>% 
  apply(1, function(x){
    stringdist(
      unlist(strsplit(x["col1"], " ")), 
      unlist(strsplit(x["col1_lead"], " "))
    )}
  ) %>% 
  cbind() %>% 
  `colnames<-`("distance") %>% 
  cbind(mydf)


DS_UNI


Oops! I just noticed that my answer is pretty similar to [that](https://stackoverflow.com/a/69674029/4905565) of [glagla](https://stackoverflow.com/users/10145118/glagla) – DS_UNI Oct 22 '21 at 09:34
I guess the difference is that mine adds the results as a column holding a list of numeric vectors – DS_UNI Oct 22 '21 at 09:36

cuttlefish44 · Answer 3 · 2021-10-22T09:19:16.797

Below is my simple, honest idea.
I make list-columns holding the words and calculate the distance row by row with unlist (because stringdist needs a vector), and I keep the distance as a list-column.

library(stringr)   # provides str_split()

ans <- df %>%
  as_tibble() %>% 
  mutate(id = row_number(),   # not used below
         col2 = lead(col1, 1),
         sep_col1 = str_split(col1, " "),
         sep_col2 = str_split(col2, " ")) %>%    # or str_split(lead(col1, 1), " ")
  rowwise() %>% 
  mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
         for_just_look = paste(dist, collapse = ", ")) %>% 
  ungroup()

ans

#   col1            id col2         sep_col1  sep_col2  dist      for_just_look
#   <chr>        <int> <chr>        <list>    <list>    <list>    <chr>
# 1 ab               1 ab bc        <chr [1]> <chr [2]> <dbl [2]> 0, 2    
# 2 ab bc            2 yyyy         <chr [2]> <chr [1]> <dbl [2]> 4, 4    
# 3 yyyy             3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4 
# 4 yyyy pw hhhh     4 wstjz        <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5 
# 5 wstjz            5 NA           <chr [1]> <chr [1]> <dbl [1]> NA
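
Rows whose two strings have different numbers of words still produce a result because `stringdist()` recycles the shorter argument, as in the `0, 4, 4` row above. A small sketch of that behaviour, using values from the toy data (the variable names are illustrative):

```r
library(stringdist)

# one word against three words: "yyyy" is reused for every comparison
recycled <- stringdist("yyyy", c("yyyy", "pw", "hhhh"))

# the same comparison with the single word repeated explicitly
explicit <- stringdist(rep("yyyy", 3), c("yyyy", "pw", "hhhh"))

recycled
explicit
```

So the second and third values compare `yyyy` with `pw` and `hhhh`. If unmatched words should count as missing instead, one option would be to pad the shorter vector with `NA` before calling `stringdist()`.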

Merijn van Tilborg · Answer 4 · 2021-10-22T10:52:37.260

Leaving aside my comments below, the straightforward version would be this.

library(data.table)
library(dplyr)       # lead()
library(stringr)     # str_split()
library(stringdist)
setDT(df)

df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]

Be careful with any stringdist-like function: on a huge dataset, making all the comparisons is quite expensive. Also keep in mind what you are going to use the distances for. Are you truly interested in the distance itself, or only in all pairs with a distance < x? If the latter, you most likely would not consider a to be a close match for axxxxxxxxxxxxxxx, and you can already see that from the difference in string length, which takes far fewer resources to calculate than the actual distance.
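
The length shortcut can be sketched as follows. It relies on the fact that an edit distance is never smaller than the difference in string lengths, so pairs whose lengths differ by at least the threshold can be skipped. The helper name `close_enough` and the threshold `max_dist` are hypothetical, and the sketch assumes two equal-length character vectors without `NA`:

```r
library(stringdist)

# hypothetical helper: TRUE where the distance between a[i] and b[i] is below max_dist
close_enough <- function(a, b, max_dist) {
  # cheap lower bound: the distance is at least the difference in lengths
  len_gap <- abs(nchar(a) - nchar(b))
  candidate <- len_gap < max_dist

  result <- rep(FALSE, length(a))
  # only the remaining candidate pairs need the expensive computation
  result[candidate] <- stringdist(a[candidate], b[candidate]) < max_dist
  result
}

a <- c("a", "ab", "yyyy")
b <- c("axxxxxxxxxxxxxxx", "ac", "hhhh")
flags <- close_enough(a, b, max_dist = 2)
flags
```

Here the first pair is rejected on length alone, and only the other two pairs reach `stringdist()`.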

Also, it would be a waste of computation to blindly compute row by row. Let's make a slightly longer sample set.

c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")

Here you would calculate the distance between yyyy and yyyy 3x, which should be done only once (actually, you should catch those with an "is equal" check first), and likewise 3x for yyyy and hhhh / hhhh and yyyy.

With a small dataset you probably do not have to worry, but with large sets and longer strings... it can become messy / slow pretty fast.
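
One way to avoid the repeated work is to compute each distinct word pair once and look the result up afterwards. Below is a rough sketch, assuming the words have already been paired up. The vectors `w1` and `w2` are made-up sample pairs, and the key construction is illustrative; it treats (a, b) and (b, a) as the same pair, which holds for `stringdist()` with its default weights:

```r
library(stringdist)

# word pairs as they might come out of a row-by-row comparison (made-up sample)
w1 <- c("yyyy", "yyyy", "hhhh", "yyyy", "wstjz")
w2 <- c("yyyy", "hhhh", "yyyy", "yyyy", "wstjz")

# order each pair so that (a, b) and (b, a) share one key
swap <- w1 > w2
lo <- ifelse(swap, w2, w1)
hi <- ifelse(swap, w1, w2)
key <- paste(lo, hi)   # a space is safe: the words were split on spaces

# keep only the first occurrence of each distinct pair
first <- !duplicated(key)
u_lo <- lo[first]
u_hi <- hi[first]

# identical words keep distance 0; only differing pairs reach stringdist()
d_unique <- numeric(length(u_lo))
differ <- u_lo != u_hi
d_unique[differ] <- stringdist(u_lo[differ], u_hi[differ])
names(d_unique) <- key[first]

# map the cached values back onto all pairs
distance <- unname(d_unique[key])
```

In this sample, five pairs reduce to three distinct keys, and only one of them needs an actual distance computation.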
