I'm working on string distance in multi-word strings, as in this toy data:
df <- data.frame(
col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:
library(dplyr)
library(tidyr)
library(stringdist)
library(stringr)  # provides str_c(), used in summarise() below
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"),
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N),
              names_from = Col_N,
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
     id col1         col2         distance
* <int> <chr>        <chr>        <chr>
1     1 ab           ab bc        0, 2
2     2 ab bc        yyyy         4, 4
3     3 yyyy         yyyy pw hhhh 0, 4, 4
4     4 yyyy pw hhhh wstjz        5, 5, 5
5     5 wstjz        NA           NA
While the result seems to be okay, there are three problems with it: a) the code throws a number of warnings, b) the code seems quite convoluted, and c) distance is of type character rather than numeric. So I'm wondering: is there a better way to determine the (dis)similarity of strings word by word?
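For comparison, here is a minimal sketch of a more direct route that I would expect to give the same distances: split each string into words with base R `strsplit()`, then call `stringdist()` once per pair of neighbouring rows via `Map()`. It assumes that recycling the shorter word vector (which `stringdist()` does for arguments of unequal length) is the intended word-by-word pairing, and it keeps the default distance method. The names `words` and `next_words` are made up for illustration.

```r
library(stringdist)

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)

# split every row into a character vector of words
words <- strsplit(df$col1, "\\s+")

# words of the following row; the last row has no successor, so use NA
next_words <- c(words[-1], list(NA_character_))

# the following row's string, kept only for display
df$col2 <- c(df$col1[-1], NA)

# one numeric vector of distances per row, stored as a list column;
# stringdist() recycles the shorter of the two word vectors
df$distance <- Map(stringdist, words, next_words)

df$distance[[3]]  # distances for "yyyy" vs. "yyyy pw hhhh"
```

Because `distance` is a list column of numeric vectors, the values stay numeric and can be summarised per row, for example with `sapply(df$distance, mean)`.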