library(dplyr)
library(fuzzyjoin)
df1 <- data.frame(x = c("Socks", "Mouse"))
df2 <- data.frame(y = c("Sock", "House"))
stringdist_left_join(df1, df2,
                     by = c(x = "y"),
                     max_dist = 1,
                     ignore_case = TRUE,
                     distance_col = "distance")
Output:
      x     y distance
1 Socks  Sock        1
2 Mouse House        1
Both comparisons (Socks vs. Sock and Mouse vs. House) get the same distance of 1. So far so good. But now I want to match the words on a semantic level. In the first comparison (Socks vs. Sock) the only difference is plural vs. singular, and I would count this as a match. In the second comparison (Mouse vs. House), however, the two words have different meanings, so I don't want to count it as a match. Any suggestions on how I could add a further column (e.g., "match") that is TRUE in the first row (Socks vs. Sock) and FALSE in the second row (Mouse vs. House)?
Is there a way to indicate that I want to ignore suffixes? I am thinking of something similar to "ignore_case = TRUE" (see code).
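To illustrate the kind of result I am after, here is a rough sketch in base R. It is only a heuristic, not a semantic comparison: it assumes that two words "match" when one is a prefix of the other and the leftover suffix is short. The helper name differs_only_by_suffix and the max_suffix cutoff are made up for this illustration, and the joined result is re-created by hand so the snippet runs on its own.

```r
# Joined result from above, re-created by hand so the sketch is self-contained
res <- data.frame(x = c("Socks", "Mouse"),
                  y = c("Sock", "House"),
                  distance = c(1, 1))

# Hypothetical helper: TRUE when the two words differ only by a short
# trailing suffix (case-insensitive). max_suffix is an assumed cutoff.
differs_only_by_suffix <- function(a, b, max_suffix = 2) {
  a <- tolower(a)
  b <- tolower(b)
  one_is_prefix <- startsWith(a, b) | startsWith(b, a)
  short_suffix  <- abs(nchar(a) - nchar(b)) <= max_suffix
  one_is_prefix & short_suffix
}

res$match <- differs_only_by_suffix(res$x, res$y)
res
```

This gives TRUE for Socks vs. Sock and FALSE for Mouse vs. House, but it would miss German plurals that change the stem (e.g., Haus vs. Häuser) and would wrongly accept unrelated words that happen to share a prefix, which is why I am asking for something more robust.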
I have a long data set with German words. I would prefer a solution that does not require the use of dictionaries (i.e., a solution applicable to more use cases). However, if there is no way around it, I would appreciate details about how to use a German dictionary for my problem.