library(dplyr)
library(fuzzyjoin)

df1 <- data.frame(x = c("Socks", "Mouse"))
df2 <- data.frame(y = c("Sock", "House"))

stringdist_left_join(df1, df2,
                     by = c(x = "y"),
                     max_dist = 1,
                     ignore_case = TRUE,
                     distance_col = "distance")

Output:

      x     y distance
1 Socks  Sock        1
2 Mouse House        1

Both comparisons (Socks vs. Sock and Mouse vs. House) give the same distance. So far so good. But now I want to match the words on a semantic level. In the first comparison (Socks vs. Sock) the only difference is plural vs. singular, and I would count this as a match. In the second comparison (Mouse vs. House), however, the two words have different meanings, so I don't want to count it as a match. How could I add a further column (e.g., "match") that is TRUE in the first row (Socks vs. Sock) and FALSE in the second row (Mouse vs. House)?

Is there a way to indicate that I want to ignore suffixes? I am thinking of something similar to "ignore_case = TRUE" (see code).

I have a long data set with German words. I would prefer a solution that does not require the use of dictionaries (i.e., a solution applicable to more use cases). However, if there is no way around it, I would appreciate details about how to use a German dictionary for my problem.

Mirela
  • What you are describing with your Socks/Sock example is called [stemming](https://en.wikipedia.org/wiki/Stemming). One idea would be to stem the words first and then join them using a non-fuzzy join. An example package is [corpus](https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html) (I have never used it, but it seems to do what you want). – Bas Aug 13 '20 at 20:03
  • @Bas, thank you! Your suggestion actually did solve this specific problem. – Mirela Dec 01 '20 at 16:53
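
The stemming idea from the comment above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses `wordStem()` from the SnowballC package rather than the corpus package Bas linked, it keeps the fuzzy join from the question and adds the stem comparison as an extra `match` column, and it uses the English stemmer because the toy data is English. For the German data set, `language = "german"` would presumably be the setting to try.

```r
library(dplyr)
library(fuzzyjoin)
library(SnowballC)  # assumption: swapped in for the corpus package mentioned above

df1 <- data.frame(x = c("Socks", "Mouse"))
df2 <- data.frame(y = c("Sock", "House"))

# Same fuzzy join as in the question
joined <- stringdist_left_join(df1, df2,
                               by = c(x = "y"),
                               max_dist = 1,
                               ignore_case = TRUE,
                               distance_col = "distance")

# Flag a pair as a match only when the lower-cased stems are identical
result <- joined %>%
  mutate(match = wordStem(tolower(x), language = "english") ==
                 wordStem(tolower(y), language = "english"))

print(result)
```

No dictionary is involved here, since Snowball stemmers are rule-based. The trade-off is that stems are not always real words and irregular forms may not collapse to the same stem.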

1 Answer


Calling stringdist_left_join (or another of fuzzyjoin's stringdist_*_join functions) with method = "soundex" might help. The different methods compute different distances, and you can then set a maximum distance, but it will be hard to get perfect matches in all cases.

You might find this helpful too: https://cran.r-project.org/web/packages/fuzzyjoin/vignettes/stringdist_join.html
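
To see what the Soundex method does with the two example pairs, here is a small sketch that calls the stringdist package directly (the package that fuzzyjoin's string-distance joins build on). The `pairs` data frame and the `match` column are names made up for this illustration. Keep in mind that Soundex was designed for English names, so how well it suits German words is an open question.

```r
library(stringdist)

# Hypothetical data frame holding the two word pairs from the question
pairs <- data.frame(x = c("Socks", "Mouse"),
                    y = c("Sock", "House"),
                    stringsAsFactors = FALSE)

# With method = "soundex" the distance is 0 when both words get the
# same Soundex code and 1 otherwise
pairs$distance <- stringdist(pairs$x, pairs$y, method = "soundex")
pairs$match <- pairs$distance == 0

print(pairs)
```

Soundex keeps the first letter and encodes the consonants that follow, which is why Socks/Sock share a code while Mouse/House differ in their first letter.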

Arthur Yip