3

df1

enter image description here

df2

enter image description here

df3

enter image description here

library(dplyr)
library(fuzzyjoin)
df1  <- tibble(a =c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog"))
df2  <- tibble(b =c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog"),
               c = c("Fruit", "Clothes", "Animals"))
# Appends 'Animals'
df3 <-  regex_left_join(df1,df2, c("a" = "b"))
# Appends Nothing
df3 <-  stringdist_left_join(df1, df2,  by = c("a" = "b"), max_dist = 3, method = "lcs")

I want to append column c of df2 to df1 using the strings, 'Apple', 'Sock' and 'Mouse Dog'.

I tried doing this with regex_join and fuzzyjoin but the order of the string seems to matter, and can't seem to find a way around it.

rsylatian
  • 429
  • 2
  • 14

1 Answers1

4

regex_left_join works, but it isn't just looking for any similarities. As it says in the description,

Join a table with a string column by a regular expression column in another table

So, we need to provide a regex pattern. If df2$b contains separate words of interest, we may do

(df2$regex <- gsub(" ", "|", df2$b))
# [1] "Kiwi|Lemon|Apple" "Shirt|Sock|Glove" "Mouse|Dog"      

and then

regex_left_join(df1, df2, by = c(a = "regex"))[-ncol(df1) - ncol(df2)]
# A tibble: 3 x 3
#   a                 b                c      
#   <chr>             <chr>            <chr>  
# 1 Apple Pear Orange Kiwi Lemon Apple Fruit  
# 2 Sock Shoe Hat     Shirt Sock Glove Clothes
# 3 Cat Mouse Dog     Mouse Dog        Animals

where -ncol(df1) - ncol(df2) simply drops the last column containing the regex patterns.

Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
  • Thanks, just about to accept it. Just a quick question first: I substituted the '|' with a comma and it didn't work -- it's the or operator which allows this to work, yes? – rsylatian Jan 07 '19 at 18:40
  • 1
    @rsylatian, indeed! The pipe "|" stands for the OR operator in regex. So "Mouse|Dog" means: match "Mouse" or "Dog". Due to this possibility to write regex patterns, this is indeed a powerful function and could also handle more complicated cases. – Julius Vainora Jan 07 '19 at 18:42