1

Is there a way I can partially match the two data frames in R?

df1<-data.frame("FIDELITY FREEDOM 2015 FUND", "ID")

df2<-data.frame("FIDELITY ABERDEEN STREET TRUST: FIDELITY FREEDOM 2015 FUND", 2020)

I want to merge df1 and df2 as df

df<-data.frame("FIDELITY FREEDOM 2015 FUND", "ID", 2020)

The string of df1 is part of the string of df2. So I am thinking of using fuzzy matching.

mhovd
  • 3,724
  • 2
  • 21
  • 47
Jane
  • 91
  • 4
  • 3
    There is a package `fuzzyjoin` that allows you to specify a custom regex that will be used for the maching; [https://cran.r-project.org/web/packages/fuzzyjoin/index.html](https://cran.r-project.org/web/packages/fuzzyjoin/index.html) – dario Apr 25 '22 at 17:05

1 Answers1

0

Following @dario's suggestion:

library(dplyr)
library(fuzzyjoin)

df1<-data.frame(x = "FIDELITY FREEDOM 2015 FUND", y = "ID")

df2<-data.frame(x = "FIDELITY ABERDEEN STREET TRUST: FIDELITY FREEDOM 2015 FUND", z = 2020)

fuzzy_join(df1, df2, match_fun = \(x,y) grepl(x, y)) %>% 
  select(!ends_with(".y")) %>% rename(x = x.x)

#> Joining by: "x"
#>                            x  y    z
#> 1 FIDELITY FREEDOM 2015 FUND ID 2020
PaulS
  • 21,159
  • 2
  • 9
  • 26
  • Thank you! When I run your codes, it shows: Error: unexpected input in "fuzzy_join(df1, df2, match_fun = \". I'm wondering what's the reason for that? – Jane Apr 25 '22 at 17:47
  • 1
    Try, instead of `match_fun = \(x,y)`, `match_fun = function(x,y)`. – PaulS Apr 25 '22 at 17:49
  • Hi Paul, I have two large datasets to merge and they both have multiple rows. When I try fuzzy_join on them: fuzzy_join(df1, df2, by="x", match_fun = function(x,y) grepl(x, y)), they give me warning, "In grepl(x, y) : argument 'pattern' has length > 1 and only the first element will be used". May I ask what's the problem here? I guess it is because x is a column instead of a single value? Do you have any idea about how to fix it? Thank you!! – Jane Apr 26 '22 at 04:23
  • To give an example, df1<-tibble(x=c("FIDELITY FREEDOM 2015 FUND", "VANGUARD WELLESLEY INCOME FUND"),y=c(1,2)) df2<-tibble(x=c("FIDELITY ABERDEEN STREET TRUST: FIDELITY FREEDOM 2015 FUND", "VANGUARD/WELLESLEY INCOME FUND, INC: VANGUARD WELLESLEY INCOME FUND; INVESTOR SHARES"),z=c(2020,2021)) – Jane Apr 26 '22 at 04:24
  • Jane, you can use instead: `library(tidyverse) library(fuzzyjoin) fuzzy_join(df1, df2, match_fun = \(x,y) str_detect(y, x), by = "x") %>% select(!ends_with(".y")) %>% rename(x = x.x)` – PaulS Apr 26 '22 at 06:26