I have two data frames, each with a string column that can contain literally any characters in various formats, and I would like to match rows on those columns.

library(stringr)
library(fuzzyjoin)

x <- data.frame(idX=1:3, string=c("silver", "30BEDJE202AA", "30BEDJE2027"))
y <- data.frame(idY=letters[1:3], seed=c("sliver", "30BEDJE202ABC", "30BEDJE2027BL"))
x$string = as.character(x$string)
y$seed = as.character(y$seed)

x %>% fuzzy_left_join(y, by = c(string = "seed"), match_fun = str_detect)

Here is the result I get when running the above code:

  idX       string  idY seed
1   1       silver <NA> <NA>
2   2 30BEDJE202AA <NA> <NA>
3   3  30BEDJE2027 <NA> <NA>

And this is what I would like to have:

  idX       string idY          seed
1   1       silver   a        sliver
2   2 30BEDJE202AA   b 30BEDJE202ABC
3   3  30BEDJE2027   c 30BEDJE2027BL

Is there a way to get there?

Romain
  • Why should string 30BEDJE2027 match to seed 30BEDJE2027BL and not 30BEDJE202? – Aron Strandberg Feb 13 '20 at 09:29
  • Thanks to both of you. Valid point Aron. I modified the original post to remove any ambiguity. – Romain Feb 13 '20 at 10:21
  • @Romain then as tifu said: `stringdist_left_join(x, y, by = c(string = "seed"), max_dist = 2)` – Tensibai Feb 13 '20 at 10:29
  • Indeed, it works well. Thanks! – Romain Feb 13 '20 at 10:55
  • (**removed old comment because of critical typo**): You could play around with the different methods in `fuzzyjoin::stringdist_join()`, but I doubt that will get you your results, since the similarity between idX == 2 and idY == 3 simply seems to be higher than between idX == 2 and idY == 2, regardless of which method is used to calculate the distance – tifu Feb 13 '20 at 11:34
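
For reference, the suggestion from the comments can be sketched as below, using the edited sample data from the question. This is a minimal sketch, not a definitive answer: `method = "osa"` is written out explicitly here (it is assumed to be the default), and the column name `dist` passed to `distance_col` is an illustrative choice that is not part of the original post.

```r
library(fuzzyjoin)

# Sample data from the question; stringsAsFactors = FALSE keeps the
# columns as character, so no as.character() conversion is needed.
x <- data.frame(idX = 1:3,
                string = c("silver", "30BEDJE202AA", "30BEDJE2027"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("sliver", "30BEDJE202ABC", "30BEDJE2027BL"),
                stringsAsFactors = FALSE)

# Join on approximate string equality instead of regex containment.
# max_dist = 2 keeps pairs whose edit distance is at most 2.
# distance_col = "dist" (illustrative name) adds the computed distance
# to the result so the threshold can be inspected and tuned.
matched <- stringdist_left_join(
  x, y,
  by = c(string = "seed"),
  method = "osa",
  max_dist = 2,
  distance_col = "dist"
)

print(matched)
```

The threshold matters: a larger `max_dist` would let a row of `x` match more than one seed, so it is worth checking the `dist` column on real data before settling on a value.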

0 Answers