0

For instance:

a<-c("This sentence has San-Francisco","This one has london","This one has newYork")
b<-c(10,20,30)

data1<-as.data.frame(cbind(a,b))

c<-c("San Francisco","London", "New York")
d<-c(100,2050,100)

data2<-as.data.frame(cbind(c,d))

So I want to merge data 1 with data 2, specifically by matching column a with column c. The issue is that the names of the cities are differently spelled and the sentence often contains the name of the city in different points. I've tried used the fuzzjoin package but I get extremely few matches. Is there a way to automate this? Essentially I'd like to get

This

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87

1 Answers1

0

You can clean up the data to make things easier, here using stringr (there are many possible approaches to this):

What I'm doing here is removing all punctuation, capitalization and spacing from a and then doing the same to c. With the strings in a and c thus simplified it's easier to extract matches between them (my variable city) and join.

library(stringr)
library(dplyr)
library(purrrr)
a <-
  c(
    "This sentence has San-Francisco",
    "This one has london",
    "This one has newYork",
    "Here also San Francisco"
  )
a_test <- str_replace_all(a, " ", "")
a_test <- str_replace_all(a_test, "[:punct:]", "")
a_test <- str_to_lower(a_test)

b <- c(10, 20, 30, 40)

c <- c("San Francisco", "London", "New York")
c_test <- str_replace_all(c, " ", "")
c_test <- str_to_lower(c_test)

d <- c(100, 2050, 100)

city <- map(a_test, str_extract, c_test) %>%
  unlist() %>%
  na.omit()

data1 <- as.data.frame(cbind(a, city, b))

data2 <- as.data.frame(cbind(c, c_test, d))

inner_join(data1, data2, by = c("city" = "c_test")) %>%
  dplyr::select(a, b, c, d)
1 This sentence has San-Francisco 10 San Francisco  100
2             This one has london 20        London 2050
3            This one has newYork 30      New York  100
4         Here also San Francisco 40 San Francisco  100
Greg
  • 3,570
  • 5
  • 18
  • 31
  • This is a good approach. Thank you so much. I have one small question about the "str_extract" step. My two dataframes are of different lengths. Is there a way to extract matching patterns from longer strings when the pattern is a different length? – Syed Mohammad Ali Abidi Jul 25 '20 at 21:39
  • There is, yes, using `purrr::map`. I've edited the answer to show, with `a` now having four sentences. – Greg Jul 26 '20 at 02:05
  • This is excellent! Thank you so much. One last question, if you're able to help. I can't seem to find a one to one version of map(). In the sense that I only want one guess per column. Is there a way to tidy that up? – Syed Mohammad Ali Abidi Jul 27 '20 at 01:01