I have two datasets of names: one with the exact names, and the other with a mix of exact and modified names.

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

I was thinking of building a table with dt_t in rows and dt_f in columns, filled with the value of the jarowinkler function (which measures the similarity of two strings), so that each dt_f[i] is replaced by the dt_t entry that has the highest jarowinkler value.

But I don't know how to do it, much less whether it's possible.

Any ideas are welcome.

Thanks

P. Vauclin
  • You might want to have a look at the [adist](https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/adist) function. – ismirsehregal Nov 14 '18 at 22:35
  • Running first `rvest::repair_encoding(c("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))` will give you `[1] "Özil Mesut" "Ramsey Aaron" "Kun Agüero"` which might help you get better matches. – moodymudskipper Nov 14 '18 at 23:25

1 Answer

Here is a solution using adist:

library(data.table)

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

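# Rows of string_dist correspond to dt_t, columns to dt_f
string_dist <- adist(dt_t$Name, dt_f$Name, partial=TRUE, ignore.case=TRUE)

# For each dt_t row, the index of the closest dt_f name
match_idx <- apply(string_dist, 1, which.min)

dt_match <- cbind(dt_t, dt_f[match_idx])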
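The sample names differ in word order and in accented characters, which inflates the edit distance. A minimal base R sketch of one way to reduce that before calling adist is shown here. The helper `normalize_name` and its character mapping (Ö to OE, ü to u) are assumptions chosen for this sample data, not part of the answer above, and plain character vectors are used instead of data.table list columns.

```r
exact    <- c("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO")
modified <- c("\u00d6zil Mesut", "Ramsey Aaron", "Kun Ag\u00fcero")

# Hypothetical helper: transliterate, lowercase, and sort the words,
# so that word order and accents no longer affect the distance.
normalize_name <- function(x) {
  x <- gsub("\u00d6", "OE", x, fixed = TRUE)  # assumed mapping for this data
  x <- gsub("\u00f6", "oe", x, fixed = TRUE)
  x <- gsub("\u00fc", "u", x, fixed = TRUE)
  x <- tolower(x)
  words <- strsplit(x, " ", fixed = TRUE)
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

# Rows correspond to the exact names, columns to the modified names
d <- adist(normalize_name(exact), normalize_name(modified))

# For each modified name (column), the row index of the closest exact name
best <- apply(d, 2, which.min)

# Replace each modified name by its closest exact name
corrected <- exact[best]
data.frame(modified, corrected)
```

On the sample data this pairs every modified name with its exact counterpart. With real data the mapping inside `normalize_name` would need to be extended, and ties or poor matches would need separate handling.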

Edit ---------------------------------

Applying it row-wise:

library(data.table)

dt_t <- data.table(Name = list("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO"))
dt_f <- data.table(Name = list("Özil Mesut", "Ramsey Aaron", "Kun Agüero"))

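# Return the element of y closest to the single name x (as a list of length 1)
minDistMatch <- function(x, y){
  x <- as.list(x)
  y <- as.list(y)
  y[which.min(adist(x, y, partial=TRUE, ignore.case=TRUE))]
}

# Match is a list column holding the closest dt_f name for each dt_t row
dt_t[, Match := vapply(Name, minDistMatch, list(1L), dt_f$Name)]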
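Since the question asks about Jaro-Winkler specifically, the same pick-the-best-score idea can be sketched with that metric. This sketch assumes the stringdist package is installed and uses its `stringdistmatrix` function with `method = "jw"`, which is a swap for the jarowinkler function named in the question. Note that stringdist returns a distance (0 means identical), so the best match is the minimum rather than the maximum.

```r
# Assumes the stringdist package is available; the block is skipped otherwise.
if (requireNamespace("stringdist", quietly = TRUE)) {
  exact    <- c("Aaron RAMSEY", "Mesut OEZIL", "Sergio AGUERO")
  modified <- c("\u00d6zil Mesut", "Ramsey Aaron", "Kun Ag\u00fcero")

  # Rows correspond to the modified names, columns to the exact names.
  # p = 0.1 is the usual Winkler prefix weight; p = 0 gives plain Jaro.
  jw <- stringdist::stringdistmatrix(tolower(modified), tolower(exact),
                                     method = "jw", p = 0.1)

  # For each modified name (row), the column index of the closest exact name
  best <- apply(jw, 1, which.min)

  data.frame(modified,
             corrected = exact[best],
             distance  = jw[cbind(seq_along(best), best)])
}
```

Jaro-Winkler is sensitive to word order, so names such as "Ramsey Aaron" may score poorly against "Aaron RAMSEY" unless the words are reordered first. Keeping the distance column makes it possible to review or reject weak matches.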
ismirsehregal
  • Maybe applying it row-wise helps out for more data? Please see my edit. Also have a look at [agrep](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/agrep)/agrepl. – ismirsehregal Nov 15 '18 at 10:49