
I have two data frames, one containing partial names and the other containing full names, as follows:

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF",
                                         "wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
                                   "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))

In an ideal world, I would love to have a table like this (my real partial df has 12,794 rows):

print(partial)
partial      full
Apple        Apple Inc
Apple        Apple Inc
WWF          World Wide Fundation (WWF)
wizz air     wizzair
WeMove.eu    We Move Europe
...          (12,794 rows in total)

For every row without a match, I would like the value to be NA.

I have tried many things: fuzzyjoin with regex, regex_left_join, even the sqldf package. I get some results, but I know they would be better if regex_left_join understood that I am looking for whole words. I know that stringr has boundary(type = "word"), but I do not know how to implement it.
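
For reference, boundary(type = "word") tokenizes a string into words, e.g. (assuming stringr is loaded; inputs taken from the sample data above):

library(stringr)
# boundary("word") splits on word boundaries and drops punctuation
str_split("WeMove.eu", boundary("word"))[[1]]
# [1] "WeMove" "eu"
str_split("World Wide Fundation (WWF)", boundary("word"))[[1]]
# [1] "World" "Wide" "Fundation" "WWF"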

For now, I have just prepared the partial df by stripping the non-alphanumeric characters and lowercasing everything:

# collapse non-word characters into single spaces, trim, then lowercase
partial$regex <- str_squish(str_replace_all(partial$partial.name, regex("\\W+"), " "))
partial$regex <- tolower(partial$regex)

How can I match partial$partial.name with full$full.name based on the maximum number of words in common?


1 Answer


Partial string matching is time-consuming to get right. I believe the Jaro-Winkler distance is a good candidate, but you will need to spend time tweaking its parameters. Here's an example to get you going.

library(stringdist)

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", "None"), stringsAsFactors = FALSE)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = FALSE)

mydist <- function(partial, list_of_fulls, method = "jw", p = 0, threshold = 0.4) {
    # stringdist() is vectorized, so one call gives the distance from the
    # partial name to every candidate full name
    distances <- stringdist(a = partial, b = list_of_fulls, method = method, p = p)
    # If even the closest candidate is too far away, assume there is no match
    if (min(distances) > threshold) {
        NA
    } else {
        list_of_fulls[which.min(distances)]
    }
}

# find the best match (or NA) for each partial name
partial$match <- unlist(lapply(partial$partial.name, function(x)
    mydist(partial = x, list_of_fulls = full$full.name, method = "jw")))

partial
#  partial.name                       match
#1        Apple                   Apple Inc
#2        Apple                   Apple Inc
#3          WWF World Wide Foundation (WWF)
#4     wizz air                     wizzair
#5    WeMove.eu              We Move Europe
#6          ILU                       (ILU)
#7         None                        <NA>
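
To tweak the parameters, it may help to compare raw distances directly: p (the Jaro-Winkler prefix weight, valid between 0 and 0.25) rewards shared prefixes, and threshold controls when a candidate is discarded as NA. The values below are illustrative, not recommendations:

stringdist("wizz air", "wizzair", method = "jw", p = 0)    # plain Jaro distance
stringdist("wizz air", "wizzair", method = "jw", p = 0.1)  # smaller, due to the shared "wizz" prefix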
  • Thank you for your help! Sadly it does not give great results; would it be possible to create a function based on finding the highest number of matching words, along the lines of https://stackoverflow.com/questions/25930550/count-common-words-in-two-strings ? I am trying, but this is definitely too hard for me to get right, apparently! – JMCrocs May 08 '21 at 20:37
  • I opened a new question trying to make it clearer. Thank you again for your help! https://stackoverflow.com/questions/67452050/finding-the-highest-number-of-matching-word-between-two-string-lists-in-r – JMCrocs May 08 '21 at 20:54
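
For completeness, here is a rough sketch of the word-overlap idea raised in the comments (an illustrative assumption, not the accepted approach): tokenize both columns on non-word characters, lowercase, and keep the candidate that shares the most words.

# Hypothetical sketch: match on the maximum number of shared words
to_words <- function(x) strsplit(tolower(x), "\\W+")

match_by_words <- function(partials, fulls) {
    full_words <- to_words(fulls)
    sapply(partials, function(p) {
        pw <- to_words(p)[[1]]
        # count how many words each candidate shares with the partial name
        overlap <- sapply(full_words, function(fw) length(intersect(pw, fw)))
        if (max(overlap) == 0) NA_character_ else fulls[which.max(overlap)]
    }, USE.NAMES = FALSE)
}

partial$word_match <- match_by_words(partial$partial.name, full$full.name)

Note that this still returns NA wherever no whole word is shared (e.g. "wizz air" vs "wizzair", or "WeMove.eu" vs "We Move Europe"), which is presumably what the follow-up question linked above is about.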