String fuzzy matching in dataframe

Question

I have a dataframe containing article titles and their associated URLs.

My problem is that a URL is not necessarily in the same row as its corresponding title. For example:

               title                  |                     urls
    Who will be the next president?   | https://website/5-ways-to-make-a-cocktail.com 
    5 ways to make a cocktail         | https://website/who-will-be-the-next-president.com
    2 millions raised by this startup | https://website/how-did-you-find-your-house.com 
    How did you find your house       | https://website/2-millions-raised-by-this-startup.com
    How did you find your house       | https://washingtonpost/article/latest-movies-in-theater.com
    Latest movies in Theater          | www.newspaper/mynews/what-to-cook-in-summer.com
    What to cook in summer            | https://website/2-millions-raised-by-this-startup.com

My guess is that I need some fuzzy matching logic, but I am not sure how to go about it. For the duplicates I will just use the unique function.

I started with the levenshteinSim function from the RecordLinkage package, which gives a similarity score for each row, but since the rows do not line up, the score is low everywhere.

I also heard about the stringdistmatrix function from the stringdist package, but I am not sure how to use it here.
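
For reference, a minimal sketch of the distance-matrix idea. It uses base R's adist() (generalized Levenshtein distance) as a stand-in for stringdistmatrix, so it is not the stringdist API itself. The slug() helper is hypothetical, and it assumes every URL ends in ".com" with a hyphenated last segment, as in the sample rows above.

```r
# Sample rows copied from the table above
titles <- c("Who will be the next president?",
            "5 ways to make a cocktail",
            "How did you find your house")
urls <- c("https://website/5-ways-to-make-a-cocktail.com",
          "https://website/who-will-be-the-next-president.com",
          "https://website/how-did-you-find-your-house.com")

# slug() is an illustrative helper: keep the last path segment,
# drop the trailing ".com", and turn hyphens into spaces
slug <- function(url) {
  last <- sub(".*/", "", url)
  gsub("-", " ", sub("\\.com$", "", last))
}

# Rows are titles, columns are URL slugs; smaller means more similar
d <- adist(tolower(titles), slug(urls))

# For each title, pick the URL whose slug has the smallest edit distance
best <- urls[apply(d, 1, which.min)]
data.frame(title = titles, guess = best)
```

Taking the row-wise minimum always returns some URL, so a cutoff on the distance would still be needed to flag titles that have no real match.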

Does this structure of "https://website/[string].com" always appear, or is this just your example? If it does, you can use a simple regex to remove it and do an exact match. — David Arenburg, Apr 08 '18 at 06:40
Hi, yes I know regex, but no, it varies a lot as there are many different websites :/ — ML_Enthousiast, Apr 08 '18 at 06:54
Then you should probably make your example more representative, because as it stands now, it is pretty easy to provide a solution for the example you've provided. — David Arenburg, Apr 08 '18 at 07:44
@DavidArenburg: I posted an answer but am interested in a possibly easier solution. — Jan, Apr 09 '18 at 08:20

Jan · Answer 1 · 2018-04-09T09:24:48.017

Can surely be optimized but this might get you started:

The function matcher() compares the words of a title with the words of a url and yields a score
Afterwards pairer() scores every title against every url with matcher() and takes the highest score
If no score reaches the threshold, yield NA

In R:

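matcher <- function(needle, haystack) {
  ### Scores how well the words of a title (needle) match
  ### the words in the last path segment of a url (haystack)

  # split the url on '/' and keep the last segment as lower-case words
  y <- unlist(strsplit(haystack, '/'))
  y <- tolower(unlist(strsplit(y[length(y)], '[-.]')))

  # the needle is already a vector of lower-case words
  x <- needle

  # average of the share of title words found in the url
  # and the share of url words found in the title
  (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2
}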

pairer <- function(titles, urls, threshold = 0.75) {
  ### Returns, for each title, the best-scoring url (or NA)

  # as.character() guards against factor columns, which strsplit() rejects
  titles <- as.character(titles)
  urls <- as.character(urls)

  result <- rep(NA_character_, length(titles))
  for (i in seq_along(titles)) {
    needle <- tolower(unlist(strsplit(titles[i], ' ')))
    scores <- vapply(urls, function(url) matcher(needle, url), numeric(1))
    best <- which.max(scores)  # first url with the highest score

    # keep the best url only if its score reaches the threshold
    if (scores[best] >= threshold) result[i] <- urls[best]
  }
  result
}

df$guess <- pairer(df$title, df$urls)
df

This yields

                              title                                                        urls                                                       guess
1   Who will be the next president?               https://website/5-ways-to-make-a-cocktail.com          https://website/who-will-be-the-next-president.com
2         5 ways to make a cocktail          https://website/who-will-be-the-next-president.com               https://website/5-ways-to-make-a-cocktail.com
3 2 millions raised by this startup             https://website/how-did-you-find-your-house.com       https://website/2-millions-raised-by-this-startup.com
4       How did you find your house       https://website/2-millions-raised-by-this-startup.com             https://website/how-did-you-find-your-house.com
5       How did you find your house https://washingtonpost/article/latest-movies-in-theater.com             https://website/how-did-you-find-your-house.com
6          Latest movies in Theater             www.newspaper/mynews/what-to-cook-in-summer.com https://washingtonpost/article/latest-movies-in-theater.com
7            What to cook in summer       https://website/2-millions-raised-by-this-startup.com             www.newspaper/mynews/what-to-cook-in-summer.com
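
To see where the scores come from, here is a minimal worked sketch of the formula in matcher() for two rows of the table. The name overlap_score() is hypothetical, and the word vectors are written out by hand for illustration; in the code above they come from strsplit().

```r
# Average of the two overlap shares, as in matcher()
overlap_score <- function(x, y) {
  (sum(x %in% y) / length(x) + sum(y %in% x) / length(y)) / 2
}

# Row 2: every title word appears in the url, only "com" is extra
x1 <- c("5", "ways", "to", "make", "a", "cocktail")
y1 <- c("5", "ways", "to", "make", "a", "cocktail", "com")
s1 <- overlap_score(x1, y1)   # (6/6 + 6/7) / 2, about 0.93

# Row 1: "president?" keeps its question mark, so it does not match "president"
x2 <- c("who", "will", "be", "the", "next", "president?")
y2 <- c("who", "will", "be", "the", "next", "president", "com")
s2 <- overlap_score(x2, y2)   # (5/6 + 5/7) / 2, about 0.77

# Both pairs reach the default threshold of 0.75
c(s1, s2) >= 0.75
```

The second pair only just clears the threshold, so stripping punctuation from the titles first, for example with gsub('[[:punct:]]', '', title), may make the matching more robust.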

hey, sorry for my super late reply ! Thanks a lot ! I tried your functions but what I get in return is "Error in strsplit(dataf$url, "/") : non-character argument" so not sure what I am missing there... — ML_Enthousiast, Jun 11 '18 at 15:54
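
A note on that error: strsplit() raises "non-character argument" when it is given a factor instead of a character vector, and factor columns are the default from data.frame() and read.csv() in R versions before 4.0. A minimal sketch, assuming that is the cause here; the one-row dataf below is a hypothetical stand-in for the asker's data.

```r
# Hypothetical one-row stand-in with factor columns
dataf <- data.frame(title = "5 ways to make a cocktail",
                    url   = "https://website/5-ways-to-make-a-cocktail.com",
                    stringsAsFactors = TRUE)

# strsplit() rejects factors; tryCatch() captures the error message
msg <- tryCatch(strsplit(dataf$url, "/"),
                error = function(e) conditionMessage(e))
msg

# Converting the columns to character first avoids the error
dataf$title <- as.character(dataf$title)
dataf$url   <- as.character(dataf$url)
parts <- strsplit(dataf$url, "/")[[1]]
parts[length(parts)]
```

Reading the data with stringsAsFactors = FALSE is another way to get character columns from the start.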
