I have a dataframe containing article titles and their associated URLs.
My problem is that a URL is not necessarily in the same row as its corresponding title. For example:
title | urls
Who will be the next president? | https://website/5-ways-to-make-a-cocktail.com
5 ways to make a cocktail | https://website/who-will-be-the-next-president.com
2 millions raised by this startup | https://website/how-did-you-find-your-house.com
How did you find your house | https://website/2-millions-raised-by-this-startup.com
How did you find your house | https://washingtonpost/article/latest-movies-in-theater.com
Latest movies in Theater | www.newspaper/mynews/what-to-cook-in-summer.com
What to cook in summer | https://website/2-millions-raised-by-this-startup.com
My guess is that I need some fuzzy matching logic, but I am not sure how to go about it. For the duplicates, I will just use the unique function.
I started with the levenshteinSim function from the RecordLinkage package, which gives a similarity score for each row, but since the title and URL in a given row do not belong together, the similarity score is low everywhere.
I have also heard about the stringdistmatrix function from the stringdist package, but I am not sure how to use it here.
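As a rough illustration of how stringdistmatrix might be applied to this problem, here is a minimal sketch rather than a definitive solution. It assumes that each URL ends in a slug derived from the title, followed by ".com", as in the example table. The names df, normalize_title, url_slug, best and matched are hypothetical and chosen only for this example.

```r
library(stringdist)

# Example data copied from the table in the question
df <- data.frame(
  title = c("Who will be the next president?",
            "5 ways to make a cocktail",
            "2 millions raised by this startup",
            "How did you find your house",
            "How did you find your house",
            "Latest movies in Theater",
            "What to cook in summer"),
  urls = c("https://website/5-ways-to-make-a-cocktail.com",
           "https://website/who-will-be-the-next-president.com",
           "https://website/how-did-you-find-your-house.com",
           "https://website/2-millions-raised-by-this-startup.com",
           "https://washingtonpost/article/latest-movies-in-theater.com",
           "www.newspaper/mynews/what-to-cook-in-summer.com",
           "https://website/2-millions-raised-by-this-startup.com"),
  stringsAsFactors = FALSE
)

# Deduplicate each column on its own, because the rows are not aligned
titles <- unique(df$title)
urls   <- unique(df$urls)

# Hypothetical helper: lower-case the title and drop punctuation
normalize_title <- function(x) trimws(gsub("[^a-z0-9 ]", "", tolower(x)))

# Hypothetical helper: keep the last path component, drop the assumed
# ".com" suffix, and turn hyphens into spaces
url_slug <- function(x) gsub("-", " ", sub("\\.com$", "", basename(x)))

# Levenshtein distance between every title (rows) and every URL slug (columns)
d <- stringdistmatrix(normalize_title(titles), url_slug(urls), method = "lv")

# For each title, take the column index of the closest slug
best <- apply(d, 1, which.min)

matched <- data.frame(
  title    = titles,
  url      = urls[best],
  distance = d[cbind(seq_along(titles), best)],
  stringsAsFactors = FALSE
)
print(matched)
```

Note that this picks the nearest URL for each title independently, so nothing prevents two titles from being assigned the same URL. On noisier data it may be worth inspecting the distance column and rejecting matches above some threshold.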