I'd like to find a tool that does a good job of fuzzy matching URLs that are the same except for extra parameters. For instance, for my use case, these two URLs are the same:
atest = ('http://www.npr.org/templates/story/story.php?storyId=4231170', 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp')
At first blush, fuzz.partial_ratio and fuzz.token_set_ratio from fuzzywuzzy seem to get the job done with a threshold of 100:
from fuzzywuzzy import fuzz

ratio = fuzz.ratio(atest[0], atest[1])
partialratio = fuzz.partial_ratio(atest[0], atest[1])
sortratio = fuzz.token_sort_ratio(atest[0], atest[1])
setratio = fuzz.token_set_ratio(atest[0], atest[1])

print('ratio: %s' % (ratio))
print('partialratio: %s' % (partialratio))
print('sortratio: %s' % (sortratio))
print('setratio: %s' % (setratio))
>>>ratio: 83
>>>partialratio: 100
>>>sortratio: 83
>>>setratio: 100
But this approach fails and returns 100 in other cases, like:
atest = ('yahoo.com', 'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html')
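Here, I'm fairly sure partial_ratio comes back as 100 simply because the shorter string appears verbatim inside the longer one:

print(fuzz.partial_ratio(atest[0], atest[1]))
>>>100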
The URLs in my data and the parameters added to them vary a great deal. I'd be interested to know if anyone has a better approach using URL parsing or similar.
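To show the direction I'm imagining, here's a minimal sketch using urllib.parse; the rule it encodes (same host and path, and one query string is a subset of the other) is just my guess at what "same page" should mean:

from urllib.parse import urlparse, parse_qs

def same_page(url_a, url_b):
    # Split each URL into scheme/host/path/query components.
    a, b = urlparse(url_a), urlparse(url_b)
    # Host and path must match exactly (host compared case-insensitively).
    if (a.netloc.lower(), a.path) != (b.netloc.lower(), b.path):
        return False
    # Parse the query strings into dicts of parameter -> list of values.
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    # Accept the pair when the smaller query is contained in the larger,
    # i.e. the only difference is extra parameters (e.g. tracking tags).
    smaller, larger = sorted((qa, qb), key=len)
    return all(larger.get(k) == v for k, v in smaller.items())

print(same_page('http://www.npr.org/templates/story/story.php?storyId=4231170',
                'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp'))
print(same_page('yahoo.com',
                'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html'))
>>>True
>>>False

But I'm not sure how robust this is against the variation in my data, e.g. scheme differences, default ports, or parameters that change a page's content rather than just tracking it.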