I'd like to find a tool that does a good job of fuzzy matching URLs that are the same except for extra parameters. For instance, for my use case, these two URLs are the same:
atest = ('http://www.npr.org/templates/story/story.php?storyId=4231170', 'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp')
At first blush, fuzz.partial_ratio and fuzz.token_set_ratio from fuzzywuzzy seem to get the job done with a threshold of 100:
from fuzzywuzzy import fuzz

ratio = fuzz.ratio(atest[0], atest[1])
partialratio = fuzz.partial_ratio(atest[0], atest[1])
sortratio = fuzz.token_sort_ratio(atest[0], atest[1])
setratio = fuzz.token_set_ratio(atest[0], atest[1])

print('ratio: %s' % (ratio))
print('partialratio: %s' % (partialratio))
print('sortratio: %s' % (sortratio))
print('setratio: %s' % (setratio))
>>>ratio: 83
>>>partialratio: 100
>>>sortratio: 83
>>>setratio: 100
But this approach fails and returns 100 in other cases, like:
atest = ('yahoo.com', 'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html')
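Here, I'm fairly sure partial_ratio comes back as 100 simply because the shorter string appears verbatim inside the longer one:

print(fuzz.partial_ratio(atest[0], atest[1]))
>>>100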
The URLs in my data and the parameters added to them vary a great deal. I'd be interested to know if anyone has a better approach using URL parsing or similar.
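To show the direction I'm imagining, here's a minimal sketch using urllib.parse; the rule it encodes (same host and path, and one query string is a subset of the other) is just my guess at what "same page" should mean:

from urllib.parse import urlparse, parse_qs

def same_page(url_a, url_b):
    # Split each URL into scheme/host/path/query components.
    a, b = urlparse(url_a), urlparse(url_b)
    # Host and path must match exactly (host compared case-insensitively).
    if (a.netloc.lower(), a.path) != (b.netloc.lower(), b.path):
        return False
    # Parse the query strings into dicts of parameter -> list of values.
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    # Accept the pair when the smaller query is contained in the larger,
    # i.e. the only difference is extra parameters (e.g. tracking tags).
    smaller, larger = sorted((qa, qb), key=len)
    return all(larger.get(k) == v for k, v in smaller.items())

print(same_page('http://www.npr.org/templates/story/story.php?storyId=4231170',
                'http://www.npr.org/templates/story/story.php?storyId=4231170&sc=fb&cc=fp'))
print(same_page('yahoo.com',
                'http://finance.yahoo.com/news/earnings-preview-monsanto-report-2q-174000816.html'))
>>>True
>>>False

But I'm not sure how robust this is against the variation in my data, e.g. scheme differences, default ports, or parameters that change a page's content rather than just tracking it.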