I have a list of company names, and I have a list of URLs that mention company names.

The end goal is to look at the page behind each URL and find out how many of the companies mentioned there are in my list.

Example URL: http://www.dmx.com/about/our-clients

Each URL will be structured differently, so I don't have a good way to do a regex search and create individual strings for each company name.

I'd like to build a for loop to search the entire contents of each URL for each company from my list. But it seems like Levenshtein distance is better suited to comparing two short strings than a short string against a large body of text.

Where should this beginner be looking?

Kyle

2 Answers

It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality:

>>> import urllib2
>>> webpage = urllib2.urlopen('http://www.dmx.com/about/our-clients')
>>> webpage_text = webpage.read()
>>> webpage.close()
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print name, "found!"
... 
Caribou Coffee found!
Express found!
>>> 

If you are worried about capitalization mismatches, just convert both the page text and your names to uppercase.

>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print name, 'found!'
... 
CARIBOU COFFEE found!
EXPRESS found!
senderle
  • +1 This is definitely the brute force approach and a pretty efficient one at that. – jathanism May 25 '11 at 01:07
  • 1
    that makes sense, and a good start. The reason I was thinking fuzzy matching is for instances of "Sears Inc." vs. "Sears"... etc – Kyle May 25 '11 at 03:18
  • @Kyle, I see your point -- but as long as your list of names contains the shortest unambiguous prefixes of the full company names, then it's not likely to be a big problem. So for example, if you have `'Sears'` in your list, then `'Sears, Inc.'` will also be matched. There are a few situations that could cause false negatives; but with fuzzy matching you'll probably get false positives, so I guess it depends on which of those you find more tolerable. – senderle May 25 '11 at 03:37
  • I'm working on something similar to this and was wondering if there are any concerns re: performance. Especially if I have a list of 10,000 words. Are there non-brute-force ways to approach this. Maybe I would parallelize this process? – b.lyte Jan 18 '21 at 17:05
  • @LyteSpeed I think it will still depend on other details. Searching for matches with any of 10k words would certainly be inefficient, but I'm fairly certain a single search would still complete in milliseconds, if not microseconds. So if you're only performing one of these searches every few seconds, you're still probably fine. If you are performing one of these searches every millisecond, then you may have a problem. – senderle Jan 19 '21 at 01:21
  • @LyteSpeed In that case there are many, many possible solutions, depending on your specific needs. If you only need exact substring matching, one simple solution involves using a [trie](https://en.m.wikipedia.org/wiki/Trie). If you need real fuzzy matching, things can get very complicated. You might be better off using something like [lucene](https://lucene.apache.org/pylucene/). But I've heard of people having success with [locality sensitive hashing](https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing) over [character trigrams](https://ii.nlm.nih.gov/MTI/Details/trigram.shtml). – senderle Jan 19 '21 at 01:45
  • Thank you @senderle for the in-depth response! Super helpful. Will look into those. I will stick to the brute force, but good to be aware of more efficient methods. I don't need fuzz match either. – b.lyte Jan 19 '21 at 23:12
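
Regarding the trie suggestion in the comments above: here is a minimal sketch of that idea for exact matching against many names at once, assuming the names and the page text have already been normalized (e.g., lowercased). The build_trie and find_names helpers are made-up names for illustration, not part of any library.

def build_trie(names):
    # nested-dict trie; a '$' key marks the end of a complete name
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node['$'] = name
    return root

def find_names(text, trie):
    # walk the trie from every starting position in the text and
    # collect every name that appears as a substring
    found = set()
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            ch = text[i]
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:
                found.add(node['$'])
    return found

The work done at each starting position is bounded by the length of the longest name rather than by the number of names, which is what makes this cheaper than 10,000 separate substring searches.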

I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters), and then apply the same normalization to `webpage_text` and to your list of strings.

def normalize_str(some_str):
    # lowercase and strip punctuation so comparisons ignore formatting differences
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str
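
For example, applied to `webpage_text` from the first answer and a list of company names (the client_names variable is just a placeholder here), this produces the normalized_client_names used below:

webpage_text = normalize_str(webpage_text)
normalized_client_names = [normalize_str(name) for name in client_names]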

If this isn't good enough, you can go to difflib and do something like:

import difflib

for client in normalized_client_names:
    # get_close_matches expects a list of candidate strings, so split the page into words
    # (multi-word names like "caribou coffee" would need a smarter candidate list)
    closest_client = difflib.get_close_matches(client, webpage_text.split(), 1, 0.8)
    if len(closest_client) > 0:
        print client, "found as", closest_client[0]

The arbitrary cutoff I chose (a Ratcliff/Obershelp ratio of 0.8) may be too lenient or too strict; play with it a bit.
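
To get a feel for where 0.8 falls, difflib.SequenceMatcher exposes the same ratio directly; a quick sketch, with the strings assumed to have already been through normalize_str:

import difflib

# 'sears' vs. 'sears inc': 5 matching characters out of 14 total, ratio = 10/14 ≈ 0.71,
# so a 0.8 cutoff would miss it
print difflib.SequenceMatcher(None, 'sears', 'sears inc').ratio()

# 'caribou coffee' vs. 'caribou coffee co': ratio = 28/31 ≈ 0.90, so 0.8 would catch it
print difflib.SequenceMatcher(None, 'caribou coffee', 'caribou coffee co').ratio()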

dr jimbob