
I've been struggling with parsing a company name from the domain and page title of an HTML page. Let's say my domain is:

http://thisismycompany.com

and the page title is:

This is an example page title | My Company

My hypothesis is that if I take the longest common substring of these two strings, after lowercasing them and removing everything except alphanumeric characters, the result is very likely to be the company name.

So a longest common substring function (link to Python 3 code) would return mycompany. How would I go about matching this substring back to the original page title, so that I can recover the correct positions of whitespace and uppercase characters?
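For reference, the normalization and longest-common-substring step described above might look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library rather than the linked code, and the helper names `normalize` and `longest_common_substring` are made up for illustration. It also assumes only the hostname of the URL should be compared, not the `http://` scheme.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits
    return re.sub(r'[^0-9a-z]+', '', text.lower())


def longest_common_substring(a, b):
    # With junk detection disabled, find_longest_match returns the
    # longest block of characters that appears contiguously in both strings
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# Assumption: only the hostname part of the URL is relevant
domain = normalize(urlparse('http://thisismycompany.com').hostname)
title = normalize('This is an example page title | My Company')

candidate = longest_common_substring(domain, title)
print(candidate)  # mycompany
```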

LexMulier

2 Answers


I considered whether this would be doable using regex, but I figured it would be easier to just use normal string manipulation / comparison, especially because this doesn't seem like a time-sensitive task.

def find_name(normalized_name, full_name_container):
  separators = ['-', '_', '.', ' ']
  if not normalized_name:
    return ''

  n = 0
  full_name = ''
  for char in full_name_container:
    # If the characters at the current position in both
    # strings match, add the proper case to the final string
    # and move onto the next character
    if normalized_name[n].upper() == char.upper():
      full_name += char
      n += 1
      if n == len(normalized_name):
        return full_name

    # If the name is interrupted by a separator, add that to the result
    # (separators before the first matched character are skipped, so the
    # result does not start with a stray space)
    elif char in separators:
      if n > 0:
        full_name += char

    # If a character is encountered that is definitely not part of the name,
    # restart the search; the character itself may begin a new match
    # (this is a simple restart, not full backtracking)
    elif normalized_name[0].upper() == char.upper():
      n = 1
      full_name = char
    else:
      n = 0
      full_name = ''

  # The complete name was never matched
  return ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))

This prints "My Company". The hard-coded list of separators (spaces, hyphens, and so on) that may interrupt the normalized name is probably something you'll have to extend.
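For comparison, the regex route mentioned at the top of this answer could be sketched as follows. This is only an illustration under the assumption that any run of non-alphanumeric characters may sit between two letters of the name; the function name `find_name_regex` is hypothetical.

```python
import re


def find_name_regex(normalized_name, text):
    # Allow any run of non-alphanumeric characters between consecutive
    # characters of the normalized name, and ignore case
    gap = r'[^0-9a-zA-Z]*'
    pattern = gap.join(re.escape(c) for c in normalized_name)
    found = re.search(pattern, text, flags=re.IGNORECASE)
    return found.group(0) if found else None


print(find_name_regex('mycompany', 'Some stuff My Company Some Stuff'))
```

Because the pattern starts and ends with a character of the name, the match never includes leading or trailing separators, and no separator list has to be maintained by hand.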

CBeltz
  • 1
    Awesome, thanks. This method is actually the implementation I had in mind at first, but I couldn't get it working. In the meantime I have found a different implementation as well. I'll add it as an answer so that you and others can check it out. – LexMulier Jan 30 '17 at 10:45

I solved it by generating a set of all possible substrings of the title, and then comparing each one, after the same normalization, with the result I got from the longest common substring function.

import re

def get_all_substrings(input_string):
    length = len(input_string)
    return set(input_string[i:j+1] for i in range(length) for j in range(i, length))

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'

# Several substrings normalize to the same string (e.g. ' | My Company'),
# and a set has no defined order, so keep the shortest candidate
match = None
for substring in get_all_substrings(page_title):
    if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower():
        if match is None or len(substring) < len(match):
            match = substring

print(match)

Edit: source used

LexMulier
  • 1
    I feel like this may be the better solution. It probably applies to a lot more cases than mine does. Mine may, however, be more efficient on simpler examples. – CBeltz Jan 30 '17 at 12:10
  • 1
    I agree. One more improvement might be to combine both loops and break as soon as a match is found. That way it checks fewer substrings instead of all of them (unless the last one happens to be the match, of course). – LexMulier Jan 30 '17 at 12:19
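
The improvement suggested in the last comment, generating substrings and checking them in one pass with an early exit, might look something like this sketch. The function name `find_original` and the extra pruning step are assumptions added for illustration, not part of either answer.

```python
import re


def find_original(normalized, text):
    for start in range(len(text)):
        # Only start at a letter or digit, so the result has no leading separator
        if not re.match(r'[0-9a-zA-Z]', text[start]):
            continue
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            cleaned = re.sub(r'[^0-9a-zA-Z]+', '', candidate).lower()
            if cleaned == normalized:
                # Stop at the first match instead of building every substring
                return candidate
            if not normalized.startswith(cleaned):
                # Extending this candidate can never produce a match
                break
    return None


print(find_original('mycompany', 'This is an example page title | My Company'))
```

Since substrings are generated in a fixed order rather than pulled from a set, the same input always gives the same result.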