I've been struggling with parsing a company name from the domain and page title in HTML. Let's say my domain is:
http://thisismycompany.com
and the page title is:
This is an example page title | My Company
My hypothesis is that if I lowercase both strings, remove everything except alphanumeric characters, and then find their longest common substring, the result is very likely to be the company name.
So a longest common substring (Link to python 3 code) would return mycompany. How would I go about matching this substring back to the original page title so that I can retrieve the correct positions of the whitespace and uppercase characters?
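To make the problem concrete, here is a minimal sketch of one approach that might work: while normalizing the title, record for every kept character its index in the original string, then use those indices to slice the original title. The function names normalize and company_name are hypothetical, and difflib.SequenceMatcher from the standard library stands in for the linked longest common substring code, so treat this as an illustration under those assumptions rather than a definitive solution.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase text and keep only alphanumerics.

    Returns the normalized string plus a list that maps each normalized
    character back to its index in the original text.
    """
    chars, index_map = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            # lower() can yield more than one character for some Unicode
            # letters, so map every produced character to the same index.
            for low in ch.lower():
                chars.append(low)
                index_map.append(i)
    return "".join(chars), index_map


def company_name(domain, title):
    """Hypothetical helper: longest common substring, in original casing."""
    norm_domain, _ = normalize(domain)
    norm_title, title_map = normalize(title)

    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters in long sequences.
    matcher = SequenceMatcher(None, norm_domain, norm_title, autojunk=False)
    match = matcher.find_longest_match(0, len(norm_domain), 0, len(norm_title))
    if match.size == 0:
        return ""

    # Translate the normalized span back to positions in the original title.
    start = title_map[match.b]
    end = title_map[match.b + match.size - 1] + 1
    return title[start:end]


print(company_name("http://thisismycompany.com",
                   "This is an example page title | My Company"))
# prints "My Company"
```

Because the slice is taken from the original title, any whitespace or punctuation that falls between the first and last matched characters is kept as well.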