I'm trying to do some very basic normalization, and I realize that URL normalization is, to a large extent, an impossible task.
Regardless, different search engines return the same result with different schemes, hosts, etc. What are the most basic parts I need to keep, and can parse_url extract more than one part at a time so I can reduce a URL to only its vital parts?
Result 1: http://dogs.com
Result 2: http://www.dogs.com
I need to account for these kinds of inconsistencies, which different search engines can and do generate.
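To illustrate the idea: calling parse_url without its optional second argument returns an associative array of all components at once, so you can pick out several parts in one call. Below is a minimal sketch that keeps only the host and path (my assumption of which parts are "vital"), and strips a leading "www." so the two example results above compare equal — the function name normalize_url and these particular rules are mine, not a standard:

```php
<?php
// Minimal sketch: reduce a URL to host + path using parse_url.
// Which parts count as "vital" is an assumption; adjust to taste.
function normalize_url(string $url): string {
    // Without the second ($component) argument, parse_url returns an
    // array of every component it found, so we can take more than one.
    $parts = parse_url(strtolower(trim($url)));
    if ($parts === false || !isset($parts['host'])) {
        return $url; // leave unparseable input untouched
    }
    // Strip a leading "www." so dogs.com and www.dogs.com match.
    $host = preg_replace('/^www\./', '', $parts['host']);
    // Drop a trailing slash so /path/ and /path match.
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    return $host . $path;
}

// Both example results collapse to the same key:
// normalize_url('http://dogs.com');     // "dogs.com"
// normalize_url('http://www.dogs.com'); // "dogs.com"
```

Note the caveats: lowercasing the whole URL is only strictly safe for the scheme and host (paths and query strings can be case-sensitive), and stripping "www." assumes it always points at the same site, which is usually but not universally true.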