
I'm trying to do very basic normalization, and I realize that, to a large extent, URL normalization is an impossible task.

Regardless, different search engines return the same search results with different schemes, hosts, etc. What are the most basic parts I need to collect, and can parse_url extract more than one part at a time so I can keep only the vital parts of the URL?

Result 1: http://dogs.com Result 2: http://www.dogs.com

I need to account for these kinds of inconsistencies, which can be generated by different search engines.
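
Here's the kind of thing I have in mind so far -- a minimal sketch with parse_url, assuming scheme + host + path are the "vital" parts. Stripping "www." is a lossy assumption, which is exactly the part I'm unsure about:

    <?php
    // Minimal sketch: keep scheme + host + path, drop everything else.
    // Stripping "www." assumes www.dogs.com and dogs.com are the same
    // site, which is not guaranteed.
    function normalizeUrl(string $url): string
    {
        $parts  = parse_url($url);                      // scheme, host, path, query, ...
        $scheme = strtolower($parts['scheme'] ?? 'http');
        $host   = strtolower($parts['host'] ?? '');
        $host   = preg_replace('/^www\./', '', $host);  // lossy assumption
        $path   = rtrim($parts['path'] ?? '', '/');     // drop a trailing slash
        return $scheme . '://' . $host . ($path !== '' ? $path : '/');
    }

    echo normalizeUrl('http://www.dogs.com/'); // http://dogs.com/
    echo normalizeUrl('http://dogs.com');      // http://dogs.com/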

the5thace
    Question needs more clarification and some examples of "similar" URLs and what result you'd expect the normalization to produce. –  Jul 28 '13 at 14:05

1 Answer


Result 1: http://dogs.com Result 2: http://www.dogs.com

These 2 aren't the same: one is the main domain, the other is a subdomain. There's no guarantee that they serve the same content.

What you're asking for is basically impossible: any part of the URL is important and changing it may result in a different page.

That said, there's a <link rel="canonical"> tag which indicates the normalized URL of a page. Only that URL is (somewhat) guaranteed to be correct.
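
For example, a minimal sketch that asks the page itself which URL it considers canonical, using DOMDocument (assuming the page is reachable; the URL is only a placeholder):

    <?php
    // Sketch: read the canonical URL a page declares about itself.
    $html = file_get_contents('http://www.dogs.com/');  // placeholder URL
    $doc  = new DOMDocument();
    @$doc->loadHTML($html);  // @ silences warnings from messy real-world markup
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            echo $link->getAttribute('href');  // the site's own normalized URL
            break;
        }
    }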

Also, you could just pull the content from pages and compare them. But, again, no guarantees.
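Something like this sketch, which compares a hash of the two bodies -- dynamic bits (timestamps, ads, session tokens) break it easily, so treat a match as a hint, not proof:

    <?php
    // Sketch: fetch both pages and compare a hash of the bodies.
    $a = file_get_contents('http://dogs.com');
    $b = file_get_contents('http://www.dogs.com');
    if ($a !== false && $b !== false && md5($a) === md5($b)) {
        echo "Probably the same page";
    } else {
        echo "Different content (or a fetch failed)";
    }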

Tom van der Woerdt