
I'm trying to do very basic normalization, and I realize that, to a large extent, URL normalization is an impossible task.

Regardless, different search engines return the same search results with different schemes, hosts, etc. What are the most basic parts I need to collect, and can parse_url extract more than one part at a time so I can keep only the vital parts of the URL?

Result 1: http://dogs.com Result 2: http://www.dogs.com

I need to account for these kinds of inconsistencies, which can be generated by different search engines.
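
Here's the kind of thing I have in mind so far -- a minimal sketch with parse_url, assuming scheme + host + path are the "vital" parts. Stripping "www." is a lossy assumption, which is exactly the part I'm unsure about:

    <?php
    // Minimal sketch: keep scheme + host + path, drop everything else.
    // Stripping "www." assumes www.dogs.com and dogs.com are the same
    // site, which is not guaranteed.
    function normalizeUrl(string $url): string
    {
        $parts  = parse_url($url);                      // scheme, host, path, query, ...
        $scheme = strtolower($parts['scheme'] ?? 'http');
        $host   = strtolower($parts['host'] ?? '');
        $host   = preg_replace('/^www\./', '', $host);  // lossy assumption
        $path   = rtrim($parts['path'] ?? '', '/');     // drop a trailing slash
        return $scheme . '://' . $host . ($path !== '' ? $path : '/');
    }

    echo normalizeUrl('http://www.dogs.com/'); // http://dogs.com/
    echo normalizeUrl('http://dogs.com');      // http://dogs.com/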

the5thace
    Question needs more clarification and some examples of "similar" URLs and what result you'd expect the normalization to produce. –  Jul 28 '13 at 14:05

1 Answer


Result 1: http://dogs.com Result 2: http://www.dogs.com

These 2 aren't the same: one is the main domain, the other is a subdomain. There's no guarantee that they serve the same content.

What you're asking for is basically impossible: any part of the URL is important and changing it may result in a different page.

That said, there's a <link rel="canonical"> tag which indicates the normalized URL of a page. Only that URL is (somewhat) guaranteed to be correct.
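
For example, a minimal sketch that asks the page itself which URL it considers canonical, using DOMDocument (assuming the page is reachable; the URL is only a placeholder):

    <?php
    // Sketch: read the canonical URL a page declares about itself.
    $html = file_get_contents('http://www.dogs.com/');  // placeholder URL
    $doc  = new DOMDocument();
    @$doc->loadHTML($html);  // @ silences warnings from messy real-world markup
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            echo $link->getAttribute('href');  // the site's own normalized URL
            break;
        }
    }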

Also, you could just pull the content from pages and compare them. But, again, no guarantees.
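Something like this sketch, which compares a hash of the two bodies -- dynamic bits (timestamps, ads, session tokens) break it easily, so treat a match as a hint, not proof:

    <?php
    // Sketch: fetch both pages and compare a hash of the bodies.
    $a = file_get_contents('http://dogs.com');
    $b = file_get_contents('http://www.dogs.com');
    if ($a !== false && $b !== false && md5($a) === md5($b)) {
        echo "Probably the same page";
    } else {
        echo "Different content (or a fetch failed)";
    }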

Tom van der Woerdt