
One of the main purposes of URL normalization is to avoid GET requests on distinct URLs that produce the exact same result.

Now, I know that you can check for the canonical tag, and even compare the two URLs' HTML to see whether they're the same; however, you have to download the exact same resource twice to do that, which defeats the purpose I stated above.

Is there a way to check for duplicated content using only a HEAD request? If not, is there a way to download only the &lt;head&gt; section of a web page without downloading the entire document?

I can think of solutions for the last one; I just want to know whether there's a direct one.
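For the last one, the kind of solution I have in mind is to read the response stream only until </head> shows up and then stop, so the rest of the document is never downloaded. A rough, untested sketch:

Imports System.IO
Imports System.Net
Imports System.Text

Module HeadOnlyDownload
    ' Rough sketch: read the body in small chunks and stop as soon as
    ' "</head>" has been seen, so the full document is never pulled down.
    ' Assumes the head section arrives early and is reasonably small.
    Function DownloadHeadSection(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        Dim builder As New StringBuilder()
        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream())
                Dim buffer(1023) As Char
                Do
                    Dim read As Integer = reader.Read(buffer, 0, buffer.Length)
                    If read <= 0 Then Exit Do
                    builder.Append(buffer, 0, read)
                Loop Until builder.ToString().ToLowerInvariant().Contains("</head>")
            End Using
        End Using
        Return builder.ToString()
    End Function
End Module

Note this only saves download; the server still generates the whole page.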

Ben
  • Looking through the [Wikipedia article](http://en.wikipedia.org/wiki/URL_normalization), it seems to me like you are describing a different problem than that posed by URL normalization. A web crawler will normalize URLs to make sure that it is using the canonical version; you appear to be describing a problem whereby two different, but *already normalized* URLs on the same website can produce the same resulting output. Am I characterizing your problem correctly? – Robert Harvey May 10 '11 at 22:54
  • @Robert Harvey - Correct. Normalizing URLs would normally be a way to minimize duplicated content. I'm looking for a way to avoid making two GET requests to determine whether two URLs serve the exact same HTML, whether those URLs are normalized or not; that way URL normalization in web crawlers wouldn't be necessary per se. I was thinking of hashing the HEAD request's response (see the sketch after these comments); how reliable is that? – Ben May 10 '11 at 23:29
  • Normalizing URLs does not minimize duplicate content; it minimizes the number of possible ways that the *same* URL can be presented to a web crawler so that it doesn't have to crawl the same page repeatedly. Having two different normalized URLs point to the same page is a different problem. – Robert Harvey May 10 '11 at 23:35
  • @Robert Harvey Correct; however, making two HEAD requests (<1 KB) is much better than making two GET requests (headers plus the full HTML, >1 KB). – Ben May 10 '11 at 23:41
  • 2
  • Well, according to [this,](http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods) you can do a HEAD request instead of a GET request, and it should return only the page head. Now all you need to do is put some unique ID element in each page head (like a GUID or page ID number), and then you can just check the ID against your other HEAD request for duplication. – Robert Harvey May 11 '11 at 05:10
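To illustrate the header-hashing idea from the comments: issue a HEAD request, keep a few headers that should be stable for identical content, and hash them. This is only as reliable as the headers the server sends; without a stable ETag or Last-Modified the comparison proves little. A rough sketch:

Imports System.Net
Imports System.Security.Cryptography
Imports System.Text

Module HeadFingerprint
    ' Sketch: fingerprint a URL from a HEAD request by hashing a few headers
    ' that are usually stable for identical content. Volatile headers such as
    ' Date are deliberately left out. Different fingerprints mean the URLs are
    ' almost certainly different; equal fingerprints still need a GET to confirm.
    Function HeadHash(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.Method = "HEAD"
        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Dim parts As String = String.Concat(
                response.Headers("ETag"),
                response.Headers("Content-Length"),
                response.Headers("Last-Modified"))
            Using sha As SHA1 = SHA1.Create()
                Return BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(parts)))
            End Using
        End Using
    End Function
End Module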

1 Answer


According to the MSDN documentation, the solution to your question is as follows:

' Requires: Imports System.Net
' Send a HEAD request so only the response headers travel over the wire
' (the original MSDN sample uses the default GET), then list every header.
Dim myHttpWebRequest As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
myHttpWebRequest.Method = "HEAD"
Dim myHttpWebResponse As HttpWebResponse = CType(myHttpWebRequest.GetResponse(), HttpWebResponse)
Console.WriteLine(vbCrLf & "The following headers were received in the response")
For i As Integer = 0 To myHttpWebResponse.Headers.Count - 1
    Console.WriteLine("Header Name: {0}, Value: {1}", myHttpWebResponse.Headers.Keys(i), myHttpWebResponse.Headers(i))
Next
myHttpWebResponse.Close()

Let me explain this code. The first line creates an HttpWebRequest for the specified URL, and setting Method to "HEAD" asks the server for the headers only. GetResponse() sends the request and returns the response; its Headers property is a WebHeaderCollection, and the loop walks that collection and displays each header name and value. Finally, Close() releases the connection. If you want only the <head> portion of the web page itself, a PHP class that extracts it is freely available at http://www.phpclasses.org/package/4033-PHP-Extract-HTML-contained-in-tags-from-a-Web-page.html
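For example, you could wrap the same calls in a small helper and compare a couple of the returned headers for two URLs. The helper names below are just an example, and the check only means something if the server actually sends an ETag and Content-Length:

Imports System.Net

Module DuplicateCheck
    ' Sketch: flag two URLs as likely duplicates when their HEAD responses
    ' carry the same ETag and Content-Length. Absent or unstable headers
    ' make this inconclusive, so treat a match as a hint, not proof.
    Function LooksLikeSameContent(ByVal urlA As String, ByVal urlB As String) As Boolean
        Dim headersA As WebHeaderCollection = GetHeadHeaders(urlA)
        Dim headersB As WebHeaderCollection = GetHeadHeaders(urlB)
        Return headersA("ETag") IsNot Nothing AndAlso
               headersA("ETag") = headersB("ETag") AndAlso
               headersA("Content-Length") = headersB("Content-Length")
    End Function

    Function GetHeadHeaders(ByVal url As String) As WebHeaderCollection
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        request.Method = "HEAD"
        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            Return response.Headers
        End Using
    End Function
End Module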

Vineet1982