3

What if I only need to download the page if it has not changed since the last download? What is the best way? can I get the size of the page first, then compare the decide if it has changed, if so, I ask for download else skip?

I plan to use (python) mechanize.

Val Neekman
  • 17,692
  • 14
  • 63
  • 66

2 Answers2

5

the request should be a HEAD, not a GET:

9.4 HEAD

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.

The response to a HEAD request MAY be cacheable in the sense that the information contained in the response MAY be used to update a previously cached entity from that resource. If the new field values indicate that the cached entity differs from the current entity (as would be indicated by a change in Content-Length, Content-MD5, ETag or Last-Modified), then the cache MUST treat the cache entry as stale.

See here How can I perform a HEAD request with the mechanize library?

Community
  • 1
  • 1
Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
0

yes you can get more information in python mechanize by setting like this

br = mechanize.Browser()
br.set_debug_http(True)
br.set_debug_redirects(True)
... Your code here ...

by doing this, you can get valuable header information of the page

Yuda Prawira
  • 12,075
  • 10
  • 46
  • 54