I am writing a crawler that regularly inspects a list of news websites for new articles. I have read about different approaches to avoiding unnecessary page downloads, and have identified five header elements that could be useful for deciding whether a page has changed (a quick way to inspect them is sketched after the list):
- HTTP status code
- ETag
- Last-Modified (combined with an If-Modified-Since request header)
- Expires
- Content-Length
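
For reference, this is how I currently read those headers with the requests library (the URL below is just a placeholder):

    import requests

    # Peek at the headers with a HEAD request, without downloading the body
    resp = requests.head("http://example.com/news", allow_redirects=True)
    print(resp.status_code)                    # HTTP status code
    print(resp.headers.get("ETag"))            # ETag
    print(resp.headers.get("Last-Modified"))   # Last-Modified
    print(resp.headers.get("Expires"))         # Expires
    print(resp.headers.get("Content-Length"))  # Content-Length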
The excellent FeedParser.org seems to implement some of these approaches.
I am looking for optimal code in Python (or any similar language) that makes this kind of decision. Keep in mind that the header info is whatever the server chooses to provide, so any of these fields may be missing.
That could be something like:
    def should_download(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
        # retrieve the headers, do the magic here, and return the decision
        return decision
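
For context, here is a rough sketch of the logic I have in mind, using the requests library. The precedence of the checks and the "download when in doubt" fallback are my own assumptions rather than an established algorithm, and every prev_* argument may be None:

    import email.utils
    import time

    import requests

    def should_download(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
        """Return True if the page looks new or changed; when in doubt, download."""
        # 1. If the previously seen Expires timestamp is still in the future,
        #    assume the cached copy is fresh and skip the request entirely.
        if prev_expires:
            try:
                expires = email.utils.parsedate_to_datetime(prev_expires)
                if expires.timestamp() > time.time():
                    return False
            except (TypeError, ValueError):
                pass  # unparsable Expires header: ignore it

        # 2. Ask the server via a conditional HEAD request.
        cond_headers = {}
        if prev_etag:
            cond_headers["If-None-Match"] = prev_etag
        if prev_lastmod:
            cond_headers["If-Modified-Since"] = prev_lastmod
        resp = requests.head(url, headers=cond_headers, allow_redirects=True, timeout=10)

        # 3. 304 Not Modified: the server confirms the page is unchanged.
        if resp.status_code == 304:
            return False

        # 4. Some servers ignore conditional headers, so compare validators by hand.
        if prev_etag and resp.headers.get("ETag") == prev_etag:
            return False
        if prev_lastmod and resp.headers.get("Last-Modified") == prev_lastmod:
            return False
        # Content-Length alone is a weak signal (same-size edits slip through),
        # so treating an equal length as "unchanged" is an assumption.
        if prev_content_length is not None and resp.headers.get("Content-Length") == str(prev_content_length):
            return False

        return True

After each actual download I would then store the ETag, Last-Modified, Expires and Content-Length response headers for that URL and feed them back into the next call.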