10

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.

I'm thinking of comparing hashes. The problem with this is that if the page has changed by even a single byte or character, the hash will be different. So, for example, if the page displays the current date, the hash will be different every single time and tell me that the content has been updated.

So... how would you do this? Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have been changed?

Regarding Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. A better way might be to combine the hash and content-length solutions: check the hash, and if it has changed, check the string length.

Savad KP
  • 1,625
  • 3
  • 28
  • 40
  • Related: http://stackoverflow.com/q/4618530 – Basilevs Dec 28 '15 at 08:48
  • Are you sure you need to compare the complete page sources and not, a specific part that you expect to be updated? – alecxe Dec 28 '15 at 14:18
  • I want to compare complete page. – Savad KP Dec 29 '15 at 03:53
  • Not sure how complicated the structure of your page is. If there is some text that you want to ignore, such as a date, and it sits in its own HTML tag, then remove that element before you do the hashing, so you have cleaner data to compare. – thep Jan 04 '16 at 05:00

6 Answers

4

There is no universal solution.

  • Use If-Modified-Since or HEAD requests when possible (usually ignored by dynamic pages).
  • Use RSS when possible.
  • Extract the last-modification stamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath).
  • Only hash the interesting elements of the page (build a site-specific model), excluding the volatile parts; see the sketch after this list.
  • Hash the whole content (useless for dynamic pages).
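For the "hash interesting elements" option, a minimal sketch, assuming lxml is installed and using a hypothetical XPath that you would replace with one matching the site's stable content:

import hashlib

import lxml.html
import requests

def content_fingerprint(url, xpath="//div[@id='content']"):
    # Hash only the selected elements; volatile parts of the page
    # (dates, ads, session tokens) never enter the digest.
    tree = lxml.html.fromstring(requests.get(url).content)
    digest = hashlib.sha512()
    for element in tree.xpath(xpath):
        digest.update(lxml.html.tostring(element))
    return digest.hexdigest()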
Basilevs
  • 22,440
  • 15
  • 57
  • 102
2

Safest solution:

Download the content, create a SHA512 checksum of it, keep it in the DB, and compare it against the stored value each time.

Pros: You are not dependent on any server headers and will detect any modification.
Cons: Heavy bandwidth usage; you have to download the whole content every time.
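A minimal sketch of this approach, assuming requests is available and using a plain state file in place of the database:

import hashlib

import requests

def has_changed(url, state_file="last_hash.txt"):
    # Hash the full response body and compare it with the stored digest.
    digest = hashlib.sha512(requests.get(url).content).hexdigest()
    try:
        with open(state_file) as f:
            previous = f.read().strip()
    except IOError:  # first run: no stored hash yet
        previous = None
    with open(state_file, "w") as f:
        f.write(digest)
    return digest != previous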

Using HEAD

Request the page using the HEAD verb and check the header fields:

  • Last-Modified: The server should report the last time the page was generated or modified.
  • ETag: A checksum-like value defined by the server, which should change as soon as the content changes.

Pros: Much less bandwidth usage and very quick update checks.
Cons: Not all servers provide or obey these headers, and you still need a real GET request to fetch the resource once you detect that new data is available.
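A minimal sketch, assuming the server actually sends these headers (compare the returned pair with the one you stored last time, and fall back to a full GET when it differs):

import requests

def head_fingerprint(url):
    # One HEAD round trip; no response body is downloaded.
    headers = requests.head(url).headers
    return headers.get("Last-Modified"), headers.get("ETag")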

Using GET

Request the page using the GET verb with conditional header fields:

  • If-Modified-Since: The server checks whether the resource has been modified since the given time and either returns the content or a 304 Not Modified response.

Pros: Still uses less bandwidth, and the data arrives in a single round trip.
Cons: Again, not all resources support this header.
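A minimal sketch, where last_seen is the Last-Modified value saved from a previous response:

import requests

def fetch_if_modified(url, last_seen=None):
    headers = {"If-Modified-Since": last_seen} if last_seen else {}
    response = requests.get(url, headers=headers)
    if response.status_code == 304:  # Not Modified: nothing to do
        return None, last_seen
    return response.content, response.headers.get("Last-Modified")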

Finally, a mix of the above solutions is probably the optimal way to do this.

Ali Nikneshan
  • 3,500
  • 27
  • 39
2

If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.

Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"?

That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.

Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have been changed?

You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
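A rough sketch of that idea (the function names are illustrative, not from any library):

import collections

def frequency_signature(content):
    # Byte values ordered from most to least frequent in the document.
    counts = collections.Counter(content)
    return [byte for byte, _ in counts.most_common()]

def ordering_distance(sig_a, sig_b):
    # How far each byte value moved in the frequency ranking; a bigger
    # value means the documents' byte distributions differ more.
    positions = {byte: i for i, byte in enumerate(sig_b)}
    return sum(abs(i - positions.get(byte, len(sig_b)))
               for i, byte in enumerate(sig_a))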

An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
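As a sketch of that, with fixed-length slicing standing in for the structural split (header/body/headings/paragraphs) described above:

import hashlib

def slice_hashes(content, slices=30):
    # Hash fixed-size slices; a real version would slice on structure.
    step = max(1, len(content) // slices)
    return [hashlib.md5(content[i:i + step]).hexdigest()
            for i in range(0, len(content), step)]

def changed_slices(old, new):
    return sum(1 for a, b in zip(old, new) if a != b)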

You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
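For instance (the pattern below only catches HH:MM[:SS] times and is purely illustrative):

import re

TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b")

def normalise(content):
    # Replace volatile timestamps before hashing or diffing.
    return TIME_RE.sub("<time>", content)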

You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
1

Hope this helps.

Store two versions of the HTML file:

one is the HTML fetched an hour ago -- first.html

the second is the HTML fetched just now -- second.html

Run the command:

$ diff first.html second.html > diffs.txt

If diffs.txt contains any text, the file has changed.
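As a commenter points out below, Python's standard library can do this without an external tool; a minimal sketch using difflib:

import difflib

def html_changed(old_path="first.html", new_path="second.html"):
    with open(old_path) as f:
        old_lines = f.readlines()
    with open(new_path) as f:
        new_lines = f.readlines()
    # unified_diff yields nothing when the files are identical
    return any(difflib.unified_diff(old_lines, new_lines))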

SuperNova
  • 25,512
  • 7
  • 93
  • 64
  • I think it's not a good idea. Saving large HTML files to our database is very expensive. That is why I am thinking about **hashes**. – Savad KP Nov 04 '15 at 09:12
  • Can you compress the file and then store the path of the file in the DB? Would this be helpful? – SuperNova Nov 04 '15 at 09:23
  • Saving the whole HTML file (_with or without compressing_) takes much more space than **hash** values. We can very easily convert a file into a **hash**, and we can easily store and compare the hex values obtained from the **hash**. – Savad KP Nov 04 '15 at 09:38
  • You can safely assume the HTML always changes on modern sites, and the oldest ones keep If-Modified-Since. In other words, nothing can be improved here. – Basilevs Dec 28 '15 at 05:19
  • You don't even need an external tool for this, by the way. Python has `difflib`: https://docs.python.org/2.7/library/difflib.html – Martin Valgur Dec 30 '15 at 19:28
  • _"Saving large html file to our database"_ <-- this is your problem right here. You don't ever want to do this, unless you are running some NLP type work - and if so, you would use a database designed for this kind of work (something like couch or mongo). As you are only concerned with changes, the above is a good option. – Burhan Khalid Jan 04 '16 at 04:47
1

Use git, which has excellent reporting capabilities on what has changed between two states of a file; plus you won't eat up disk space as git manages the deltas for you.

You can even tell git to ignore "trivial" changes, such as the addition and removal of whitespace characters, to further optimize the search.

Practically, what this comes down to is parsing the output of `git diff -b --numstat HEAD HEAD^`, which roughly translates to "find me what has changed in all the files, ignoring any whitespace changes, between the current state and the previous state", and which will result in output like this:

2       37      en/index.html

That is, 2 insertions and 37 deletions were made to en/index.html.

Next you'll have to do some experimentation to find a "threshold" at which you would consider a change significant in order to process the files further; this will take time as you will have to train the system (you can also automate this part, but that is another topic all together).
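A minimal sketch of that parsing step, assuming the downloaded pages are committed to a local git repository and treating THRESHOLD as a made-up number you would tune:

import subprocess

THRESHOLD = 5  # total changed lines before a change counts as significant

def significant_changes(repo_dir):
    # --numstat prints "added<TAB>deleted<TAB>path" for each changed file
    out = subprocess.check_output(
        ["git", "diff", "-b", "--numstat", "HEAD", "HEAD^"],
        cwd=repo_dir).decode()
    changed = []
    for line in out.splitlines():
        added, deleted, path = line.split(None, 2)
        if int(added) + int(deleted) >= THRESHOLD:
            changed.append(path)
    return changed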

Unless you have a very good reason to do so, don't use your traditional relational database as a file system. Let the operating system take care of files; that is something it's very good at (and something a relational database is not designed to manage).

Burhan Khalid
  • 169,990
  • 18
  • 245
  • 284
0

You should do an HTTP HEAD request (so you don't download the file) and look at the "Last-modified" header in the response.

import requests

response = requests.head(url)
# not every server sends this header, so fall back to None instead of raising a KeyError
datetime_str = response.headers.get("last-modified")

Then keep checking whether that field changes in a while loop, comparing the datetime difference.
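A minimal sketch of that loop (the 60-second interval is arbitrary):

import time

import requests

def wait_for_change(url, interval=60):
    last = requests.head(url).headers.get("last-modified")
    while True:
        time.sleep(interval)
        current = requests.head(url).headers.get("last-modified")
        if current != last:
            return current  # the server reports a new modification time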

I did a little program on Python to do that:

https://github.com/javierdechile/check_updates_http

Javier Giovannini
  • 2,302
  • 1
  • 19
  • 21