If you're trying to make a tool that can be applied to arbitrary sites, you could still start by getting it working for a few specific ones: download them repeatedly, identify the exact differences you'd like to ignore, and try to handle those issues reasonably generically without ignoring meaningful differences. A quick hands-on sampling like that should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against an increasing number of sites and tweak as you go.
Would you look at the KB size of the HTML? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has been "changed"?
That's incredibly rough, and I'd avoid it if at all possible. But you do need to weigh up the cost of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.
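For what it's worth, here's a minimal Python sketch of that kind of check, using the 5% figure from your question purely as an example default:

```python
def changed_by_length(old_html: str, new_html: str, threshold: float = 0.05) -> bool:
    """Very crude change test: flag a change only if the document length
    differs by more than `threshold` (5% by default)."""
    old_len, new_len = len(old_html), len(new_html)
    if old_len == 0:
        return new_len != 0
    return abs(new_len - old_len) / old_len > threshold
```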
Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have been changed?
You can make such a "hash", but it's very hard to tune its sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that ordering a 2k "hash"; you can later "diff" it against the ordering from a fresh download to see how much it has changed. (To save memory, you might get away with using just the printable ASCII values, or even just letters after standardising capitalisation.)
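A rough Python sketch of that byte-ordering idea (the distance measure below is just one arbitrary way to "diff" two orderings, and any threshold on it would need empirical tuning):

```python
from collections import Counter

def byte_frequency_order(data: bytes) -> list:
    """Rank all 256 possible byte values by how often they appear in the
    document, most frequent first; this ordering is the crude 'hash'."""
    counts = Counter(data)
    return sorted(range(256), key=lambda b: (-counts[b], b))

def ordering_distance(old_order: list, new_order: list) -> int:
    """Sum of how far each byte value moved between the two rankings
    (0 means the orderings are identical)."""
    new_pos = {b: i for i, b in enumerate(new_order)}
    return sum(abs(i - new_pos[b]) for i, b in enumerate(old_order))
```

You'd compute and store the ordering for each download, then compare it against the next download's ordering, treating small distances as "unchanged".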
An alternative is to generate a set of hashes for different slices of the document: e.g. divide it into header vs. body, then split the body by heading levels and then paragraphs, until you've got at least the desired level of granularity (e.g. 30 slices). You can then say that if only 2 of the 30 slices have changed, you'll consider the document unchanged.
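A simplified Python sketch of the slice-hash idea, using fixed-size chunks instead of real structural slices (splitting on header/headings/paragraphs would need actual HTML parsing, which I've left out):

```python
import hashlib

def slice_hashes(html: str, slices: int = 30) -> list:
    """Hash up to `slices` roughly equal-sized chunks of the document."""
    step = max(1, -(-len(html) // slices))  # ceiling division
    return [hashlib.sha256(html[i:i + step].encode("utf-8")).hexdigest()
            for i in range(0, len(html), step)]

def changed_slice_count(old_hashes: list, new_hashes: list) -> int:
    """Count slices whose hashes differ, plus any slices added or removed."""
    differing = sum(a != b for a, b in zip(old_hashes, new_hashes))
    return differing + abs(len(old_hashes) - len(new_hashes))
```

Bear in mind that fixed-size chunks shift as soon as the document length changes, so slicing on structural boundaries is considerably more robust in practice.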
You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
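For example, in Python (the time pattern below is only a rough illustration; real pages will need patterns tuned to whatever actually varies on them):

```python
import re

# Replace anything that looks like a clock time (e.g. "14:05" or "2:30 pm")
# with a fixed placeholder, so timestamp churn alone doesn't count as a change.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*[ap]\.?m\.?)?\b", re.IGNORECASE)

def normalise(html: str) -> str:
    return TIME_RE.sub("<time>", html)
```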
You could also lower the tolerance for change progressively as the time since you last processed the page increases, which lessens or caps the "cost" of mistakenly deeming it unchanged.
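A sketch of that, again assuming a simple length-change threshold; all of the numbers here are placeholders to tune:

```python
def change_threshold(hours_since_last_processed: float,
                     base: float = 0.05, floor: float = 0.01) -> float:
    """Start at a 5% threshold and shrink it the longer the page has gone
    since it was last processed, halving every 24 hours down to a 1% floor."""
    decay = 0.5 ** (hours_since_last_processed / 24.0)
    return max(floor, base * decay)
```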