
I'm doing some web crawling where I look for certain terms in webpages, find their locations on the page, and then cache them for later use. I'd like to be able to check the page periodically for any major changes. Something like MD5 can be foiled by simply putting the current date and time on the page.

Are there any hashing algorithms that work for something like this?

Jason Baker
  • No, that's the point of all hashing algorithms: they change _a lot_ when the input changes only a bit. – halfdan Apr 13 '11 at 22:13
  • @halfdan - [Wikipedia would disagree with you](http://en.wikipedia.org/wiki/Hash_function#Finding_similar_records). Too bad they don't mention any algorithms for this other than acoustic fingerprinting though. – Jason Baker Apr 13 '11 at 22:43
  • possible duplicate of [Hashing Similarity](http://stackoverflow.com/questions/4834301/hashing-similarity) – Nick Johnson Apr 13 '11 at 23:45
  • Have you been able to find anything? I'm looking for exactly the same thing. – alex Jun 07 '13 at 17:16

4 Answers


A common way to measure document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking for a way to split up the document.
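
For illustration, here is a minimal sketch of word-level shingling in Python; the window size and the Jaccard comparison are arbitrary choices of mine, not part of any particular paper:

```python
import hashlib

def shingles(text, w=4):
    """Reduce a document to a set of hashed, overlapping word w-grams."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(max(1, len(words) - w + 1))
    }

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: 1.0 identical, 0.0 disjoint."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

old = shingles("the quick brown fox jumps over the lazy dog")
new = shingles("the quick brown fox jumps over the lazy dog today")
print(jaccard(old, new))  # stays high: only the trailing shingles changed
```

Adding a date stamp to a page only perturbs the few shingles that overlap it, so the similarity stays near 1 instead of flipping completely the way MD5 does.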

I read a paper a few years back about using Bloom filters for similarity detection: *Using Bloom Filters to Refine Web Search Results*. It's an interesting idea, but I never got around to experimenting with it.
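
I can't vouch for the paper's exact construction, but the general idea can be sketched: insert each document's shingles into a fixed-size Bloom filter and compare the resulting bit vectors. The sizes below are arbitrary illustrative choices:

```python
import hashlib

M = 1024  # bits in the filter (an arbitrary illustrative size)
K = 3     # hash functions per item

def bloom(items):
    """Pack a set of items into an M-bit Bloom filter, held in a Python int."""
    bits = 0
    for item in items:
        for k in range(K):
            digest = hashlib.md5(f"{k}:{item}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:4], "big") % M)
    return bits

def bit_overlap(a, b):
    """Rough similarity estimate: fraction of set bits the filters share."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0
```

Two filters built from mostly-overlapping shingle sets share most of their set bits, so you get a cheap, fixed-size similarity estimate without storing the shingle sets themselves.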

Jim Mischel

This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.

The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.

You also might try a hybrid approach: let a hashing algorithm tell you that *any* change has been made, and use it as a trigger to retrieve an archival copy of the document for a more rigorous (Levenshtein) comparison.
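
As a sketch of that hybrid (the 50-edit threshold is a made-up knob you would tune, and in practice you might prefer a C-backed library such as python-Levenshtein over this pure-Python version):

```python
import hashlib

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def changed_significantly(archived_text, new_text, threshold=50):
    """Cheap hash check first; only run the expensive edit-distance
    comparison when the hashes differ."""
    if (hashlib.md5(archived_text.encode()).digest()
            == hashlib.md5(new_text.encode()).digest()):
        return False  # byte-identical, nothing changed
    return levenshtein(archived_text, new_text) > threshold
```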

Drew Hall

http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper-left quadrant of the result (where the important information is). Then record a 0 for each value less than the average and a 1 for each value greater than the average. The result is pretty robust to small changes.
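
That recipe translates fairly directly into Python. This is my own sketch assuming Pillow, NumPy and SciPy, not phash.org's actual code; the 32x32 input and 8x8 low-frequency block follow common convention:

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path, hash_size=8):
    """Greyscale + shrink (the shrink doubles as the blur), 2-D DCT,
    keep the low-frequency corner, threshold each value against the mean."""
    img = Image.open(path).convert("L").resize((32, 32), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    freq = dct(dct(pixels, axis=0), axis=1)    # 2-D DCT
    low = freq[:hash_size, :hash_size]         # upper-left block = low frequencies
    bits = (low > low.mean()).flatten()
    return sum(1 << i for i, bit in enumerate(bits) if bit)

def hamming(h1, h2):
    """Bits that differ; small values mean perceptually similar images."""
    return bin(h1 ^ h2).count("1")
```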

Min-Hashing is another possibility: extract features from your text, hash each one, and record the minimum hash value under each of several hash functions. Concatenate those minima to make a signature; similar documents produce signatures that agree in most positions.
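
A bare-bones min-hash sketch along those lines (64 hash functions is an arbitrary choice, and the features could be the shingles from the accepted answer):

```python
import hashlib

NUM_HASHES = 64  # signature length; an arbitrary illustrative choice

def minhash_signature(features):
    """Keep the minimum hash of the (non-empty) feature set under each of
    NUM_HASHES seeded hash functions. Two documents' signatures agree at
    each position with probability equal to their Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features)
        for seed in range(NUM_HASHES)
    ]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```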

For both of the above, use a vantage point tree so that you can search for near-hits.
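
Here's a bare-bones vantage-point tree over Hamming distance, just to show the shape of the idea; real implementations add smarter vantage-point selection and other refinements:

```python
import random

def hamming(a, b):
    return bin(a ^ b).count("1")

class VPTree:
    """Each node picks a vantage point and splits the remaining hashes by
    whether they fall inside or outside the median distance to it."""
    def __init__(self, points):
        idx = random.randrange(len(points))
        self.vp = points[idx]
        rest = points[:idx] + points[idx + 1:]
        if not rest:
            self.mu, self.inside, self.outside = 0, None, None
            return
        dists = sorted(hamming(self.vp, p) for p in rest)
        self.mu = dists[len(dists) // 2]  # median distance to the vantage point
        inner = [p for p in rest if hamming(self.vp, p) < self.mu]
        outer = [p for p in rest if hamming(self.vp, p) >= self.mu]
        self.inside = VPTree(inner) if inner else None
        self.outside = VPTree(outer) if outer else None

    def near(self, query, radius, out):
        """Collect stored hashes within `radius` of `query`, using the
        triangle inequality to skip whole subtrees."""
        d = hamming(query, self.vp)
        if d <= radius:
            out.append(self.vp)
        if self.inside and d - radius < self.mu:
            self.inside.near(query, radius, out)
        if self.outside and d + radius >= self.mu:
            self.outside.near(query, radius, out)
        return out

# Hypothetical usage with the perceptual hashes from above:
# tree = VPTree([phash(p) for p in cached_image_paths])
# close = tree.near(phash("new_snapshot.png"), radius=5, out=[])
```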

Eyal

I am sorry to say, but hash algorithms are exact by design. There is none capable of being tolerant of minor differences. You should take another approach.

Rafael Colucci
  • Ok, so perhaps it won't be *called* a hashing algorithm. But it doesn't sound like there's any confusion as to what I'm looking for, only whether it should be called a hashing algorithm. – Jason Baker Apr 13 '11 at 22:32
  • I just answered your question. You asked "Is there a hashing algorithm that is tolerant of minor differences?" and I said there is not. Perhaps you should have asked a different question. – Rafael Colucci Apr 14 '11 at 00:10