I want to find identical and very similar images within a truckload of photos. To do this, I want to compare the Levenshtein (or Hamming, not decided yet) distances between their perceptual hashes. To calculate the hashes, I want to use imghash (also not a final decision). imghash lets you select the output format and the number of bits. I assume that changing the number of bits changes the accuracy/precision, but does it really? By default, the output is a 16-character hex string (64 bits, so 2^64 or roughly eighteen quintillion combinations). That seems like overkill. But is it? And if so, what is a reasonable length?
1 Answer
When using imghash and hamming-distance to calculate the similarity of images, it works like this:
- imghash accepts [, bits] as an optional argument, which is 8 by default. A longer hash does mean greater accuracy: for the 'very similar' images I tested this with, the hashes were the same with bits = 4 but differed with bits = 8.
- The maximum Hamming distance (when the images are completely different, e.g. a black vs. a white canvas) equals bits^2, which is the total number of bits in the hash (64 for the default of 8). Accordingly, you need to scale your similarity threshold to the hash length you choose.
Also:
- The selected bits value must be divisible by 4.
- Perceptual hashes can only be compared if they are the same length.
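
To make the points above concrete, here is a minimal sketch of comparing two hex-encoded hashes. It does not call imghash or the hamming-distance package; the helper names (hexHammingDistance, isSimilar) and the 10% default threshold are my own assumptions for illustration, and the right threshold depends on your photos. Expressing the threshold as a fraction of the total bit count keeps it meaningful when you change the bits argument.

```javascript
// Sketch only: these helpers are hypothetical, not part of imghash or hamming-distance.

// Number of set bits for every 4-bit value (0..15).
const NIBBLE_POPCOUNT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

// Hamming distance (number of differing bits) between two equal-length hex strings.
function hexHammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Hashes must be the same length to be compared");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    const x = parseInt(a[i], 16);
    const y = parseInt(b[i], 16);
    if (Number.isNaN(x) || Number.isNaN(y)) {
      throw new Error("Hashes must be hex strings");
    }
    // XOR leaves a 1 wherever the two nibbles differ.
    distance += NIBBLE_POPCOUNT[x ^ y];
  }
  return distance;
}

// Treat two hashes as similar when at most maxFraction of their bits differ.
// The 0.1 default is an arbitrary assumption; tune it on your own images.
function isSimilar(a, b, maxFraction = 0.1) {
  const totalBits = a.length * 4; // each hex character encodes 4 bits
  return hexHammingDistance(a, b) / totalBits <= maxFraction;
}

// A 16-character hash (bits = 8) has 64 bits, so the maximum distance is 64.
console.log(hexHammingDistance("ffffffffffffffff", "0000000000000000")); // 64
console.log(isSimilar("ffffffffffffffff", "fffffffffffffffe")); // true (1 of 64 bits differs)
```

With a fixed absolute threshold such as "distance below 5", the same value would be much stricter for a 256-bit hash than for a 16-bit one, which is why the sketch divides by the total number of bits.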

marko-36