
I want to find identical and very similar images within a truckload of photos. To do this, I want to compare the Levenshtein (or Hamming, not decided yet) distances between their perceptual hashes. To calculate the hashes, I want to use imghash (also not a final decision). imghash lets you select the output format and the number of bits. I assume that changing the number of bits changes accuracy/precision, but does it really? By default, the output is a 16-character hex string, i.e. 64 bits, or roughly 1.8 × 10^19 possible values. That seems like overkill. But is it? And if so, what is a reasonable length?

marko-36

1 Answer


When using imghash and hamming-distance to calculate the similarity of images, it works like this:

  • imghash accepts an optional bits argument, which is 8 by default. A longer hash does mean greater accuracy: for the 'very similar' images I tested, the hashes generated with bits = 4 were identical, but the hashes generated with bits = 8 differed.
  • The maximum Hamming distance (when the images are completely different, e.g. a black vs. a white canvas) equals the number of bits in the hash, which is the bits value squared. With the default of 8, the hash has 64 bits (the 16-character hex string), so the maximum distance is 64. Accordingly, you need to adjust your similarity threshold whenever you change the hash length.

Also:

  • The selected bit length must be divisible by 4.
  • When comparing perceptual hashes, they need to be the same length.
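
To make the threshold point concrete, here is a minimal sketch of a bitwise Hamming distance over two hex-encoded hashes, plus a normalized variant that divides by the maximum possible distance. It does not use imghash or the hamming-distance package. The function names (hammingDistanceHex, normalizedDistance, areSimilar) and the default cut-off of 0.1 are assumptions made up for illustration, not recommended values.

```javascript
// Number of set bits in a 4-bit value (0..15), indexed by the value itself.
const NIBBLE_POPCOUNT = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

// Bitwise Hamming distance between two hex strings of equal length.
// (Hypothetical helper, not part of imghash.)
function hammingDistanceHex(a, b) {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    const x = parseInt(a[i], 16);
    const y = parseInt(b[i], 16);
    if (Number.isNaN(x) || Number.isNaN(y)) {
      throw new Error("hashes must be hex strings");
    }
    // XOR leaves a 1 wherever the two nibbles differ; count those bits.
    distance += NIBBLE_POPCOUNT[x ^ y];
  }
  return distance;
}

// Distance as a fraction of the maximum (each hex character carries 4 bits),
// so the same threshold can be reused across hash lengths.
function normalizedDistance(a, b) {
  return hammingDistanceHex(a, b) / (a.length * 4);
}

// Assumed cut-off: treat images as similar if at most 10% of the bits differ.
function areSimilar(a, b, maxFraction = 0.1) {
  return normalizedDistance(a, b) <= maxFraction;
}
```

With a normalized distance, switching from a 64-bit to a 256-bit hash does not require rescaling the threshold by hand, although the right cut-off still has to be found by testing on your own photos.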
marko-36