If you're interested in finding near duplicates, including images that have been resized, you could apply difference hashing. More on hashing here. The code below is adapted from a Real Python blog post to make it work in Python 3. It uses the imagehash library linked above, whose documentation covers the different kinds of hashing. You should be able to copy and paste both scripts and run them directly from the command line without editing them.
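To give a rough feel for what a difference hash is, here is a minimal pure-Python sketch of the core idea: compare each pixel with its right-hand neighbour and record one bit per comparison. It assumes the image has already been converted to grayscale and shrunk to a tiny grid, which is the step that makes resized copies tend to land on the same hash. The function names and the grid values are made up for illustration, and the library's real implementation may differ in its details.

```python
# Minimal difference-hash sketch on an already-resized grayscale grid.
# Assumption: `pixels` is a list of rows, each holding one more value than
# the number of bits wanted per row (e.g. 8 rows x 9 columns for 64 bits).

def dhash_bits(pixels):
    """Return one bit per adjacent pixel pair: 1 if the right pixel is brighter."""
    bits = []
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_hex(bits):
    """Pack the bits into a hex string, most significant bit first."""
    value = int("".join(str(b) for b in bits), 2)
    width = (len(bits) + 3) // 4
    return format(value, "0{}x".format(width))


# A tiny 2 x 5 grid (hypothetical data) gives 2 x 4 = 8 bits.
grid = [
    [10, 20, 15, 15, 30],
    [50, 40, 45, 60, 55],
]
small_hash = bits_to_hex(dhash_bits(grid))

# Scaling the brightness uniformly leaves every left/right comparison
# unchanged, so the hash stays the same.
brighter = [[2 * v for v in row] for row in grid]
same_after_brightening = bits_to_hex(dhash_bits(brighter)) == small_hash
```

Because the hash only records which neighbour is brighter, it ignores overall brightness changes, which is part of why it suits near-duplicate detection.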
This first script (index.py) creates a difference hash for each image and stores it in a shelf, a persistent dictionary that you can access later like a database, together with the filename(s) of the images that have that hash:
from PIL import Image
import imagehash
import argparse
import shelve
import glob
import os

# This is just so you can run it from the command line
ap = argparse.ArgumentParser()
ap.add_argument('-d', '--dataset', required=True,
                help='path to input dataset of images')
ap.add_argument('-s', '--shelve', required=True,
                help='output shelve database')
args = ap.parse_args()

# open the shelve database
db = shelve.open(args.shelve, writeback=True)

# loop over the image dataset (only .jpg files are picked up)
for imagePath in glob.glob(os.path.join(args.dataset, '*.jpg')):
    # load the image and compute the difference hash
    image = Image.open(imagePath)
    h = str(imagehash.dhash(image))
    print(h)

    # extract the filename from the path and update the database, using the
    # hash as the key and appending the filename to the list of values
    filename = os.path.basename(imagePath)
    db[h] = db.get(h, []) + [filename]

# close the shelve database
db.close()
Run on the command line:
python index.py --dataset ./image_directory --shelve db.shelve
Run in a Jupyter notebook:
%run index.py --dataset ./image_directory --shelve db.shelve
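If the shelf part is unfamiliar, here is a small standalone sketch of the data structure the index script builds: each hash string is a key, and its value is the list of filenames that share that hash. The hashes and filenames below are invented for illustration, and a temporary directory is used so the example leaves no files behind.

```python
import os
import shelve
import tempfile

# Hypothetical hashes and filenames, only to show the shape of the data.
records = [
    ("0f0f3c3c78786060", "cat.jpg"),
    ("0f0f3c3c78786060", "cat_resized.jpg"),
    ("ffff000011112222", "dog.jpg"),
]

# Use a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    shelf_path = os.path.join(tmpdir, "db.shelve")

    # Writing: each hash maps to the list of filenames that share it.
    with shelve.open(shelf_path) as db:
        for image_hash, filename in records:
            db[image_hash] = db.get(image_hash, []) + [filename]

    # Reading back in a separate open, like querying a small database.
    with shelve.open(shelf_path) as db:
        duplicates = db.get("0f0f3c3c78786060", [])
        missing = db.get("not-a-real-hash", [])

print(duplicates)
print(missing)
```

Reassigning `db[image_hash]` with a new list, rather than appending to the stored list in place, is what makes the change persist without relying on `writeback`.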
Now that everything is stored in a shelf, you can query it with the filename of an image you want to check. This second script (search.py) prints the filenames of the images with a matching hash and also opens them:
from PIL import Image
import imagehash
import argparse
import shelve
import os

# arguments for command line
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
                help="path to dataset of images")
ap.add_argument("-s", "--shelve", required=True,
                help="path to the shelve database")
ap.add_argument("-q", "--query", required=True,
                help="path to the query image")
args = ap.parse_args()

# open the shelve database
db = shelve.open(args.shelve)

# load the query image, compute the difference hash, and grab the images
# from the database that have the same hash value (none if the hash is absent)
query = Image.open(args.query)
h = str(imagehash.dhash(query))
filenames = db.get(h, [])
print("found {} images".format(len(filenames)))

# loop over the images
for filename in filenames:
    print(filename)
    image = Image.open(os.path.join(args.dataset, filename))
    image.show()

# close the shelve database
db.close()
Run on the command line to look through image_directory for images with the same hash as ./directory/someimage.jpg:
python search.py --dataset ./image_directory --shelve db.shelve --query ./directory/someimage.jpg
Again, this is modified from the Real Python blog post linked above, which was written for Python 2.7; the version here should work out of the box on Python 3. Just change the command line as you need to. If I remember correctly, the Python 2/3 issue was just with argparse and not the image libraries.
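One caveat: the search script only returns images whose hash is exactly identical to the query's. Near duplicates may differ by a few bits, so a possible extension is to compare hashes by Hamming distance (the number of differing bits) instead of exact key lookup. The sketch below works on the hex strings stored as keys; the function names, the sample index, and the threshold of 4 bits are assumptions for illustration, not something from the original post.

```python
def hamming_distance(hex_a, hex_b):
    """Count differing bits between two equal-length hex hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def close_matches(query_hash, hash_to_files, max_distance=4):
    """Return (distance, filename) pairs within max_distance bits, nearest first."""
    found = []
    for stored_hash, filenames in hash_to_files.items():
        distance = hamming_distance(query_hash, stored_hash)
        if distance <= max_distance:
            found.extend((distance, name) for name in filenames)
    return sorted(found)


# Hypothetical index contents; a shelf opened with shelve.open()
# offers the same .items() interface as this plain dict.
index = {
    "00000000000000ff": ["a.jpg"],
    "00000000000000fe": ["a_small.jpg"],
    "ffffffffffffffff": ["b.jpg"],
}
matches = close_matches("00000000000000ff", index)
print(matches)
```

This scans every key, so it is slower than a direct lookup, but for a modest image collection that is usually fine. The right threshold depends on your images and is worth tuning by eye.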