
I have the following code, which finds images with equal hashes (identical images). Suppose I instead wanted to find images whose Hamming distance is under a certain threshold: can that be incorporated into Django querysets, or done with raw SQL somehow? I don't want to fetch everything and compare in Python, because that is very slow and I have many images.

Current code:

def duplicates(request):
    duplicate_images = []
    images = Image.objects.all()
    for image in images:
        duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk)
        for duplicate in duplicates:
            duplicate_images.append([image, duplicate])
        if len(duplicate_images) > 1000:
            break
davegri
  • I already have hashes in my database and implementing a function that compares the hamming distance is easy, that's not my question but thanks. – davegri Oct 04 '15 at 18:54

1 Answer


Here is how to implement this using a PostgreSQL extension:

https://github.com/eulerto/pg_similarity

Installation:

$ git clone https://github.com/eulerto/pg_similarity.git
$ cd pg_similarity
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql mydb
psql (9.3.5)
Type "help" for help.

mydb=# CREATE EXTENSION pg_similarity;
CREATE EXTENSION

Now you can build a Django queryset with a custom "WHERE" clause that uses the hamming_text function:

image = Image.objects.get(pk=1252) # the image you want to compare to
similar = Image.objects.extra(where=['hamming_text(hash,%s)>=0.88'],
                              params=[image.hash])

and voila, it works!

Note: the value returned here is automatically normalized into a similarity score, so 0 means completely different and 1 means identical.
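
To go from "a Hamming distance under a certain number" to the normalized threshold used in the query, something like the following sketch might help. It assumes the normalized score equals 1 - distance / length for equal-length hashes, which is my assumption about the normalization and not something confirmed above. The helper names `hamming_distance`, `similarity_threshold` and `similar_filter` are made up for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length hash strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity_threshold(max_distance, hash_length):
    """Lowest normalized score that is still within max_distance.

    Assumption: score = 1 - distance / hash_length.
    """
    if not 0 <= max_distance <= hash_length:
        raise ValueError("max_distance must be between 0 and hash_length")
    return 1 - max_distance / hash_length


def similar_filter(reference_hash, max_distance):
    """Build keyword arguments for QuerySet.extra() (hypothetical helper)."""
    threshold = similarity_threshold(max_distance, len(reference_hash))
    return {
        # The column name "hash" matches the model field used in the question.
        "where": ["hamming_text(hash, %s) >= %s"],
        "params": [reference_hash, threshold],
    }
```

With these, `Image.objects.extra(**similar_filter(image.hash, 8))` should keep rows whose hash differs from the reference in at most 8 positions, under the assumption stated above. The reference image itself matches too (distance 0), so you may want to add `.exclude(pk=image.pk)`. Since the comparison is done on floats, a hash sitting exactly on the limit could fall on either side.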

davegri