
I am using Twython to extract images shared on Twitter for #apple. Twitter assigns each image a unique ID, but many of the images are visually identical. How can I detect the duplicate images?

For local images, I found a solution:

Calculate a hash for every image and then remove the duplicates.

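import hashlib
import os
import struct

def jpeg(fh):
    """Return an MD5 hex digest of the JPEG scan data, skipping metadata segments."""
    md5 = hashlib.md5()
    # Every JPEG starts with the SOI marker; fh must be opened in binary mode.
    assert fh.read(2) == b"\xff\xd8"
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of scan: hash everything from here on
            md5.update(fh.read())
            break
        else:
            # Skip the segment body; the length field counts its own 2 bytes,
            # which were already read.
            fh.seek(length - 2, os.SEEK_CUR)
    return md5.hexdigest()

with open("two.jpg", "rb") as fh:
    print("Hash: %s" % jpeg(fh))  # Gives the hash of the image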
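The "remove the duplicates" step for local files could look roughly like this. It is only a sketch: `file_digest`, `group_duplicates`, and `remove_duplicates` are hypothetical names, and the default digest hashes the whole file rather than only the JPEG scan data, so pass your own `digest` callable if metadata should be ignored.

```python
import hashlib
from collections import defaultdict


def file_digest(path):
    """Stand-in hash (an assumption): MD5 of the whole file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    return md5.hexdigest()


def group_duplicates(paths, digest=file_digest):
    """Map each digest to the list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[digest(path)].append(path)
    return dict(groups)


def remove_duplicates(paths, digest=file_digest):
    """Return one path per distinct digest, keeping the first occurrence."""
    return [same[0] for same in group_duplicates(paths, digest).values()]
```

`group_duplicates` is kept separate so the duplicate sets can be inspected before anything is deleted.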

However, Twitter images are stored on an external server, so this approach does not work as-is. How can I obtain only the unique images from Twitter?

For example, the Twitter data gives:

http://pbs.twimg.com/media/CKwk2doVEAE-Y9g.jpg
http://pbs.twimg.com/media/CKwka9fUwAEmdLr.jpg

and all three are the same image.

userxxx
  • You'll have to do the same thing with the remote images. You'll have to grab them via the requests library and perform the same sort of operation you are doing above. – Robert Moskal Jul 27 '15 at 17:13
  • I think you have to download the images. You can delete them after processing. But I am eager to see if someone comes up with a solution. – salmanwahed Jul 27 '15 at 17:26
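Following the comments, here is a rough sketch of downloading each image into memory and hashing it without writing anything to disk. The names `jpeg_scan_hash`, `download`, and `unique_image_urls` are hypothetical, and the download uses the standard library's `urllib.request` rather than the `requests` library mentioned in the comment, purely to avoid a dependency. It only catches images whose compressed scan data is byte-identical; re-encoded or resized copies would hash differently.

```python
import hashlib
import struct
import urllib.request


def jpeg_scan_hash(data):
    """MD5 of the bytes that follow the start-of-scan header fields in a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        marker, length = struct.unpack(">2H", data[pos:pos + 4])
        if marker & 0xFF00 != 0xFF00:
            raise ValueError("unexpected bytes where a marker should be")
        if marker == 0xFFDA:  # start of scan: hash the rest of the file
            return hashlib.md5(data[pos + 4:]).hexdigest()
        pos += 2 + length  # 2 marker bytes + segment (length counts itself)
    raise ValueError("no start-of-scan marker found")


def download(url):
    """Fetch the raw bytes at url (assumes the server answers a plain GET)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def unique_image_urls(urls, fetch=download):
    """Keep the first URL seen for each distinct image hash."""
    first_seen = {}
    for url in urls:
        first_seen.setdefault(jpeg_scan_hash(fetch(url)), url)
    return list(first_seen.values())
```

Calling `unique_image_urls` with the list of media URLs extracted from the tweets would return one URL per distinct hash; the `fetch` parameter is there so a different downloader (for example one built on `requests`) can be swapped in.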

0 Answers