I am using Twython to extract images shared on Twitter for #apple. Each image has a unique image ID assigned by Twitter, but the images themselves are visually identical. How can I detect duplicate images?
For local images, I found a solution:
Calculate a hash for every image and then remove the duplicates.
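import hashlib
import os
import struct

def jpeg(fh):
    """Return the MD5 hex digest of a JPEG's image data, skipping header segments."""
    md5 = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"  # every JPEG starts with the SOI marker
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of scan: the rest of the file is image data
            md5.update(fh.read())
            break
        else:
            # Skip this segment's payload (the length field counts its own 2 bytes)
            fh.seek(length - 2, os.SEEK_CUR)
    return md5.hexdigest()

# The file must be opened in binary mode
with open("two.jpg", "rb") as fh:
    print("Hash: %s" % jpeg(fh))  # Gives the hash of the image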
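The "remove the duplicates" step is not shown in the snippet, so here is a minimal sketch of what I mean, assuming Python 3. The names `file_digest` and `unique_files` are hypothetical, and for simplicity the default digest hashes the whole file; any function that takes a path and returns a digest (for example, a variant of the JPEG hash that returns its value) could be passed as `digest` instead.

```python
import hashlib


def file_digest(path):
    """Hypothetical helper: MD5 hex digest of the whole file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


def unique_files(paths, digest=file_digest):
    """Return the first path seen for each distinct digest, in input order."""
    seen = {}
    for path in paths:
        # setdefault keeps the first path and ignores later duplicates
        seen.setdefault(digest(path), path)
    return list(seen.values())
```

Calling `unique_files(["one.jpg", "two.jpg"])` would return only the files whose digests differ. This only catches files whose hashed bytes are exactly equal.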
However, Twitter images are stored on an external server, and this approach does not work for them. How can I obtain only the unique images from Twitter?
For example, the Twitter data gives:
http://pbs.twimg.com/media/CKwk2doVEAE-Y9g.jpg
http://pbs.twimg.com/media/CKwka9fUwAEmdLr.jpg
and all three are the same image.
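To make the goal concrete, here is a rough sketch of the kind of thing I am after: download each image into memory and apply the same hashing idea to the downloaded bytes. It assumes Python 3 and uses only the standard library (`urllib.request.urlopen`, `io.BytesIO`). The names `jpeg_data_hash`, `fetch_bytes`, and `unique_urls` are hypothetical, and I am not sure this is the right approach: it would only catch copies whose compressed image data is byte-identical, not images that Twitter has re-encoded.

```python
import hashlib
import io
import os
import struct
from urllib.request import urlopen


def jpeg_data_hash(fh):
    """MD5 hex digest of the JPEG data after the Start of Scan marker."""
    md5 = hashlib.md5()
    if fh.read(2) != b"\xff\xd8":  # SOI marker
        raise ValueError("not a JPEG stream")
    while True:
        header = fh.read(4)
        if len(header) < 4:
            raise ValueError("truncated JPEG stream")
        marker, length = struct.unpack(">2H", header)
        if marker & 0xFF00 != 0xFF00:
            raise ValueError("unexpected marker")
        if marker == 0xFFDA:  # Start of scan: hash everything that follows
            md5.update(fh.read())
            return md5.hexdigest()
        # Skip the segment payload (length includes its own 2 bytes)
        fh.seek(length - 2, os.SEEK_CUR)


def fetch_bytes(url):
    """Hypothetical helper: download the raw bytes behind a URL."""
    with urlopen(url) as resp:
        return resp.read()


def unique_urls(urls, fetch=fetch_bytes):
    """Return the first URL seen for each distinct image-data hash."""
    seen = {}
    for url in urls:
        # BytesIO gives the downloaded bytes the same read/seek API as a file
        digest = jpeg_data_hash(io.BytesIO(fetch(url)))
        seen.setdefault(digest, url)
    return list(seen.values())
```

With the URLs above, the call would be `unique_urls(["http://pbs.twimg.com/media/CKwk2doVEAE-Y9g.jpg", "http://pbs.twimg.com/media/CKwka9fUwAEmdLr.jpg"])`. The `fetch` parameter is only there so the download step can be swapped out, for example when testing without network access.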