I've been playing around with writing a scraper that scrapes Deviantart.com. It saves a copy of each new image locally, and also creates a record in a PostgreSQL DB for the image. My problem: as new images come in, how do I know if this new image corresponds to an image I've seen before? Dupes are fairly rare on DA, but at the same time, this is an interesting problem in a more general sense.
Thoughts on ways to proceed?
Right now the PostgreSQL DB is populated as I scrape images, and it has a table that looks like:
CREATE TABLE Image
(
id SERIAL PRIMARY KEY NOT NULL,
url varchar(5000) UNIQUE NOT NULL,
dateadded timestamp without time zone default (now() at time zone 'utc'),
width int,
height int
);
Where url is the link to the image as I scraped it from DA (ex: http://th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png), dateadded is the datetime the scraper found the image, and width & height are the image dimensions.
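For reference, a minimal sketch of how a row might get inserted as the scraper runs. This assumes psycopg2 as the Postgres driver; the connection settings and the example dimensions are placeholders:

import psycopg2  # assuming psycopg2 as the Postgres driver

def record_image(conn, url, width, height):
    """Insert a newly scraped image into the Image table.

    id and dateadded come from the column defaults; the UNIQUE
    constraint on url rejects a second row for a link already seen.
    """
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO Image (url, width, height) VALUES (%s, %s, %s)",
            (url, width, height),
        )

conn = psycopg2.connect(dbname="deviantart")  # placeholder connection settings
record_image(
    conn,
    "http://th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png",
    800, 600,  # hypothetical dimensions, read from the downloaded file in practice
)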
I currently don't store the image itself in the database, but I do keep a local mirror -- I take the url for the image and wget -r -nc
the file. So for a url: http://th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png I keep a local copy at <somedir>/th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png
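(A rough sketch of that URL-to-path mapping, matching the layout wget -r produces; MIRROR_ROOT is a placeholder for <somedir>:)

import os
from urllib.parse import urlparse

MIRROR_ROOT = "/data/da-mirror"  # placeholder for <somedir>

def local_path(url):
    """Return where the mirrored copy of url lives on disk,
    i.e. <somedir>/<host>/<path>, matching wget -r's directory layout."""
    parsed = urlparse(url)
    return os.path.join(MIRROR_ROOT, parsed.netloc, parsed.path.lstrip("/"))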
Now, image recognition in the general case is quite hard. I want to be able to handle slight resizes, which I could account for by normalizing all stored images to a specific resolution and normalizing the query image to that same resolution at query time. I also want to handle changes of format (PNG vs JPG, etc), which I could do by reading each image file into a normalized representation (ex: uncompressed RGB values for each pixel, though ideally some "slack" would be tolerated here).
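A minimal sketch of that normalization step, assuming Pillow for decoding and an arbitrary canonical resolution:

from PIL import Image  # assuming Pillow is available

NORMALIZED_SIZE = (256, 256)  # arbitrary canonical resolution

def normalized_pixels(path):
    """Decode any supported format (PNG, JPG, ...), force RGB, and
    resize to a canonical resolution so two copies of the same image
    can be compared regardless of original size or format."""
    with Image.open(path) as im:
        im = im.convert("RGB").resize(NORMALIZED_SIZE, Image.LANCZOS)
        return list(im.getdata())  # uncompressed (R, G, B) tuples

Comparing those pixel lists exactly is strict; the "slack" mentioned above would need some fuzzier distance threshold layered on top.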
Nice to haves (would be willing to give up for simplification/better accuracy):
- I'd like to be able to handle cropping an image (ex: I've previously seen imageA, and somebody takes imageA, crops it, and uploads it as imageB; I'd like to notice that as a duplicate).
- I'd like to be able to handle watermarking an image with a logo.
- I'd like to be able to handle cropping in a case where the new image to classify is a subimage of a previously seen image (ie - I have imageA stored, somebody takes imageA and crops it, I'd like to be able to map that cropped image to imageA).
Constraints/extra info:
- I'm not at all interested in finding images that are different yet similar (ex: two distinct photos of the same Red Bus should be reported as two distinct images)
- while I'm not entirely opposed to using metadata (ex: artist, image category, etc), I'd like to keep this constrained to just the image data (EXIF data, resolution, RGB colour values) as much as possible.
- an image that is sized down and appears inside a new, larger image I wish to consider as different. Ex: I have imageA, I resize it to 50x50, and that 50x50 grid appears in a new image; I would not consider the new image "the same" as imageA (though I suppose by the criteria outlined previously I would consider imageA a duplicate of the new image).
- It would be nice but not required if one could detect "minor" revisions in the image (ex: a blanket change to the gamma value in an image, etc).
Thoughts? Suggestions?
For my use case I'm far more concerned about false positives than false negatives, and as such a "fuzzy match" approach should err on the side of caution.
In case it matters I'm writing all of this in Python, though TBH I'm happy to use an alternate tech if it solves my problem elegantly/efficiently.