
I've been playing around with writing a scraper for DeviantArt.com. It saves a copy of new images locally, and also creates a record for each image in a PostgreSQL DB. My problem: as new images come in, how do I know whether a new image corresponds to an image I've seen before? Dupes are fairly rare on DA, but at the same time, this is an interesting problem in a more general sense.

Thoughts on ways to proceed?

Right now the PostgreSQL DB is populated as I scrape images, and it has a table which looks like:

CREATE TABLE Image
(
    id SERIAL PRIMARY KEY NOT NULL,
    url varchar(5000) UNIQUE NOT NULL,
    dateadded timestamp without time zone default (now() at time zone 'utc'),
    width int,
    height int
);

Where url is the link to the image as I scraped it from DA (ex: http://th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png), dateadded is the datetime the scraper found the image, and width & height are the image dimensions.
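
For reference, the insert side of the scraper is trivial; a minimal sketch with psycopg2 (the function name and the database name in the usage line are just placeholders, not my actual code):

import psycopg2

def record_image(conn, url, width, height):
    # Insert one scraped image; the UNIQUE constraint on url rejects exact re-scrapes.
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(
            "INSERT INTO Image (url, width, height) VALUES (%s, %s, %s) RETURNING id",
            (url, width, height),
        )
        return cur.fetchone()[0]

# e.g. record_image(psycopg2.connect(dbname="da_scraper"), some_url, w, h)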

I currently don't store the image itself in the database, but I do keep a local mirror -- I take the url for the image and wget -r -nc the file. So for a url: http://th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png I keep a local copy at <somedir>/th05.deviantart.net/fs70/PRE/f/2014/222/2/3/sketch_dump_56_by_lilaira-d7uj8pe.png
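
(For what it's worth, mapping a URL back to its mirrored path is just the hostname plus the URL path joined under the mirror root; something like the following, where MIRROR_ROOT stands in for <somedir>:)

import os
from urllib.parse import urlparse  # the urlparse module on Python 2

MIRROR_ROOT = "/path/to/mirror"  # stand-in for <somedir>

def local_mirror_path(url):
    # wget -r lays files out as <host>/<path>, so reproduce that under the root.
    parsed = urlparse(url)
    return os.path.join(MIRROR_ROOT, parsed.netloc, parsed.path.lstrip("/"))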

Now, image recognition in the general case is quite hard. I want to be able to handle things like slight resizes, which I could account for by normalizing all images kept to a specific resolution and normalizing the query image to that same resolution at query time. I want to be able to handle things like a change of format (PNG vs JPG vs etc), which I could do by reading an image file into a normalized format (ex: uncompressed RGB values for each pixel, though ideally some "slack" would be tolerated here).
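
To make that concrete, the kind of normalization I have in mind is roughly this (using Pillow; the 256x256 working resolution is an arbitrary choice on my part):

from PIL import Image
import numpy as np

NORMALIZED_SIZE = (256, 256)  # arbitrary working resolution

def normalized_pixels(path):
    # Decode any supported format (PNG, JPG, ...) into plain RGB, then resize
    # so that slightly-resized uploads end up directly comparable.
    img = Image.open(path).convert("RGB")
    img = img.resize(NORMALIZED_SIZE, Image.LANCZOS)
    return np.asarray(img, dtype=np.uint8)  # shape (256, 256, 3)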

Nice to haves (would be willing to give up for simplification/better accuracy):

  • I'd like to be able to handle cropping of an image (ex: I've previously seen imageA, somebody takes imageA, crops it, and uploads it as imageB; I'd like to notice that as a duplicate).
  • I'd like to be able to handle watermarking an image with a logo.
  • I'd like to be able to handle cropping in the case where the new image to classify is a subimage of a previously seen image (i.e. I have imageA stored, somebody takes imageA and crops it, and I'd like to be able to map that cropped image back to imageA).

Constraints/extra info:

  • I'm not at all interested in finding images that are different yet similar (ex: two distinct photos of the same Red Bus should be reported as two distinct images)
  • while I'm not entirely opposed to using metadata (ex: artist, image category, etc), I'd like to keep this as constrained to just the image data (EXIF data, resolution, RGB colour values) as possible.
  • an image that is sized down and appears within a new, larger image I wish to consider as different. Ex: I have imageA, I resize it to 50x50, and that 50x50 grid appears in a new image; I would not consider the new image "the same" as imageA (though I suppose by the criteria outlined previously I would consider imageA a duplicate of the new image).
  • It would be nice, but not required, if one could detect "minor" revisions in the image (ex: a blanket change to the gamma value in an image, etc).

Thoughts? Suggestions?

For my use case I'm far more concerned about false positives than false negatives, and as such a "fuzzy match" approach should err on the side of caution.

In case it matters I'm writing all of this in Python, though TBH I'm happy to use an alternate tech if it solves my problem elegantly/efficiently.

Adam Parkin
  • Why not start simple with color histograms and move to feature matching? See the accepted answer at http://stackoverflow.com/questions/11541154/checking-images-for-similarity-with-opencv?rq=1 – dan Aug 16 '14 at 01:41

1 Answer


I would grab a small subimage somewhere not near the edges, and cross-correlate this within the vicinity of its source location in your database images. You can resample it prior to cross-correlation to account for small resizes, and you can choose the size of the vicinity that you match against to account for asymmetrical crops of a certain percentage.
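
Roughly, with OpenCV's normalized cross-correlation (the patch size, search margin and threshold below are placeholder values, and both images are assumed to already be normalized to the same working resolution):

import cv2

def patch_matches(candidate, stored, top_left, patch_size=64, margin=32, threshold=0.98):
    # Cut a patch from the candidate image and cross-correlate it against the
    # corresponding neighbourhood of the stored image, padded by `margin`
    # pixels to tolerate small shifts / asymmetric crops.
    y, x = top_left
    patch = candidate[y:y + patch_size, x:x + patch_size]

    y0, x0 = max(0, y - margin), max(0, x - margin)
    y1 = min(stored.shape[0], y + patch_size + margin)
    x1 = min(stored.shape[1], x + patch_size + margin)
    region = stored[y0:y1, x0:x1]
    if region.shape[0] < patch_size or region.shape[1] < patch_size:
        return False

    scores = cv2.matchTemplate(region, patch, cv2.TM_CCOEFF_NORMED)
    # Since you care more about false positives than false negatives,
    # demand a near-perfect correlation peak.
    return scores.max() >= threshold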

To avoid perfect fits on featureless regions (e.g. the sky), you could use local image variation as a selection criterion for the subimage location.
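
For example, pick the candidate patch with the highest local variance over a coarse grid (grid step, border and patch size are again arbitrary; `gray` is a grayscale NumPy array):

def pick_textured_patch(gray, patch_size=64, step=32, border=64):
    # Return the top-left corner of the most "textured" patch, skipping the
    # borders so the patch is likely to survive moderate crops.
    best, best_var = None, -1.0
    for y in range(border, gray.shape[0] - border - patch_size, step):
        for x in range(border, gray.shape[1] - border - patch_size, step):
            var = gray[y:y + patch_size, x:x + patch_size].var()
            if var > best_var:
                best, best_var = (y, x), var
    return best  # None if the image is too small for these parameters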

This would still be quite slow, so it will be necessary to use a global image metric to first select candidate duplicates from the database (e.g. the color histograms mentioned by danf).
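
A coarse first pass along those lines, again with OpenCV (the bin counts and the 0.9 cut-off are placeholders you would want to tune):

import cv2

def color_histogram(bgr):
    # 3-D BGR histogram, normalized so differently-sized images are comparable.
    hist = cv2.calcHist([bgr], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def is_candidate(hist_a, hist_b, threshold=0.9):
    # Only pairs whose histograms correlate strongly go on to the (slow)
    # cross-correlation check above.
    return cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL) >= threshold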

dvhamme