22

Okay. So I have about 250,000 high-resolution images. What I want to do is go through all of them and find the ones that are corrupted. If you know what 4scrape is, then you know the nature of the images I have.

By corrupted, I mean that when the image is loaded into Firefox, it says:

The image “such and such image” cannot be displayed, because it contains errors.

Now, I could select all of my 250,000 images (~150 GB) and drag-n-drop them into Firefox. That would be bad, though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.

Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?

I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.

Nakilon
Joel Verhagen

5 Answers

29

An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).

from PIL import Image

# `path` is the image file to check
v_image = Image.open(path)
v_image.verify()  # raises an exception if the file looks broken

Catch the exceptions...

From the documentation:

im.verify()

Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
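
As a rough sketch of how the check above could be applied to a whole collection, the snippet below runs `verify()` and then, because the comments on this answer report that `verify()` alone misses some damaged files, reopens each file and forces a full decode with `load()`. The function names, the file patterns, and the choice to treat any exception as "corrupt" are assumptions made for illustration, not part of the original answer.

```python
from pathlib import Path

from PIL import Image


def is_corrupt(path):
    """Return True if Pillow cannot verify or fully decode the file."""
    try:
        # Pass 1: cheap structural check; no pixel data is decoded.
        with Image.open(path) as im:
            im.verify()
        # Pass 2: the image must be reopened after verify(); load() forces a full decode.
        with Image.open(path) as im:
            im.load()
    except Exception:
        # Assumption: any exception raised by either pass counts as "corrupt".
        return True
    return False


def find_corrupt(root, patterns=("*.jpg", "*.jpeg", "*.png", "*.gif")):
    """Yield paths under root (searched recursively) that fail the check."""
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if is_corrupt(path):
                yield path
```

Calling `list(find_corrupt("some_folder"))` would collect the suspect files. The second pass is much slower than `verify()` alone, and very large images may trip Pillow's decompression-bomb limit and be reported as corrupt, so the results are worth spot-checking.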

Jonathan Root
ChristopheD
  • This is working for some of the corrupted images. The advantage of this method is that it is very fast. Thanks ChristopheD! – Joel Verhagen Sep 10 '09 at 13:38
  • 2
    This solution is so simple that I've wrapped it in a Python script that recursively checks for corrupt files. I'm posting it here in the hope it helps someone else: http://bitbucket.org/denilsonsa/small_scripts/src – Denilson Sá Maia Oct 11 '10 at 14:08
  • @DenilsonSá If I edit line 32: `self.globs = ['*.jpg', '*.jpe', '*.jpeg']` in jpeg_corrupt.py, to include `*.png` and `*.gif`, would the rest of the code work fine in verifying PNG and GIF images too? – galacticninja Jul 22 '14 at 08:55
  • @galacticninja: I don't know, I haven't tried with those image types. Why don't you try and report back the results? Even better, come back with a nice pull request. :) – Denilson Sá Maia Jul 22 '14 at 20:41
  • 3
    The `verify()` method won't work with all img formats. A truncated JPG file (and some other formats) can still "pass" both `verify()` and `open()` without raising exceptions. If you really want to cover everything, you can try `show()` or even better: `load()`. These will raise an exception if it fails, mostly `OSError`. – maviz Jan 07 '17 at 05:22
  • 1
    As of this comment, the verify method only works on PNG images. And it won't even cover things like IDAT errors. – CMCDragonkai Apr 20 '18 at 02:18
  • 2
    @maviz I've been working on a little general-purpose checking script (which grew out of some Python to glue `os.walk` to `unzip -t` and `unrar t`) and I can tell you that `Image.load()` won't catch all JPEG errors. In fact none of the solutions I've found (even JPEG-specific ones) will catch the example someone else posted at https://superuser.com/q/276154/48014 so I'm probably going to have to look into whether the green stripes are solid-color enough to be programmatically recognizable as being "not what a JPEG should produce". – ssokolow Jun 05 '18 at 17:53
  • @ssokolow Did you make any progress on that? – Hashim Aziz Jan 14 '20 at 05:44
  • 1
    @Hashim I had to put image-related stuff beyond `Image.load()` on hold to work on more pressing projects. – ssokolow Jan 14 '20 at 10:03
  • Update: my script (mentioned in the second comment) has moved from bitbucket to GitHub: https://github.com/denilsonsa/small_scripts/blob/master/jpeg_corrupt.py – Denilson Sá Maia Jul 01 '20 at 20:17
7

I suggest you check out ImageMagick for this: http://www.imagemagick.org/

It includes a tool called identify, which you can either call from a script (reading its output) or use through the programming interface provided.
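
For the scripted route, a minimal sketch of calling a command-line checker from Python might look like the following. It assumes ImageMagick's `identify` is installed and on PATH, and that the tool signals an unreadable file through a non-zero exit status; `tool_accepts` and its `command` parameter are hypothetical names introduced here.

```python
import subprocess


def tool_accepts(path, command=("identify",)):
    """Return True if the external tool exits with status 0 for this file.

    `command` is the argv prefix to run before the file name. The default
    assumes ImageMagick's `identify` is available on PATH.
    """
    try:
        result = subprocess.run(
            [*command, str(path)],
            stdout=subprocess.DEVNULL,  # only the exit status is inspected
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"{command[0]!r} is not installed or not on PATH") from exc
    return result.returncode == 0
```

As the comments on this answer point out, `identify` mostly inspects the header, so a file that passes here can still be damaged further in. Another checker can be swapped in through `command`, provided you confirm that it reports failures via its exit status.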

Niko
  • 1
    What is your (or anyone's) opinion about GraphicsMagick which is supposed to be a more stable fork of ImageMagick? – Todd Sep 09 '09 at 21:22
  • never played around with it - but i will give it a try - thanks for the info – Niko Sep 09 '09 at 21:36
  • Note that identify looks at the header only, so it should be quick, but it's not a guarantee against a corrupt image. Though I'm sure other bits of imagemagick can provide a more thorough check. – John Carter Sep 10 '09 at 00:03
  • Then again, you're scraping from 4chan; corrupt images are kind of half the point, aren't they? (I kid) – John Carter Sep 10 '09 at 00:04
  • therefromhere is right. I tried identify out and it doesn't catch known broken images. Thanks anyways! – Joel Verhagen Sep 11 '09 at 18:22
  • As noted [here](http://unix.stackexchange.com/questions/20170/how-can-i-use-imagemagicks-identify-command-in-a-script-to-tell-if-a-jpeg-file) there is a tool called `jpeginfo`. Running `jpeginfo -c ` worked for me. – trobter Jun 06 '14 at 09:49
  • @trobter Be aware that I tried `jpeginfo` on the image with the green corruption that was uploaded on https://superuser.com/q/276154/48014 and it didn't catch it. (But then I haven't found *anything* which catches that yet, so I may have to cook something new up.) – ssokolow Jun 05 '18 at 18:12
5

In PHP, with exif_imagetype():

if (exif_imagetype($filename) === false)
{
    unlink($filename); // image is corrupted
}

EDIT: Or you can try to fully load the image with ImageCreateFromString():

if (ImageCreateFromString(file_get_contents($filename)) === false)
{
    unlink($filename); // image is corrupted
}

An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.

Alix Axel
  • That only reads the first few bytes looking for an image header, that's not going to be enough to confirm the image isn't corrupt. – John Carter Sep 09 '09 at 23:55
  • (though it's better than nothing, and it'd be quick) – John Carter Sep 10 '09 at 00:10
  • The advantage of this method is that it checks the entire image for corruption. It's slower, but it is more thorough. Thanks eyze! – Joel Verhagen Sep 10 '09 at 13:39
  • I tried the second one, but I keep getting errors: "libpng warning: Ignoring bad adaptive filter type", "libpng warning: Extra compressed data", "libpng warning: Extra compression data", and so on that appear to be coming from the libpng c library rather than PHP when the image is corrupted. Anyone else run into this? – SeanJA Feb 25 '10 at 16:23
  • Note that `imagecreatefromstring()` will load many types of corrupted images just fine, you will just get a partial image. I tested this with truncated JPEG files. It will usually write an error message to stderr, which is likely to go unnoticed. – jlh Jan 10 '17 at 11:16
4

If your exact requirement is that the image displays correctly in Firefox, you may have a difficult time: the only way to be sure would be to link against the exact same image-loading source code that Firefox uses.

Basic image corruption (file is incomplete) can be detected simply by trying to open the file using any number of image libraries.

However, many images can fail to display simply because they use a part of the file format that the particular viewer you are using can't handle (GIF in particular has a lot of these edge cases, but you can also find JPEGs and the rare PNG file that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears to be uncorrupted in viewer X, but in reality the file has been cut short and only displays correctly because very little information has been lost. Firefox can show some cut-off JPEGs correctly (you get a grey bottom), but with others Firefox seems to load them halfway and then displays the error message instead of the partial image.
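
One cheap heuristic for the cut-short JPEG case, offered only as a sketch and not as what Firefox or any particular viewer does: a complete JPEG stream starts with the SOI marker (`FF D8`) and ends with the EOI marker (`FF D9`), so a file missing either is suspect. The function name and the `tail_bytes` allowance for trailing padding are assumptions made for this example.

```python
def jpeg_markers_missing(path, tail_bytes=64):
    """Heuristic: True if the file lacks the JPEG start or end marker.

    The end marker is searched for in the last `tail_bytes` bytes so that
    files with a little trailing padding are not flagged.
    """
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, 2)  # jump to the end to learn the file size
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    if head != b"\xff\xd8":  # SOI marker missing: not a JPEG stream
        return True
    return b"\xff\xd9" not in tail  # EOI marker missing: likely cut short
```

This only reads a few bytes per file, so it is fast, but it cannot detect damage in the middle of a file that still has both markers. It is best treated as a first-pass filter before a full decode.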

David
0

You could use ImageMagick, if it is available.

To check a whole folder:

identify "./myfolder/*" >log.txt 2>&1

To check a single file:

identify myfile.jpg
SeanJA
  • That doesn’t seem to work. Using the `-verbose` switch does catch *some* damaged pictures, but it also **drastically** increases the time to process each file. – Synetech Sep 09 '11 at 00:32
  • Depends on the corruption I guess? This will find pictures that are corrupted in the first few bytes (or wherever it identifies them as the image type). It is essentially the same as using the `exif_imagetype()` function in php – SeanJA Sep 12 '11 at 12:28
  • Unfortunately that’s not very useful. Corruption of the header/magic number is quite trivial. What’s needed is something that can check if the picture adheres to the specs of the format (which also has the side-effect of ferreting out stego pics). – Synetech Sep 12 '11 at 21:10
  • Interesting that the accepted answer is "use `exif_imagetype()`" – SeanJA Sep 13 '11 at 18:46
  • Unfortunately, sometimes people have to settle. `:-(` – Synetech Sep 13 '11 at 20:47