How to grab a subtitle from a screenshot with PHP?

Question

I want to grab the subtitle text from a movie screenshot. Example: [screenshot]

From it, I want to extract:

Hey, why don't we all just relax, huh?

This is not about subtitle files; the text is part of the screenshot image. Since it is a subtitle, we know the font, size, etc., if that makes it easier to extract.

I know most of you will suggest a PHP OCR library, but since the background is always different, it looks like it won't work.

`it looks like it won't work.` - have you tried? I mean it probably won't, but at least *try*. And the reason it probably won't work is because pretty much nothing will. Certainly nothing that has pre-built PHP support. — DaveRandom, Jan 08 '12 at 17:11
"Looks like it won't work", but have you tried it (an OCR library)? Subtitles are generally at the bottom of the scene, so you would be able to trim a lot of the picture to start with. — Alex, Jan 08 '12 at 17:12
I meant I tried this http://www.phpclasses.org/package/2874-PHP-Recognize-text-objects-in-graphical-images.html and it didn't work. That class has not been updated since 2006. Is there any alternative to it? I couldn't find one. — SNaRe, Jan 08 '12 at 17:19
Doesn't matter that the background is always different, just use GD (or some other image lib) to replace any colour that isn't white (the font colour) with black. Then the background will always be the same (or close to it) and you can use OCR. — Rich Adams, Jan 08 '12 at 17:19
@nmagerko To make it easier to understand, I will keep it simple. How do I grab the text "Hey, why don't we all just relax, huh?" from that JPG with PHP? — SNaRe, Jan 08 '12 at 17:20
@RichAdams Good point and nice idea, but just replacing everything that isn't white is dangerous, the text is probably not pure #FFFFFF - you would have to replace everything (e.g.) less than #EEEEEE — DaveRandom, Jan 08 '12 at 17:23
@RichAdams Do you know any PHP OCR library/class that works? I have concluded that there is no pure PHP solution. The only option seems to be calling an external OCR program from PHP. — SNaRe, Jan 08 '12 at 17:30
@SNaRe I've never tried any, so I don't know. A quick Google search brings up PhpOCR (http://www.phpkode.com/scripts/item/phpocr/) though, which seems like a decent candidate. — Rich Adams, Jan 08 '12 at 17:54
@RichAdams It is the only PHP OCR library. I tried it and it doesn't work. — SNaRe, Jan 08 '12 at 18:02

Rich Adams · Accepted Answer · 2012-01-08T17:53:38.610

The background being different shouldn't be a problem; you can just use an image library to remove anything that isn't the text colour.

Here's a quick example that gives a decent idea of what I mean. It replaces any pixel whose red, green and blue values are all 245 (0xf5) or lower with black (#000000):

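<?php
// Load the screenshot; JPEGs load as truecolor, so imagecolorat()
// returns a packed 0xRRGGBB integer.
$im = imagecreatefromjpeg("img.jpg");

// Allocate black once instead of once per pixel.
$black = imagecolorallocate($im, 0, 0, 0);

for ($x = imagesx($im); $x--;)
{
    for ($y = imagesy($im); $y--;)
    {
        $rgb = imagecolorat($im, $x, $y);

        // Blacken the pixel when all three channels are at or below 245;
        // pixels with at least one brighter channel are left untouched.
        if ((($rgb >> 16) & 0xFF) <= 245
            && (($rgb >> 8) & 0xFF) <= 245
            && ($rgb & 0xFF) <= 245)
        {
            imagesetpixel($im, $x, $y, $black);
        }
    }
}

// Send the filtered image to the browser.
header("Content-Type: image/jpeg");
imagejpeg($im);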

Here's how the result looks:

You can probably chop most of the top part off since you know the subtitles will be at the bottom. Then just run it through an OCR library.
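As a rough sketch of that cropping step, GD's `imagecrop()` (PHP 5.5+) can keep just the bottom strip. The helper name `cropBottom` and the 25% default are assumptions for illustration, not something measured from the screenshot; adjust the fraction to wherever the subtitles actually sit.

```php
<?php
// Hypothetical helper: keep only the bottom fraction of the frame,
// where the subtitles are assumed to be.
function cropBottom($im, float $fraction = 0.25)
{
    $width  = imagesx($im);
    $height = imagesy($im);
    // Number of pixel rows to keep, at least one.
    $keep   = max(1, (int) round($height * $fraction));

    return imagecrop($im, [
        'x'      => 0,
        'y'      => $height - $keep,
        'width'  => $width,
        'height' => $keep,
    ]);
}

// Usage with a synthetic 200x100 image standing in for a real screenshot.
$frame = imagecreatetruecolor(200, 100);
$strip = cropBottom($frame, 0.25);
echo imagesx($strip) . 'x' . imagesy($strip) . "\n"; // prints 200x25
```

Cropping before the colour filter also means fewer pixels to loop over.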

For PHP there's PhpOCR, although it has to be trained first with example letters.

It's probably better to use an external OCR library or command line tool and call it from PHP. For external tools, there's tesseract and ocropus (I believe ocropus is sponsored by Google too).
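Calling the command line tool from PHP might look roughly like the sketch below. It assumes a `tesseract` binary is installed and on the PATH, and that the installed version accepts `stdout` as the output base (recent releases do). The function names `buildOcrCommand` and `ocrImage` and the file name `subtitle.png` are made up for the example.

```php
<?php
// Hypothetical helper: build the shell command for Tesseract.
// "stdout" as the output base asks it to print the recognised text
// instead of writing a .txt file; -l selects the language data.
function buildOcrCommand(string $imagePath, string $lang = 'eng'): string
{
    return 'tesseract ' . escapeshellarg($imagePath)
        . ' stdout -l ' . escapeshellarg($lang);
}

// Hypothetical helper: run the command and return the recognised text,
// or null if the command failed or produced no output.
function ocrImage(string $imagePath): ?string
{
    $output = shell_exec(buildOcrCommand($imagePath));
    return is_string($output) ? trim($output) : null;
}

// Show the command that would be run for a pre-processed image.
echo buildOcrCommand('subtitle.png') . "\n";
```

Passing the path through `escapeshellarg()` matters if the file name ever comes from user input.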

Thanks for that. This can be useful for pre-processing. After that I think I should work on a server side solution. PHP is not enough to do this, even though there are some libraries around. — SNaRe, Jan 08 '12 at 17:44
