
Imagine a short video clip like this: black background, a line of white text in the center that gets gradually filled with red color, not only letter by letter, but each individual letter is filled gradually. Here is a simplified image that illustrates this:

[Image: three example frames of "HELLO WORLD" in white on black, with the red fill progressing letter by letter]

(There is a bunch of frames in between, but they are omitted for simplicity.)

Thus, after some time (like 10 seconds) the whole string will be red.

Now the task I have to solve:

  • I have to recognize the initial string, thus I should get "HELLO WORLD" as the result.
  • Not only that: for every letter I have to find out at which point it starts getting filled, and at which point it is completely filled.

The output might be like this:

H,0ms,1000ms E,1000ms,1500ms L,1500ms,2500ms L,2500ms,3500ms O,3500ms,4000ms

... and so on.

The speed may vary for different letters. The typeface and font size are always the same. The character set includes lower- and uppercase letters.

I considered two approaches: OCR or a neural network. I have little experience with either.

I assume that the OCR approach will let me easily recognize the text. But how do I distinguish unfilled from filled letters?

The neural network approach will probably let me recognize both unfilled and filled letters, but it requires splitting the image into separate letters first, which might be a complex task in itself.

Are there any other options available? Or given the two options above, which one would you recommend and how would you work around the issues outlined for the two approaches?

Pavel Bastov

3 Answers


While using a specifically tuned OCR or other kind of image recognition algorithm would be the most effective approach, it would probably involve a significant amount of work on your part to get right.

Instead of doing that, how about using a simple image filter to split each frame into two layers? One layer with all white parts turned into black, and one with all red parts turned into black. In your third example frame, the first layer would only contain a red H in a black background and the second would contain a white ELLO WORLD in a black background.

You can then use an OCR algorithm to get the letters from each layer, cleanly separated into a filled and an unfilled group. Running OCR on the original frame as well would give you the whole text, making it easy to handle partially filled letters that show up (mangled) in both layers.

Depending on your performance requirements, this might be enough to do what you need, despite having to run the OCR algorithm three times as often...
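The color split described above can be sketched with NumPy. This is a minimal sketch that assumes clean synthetic video (pure white text filling with pure red on a black background); real footage would need looser thresholds than these:

```python
import numpy as np

def split_layers(frame):
    """Split an RGB frame into a white-text layer and a red-text layer.

    Assumes near-pure white letters filling with near-pure red on a
    black background; the channel thresholds below are assumptions
    to tune for real input.
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    white_mask = (r > 200) & (g > 200) & (b > 200)   # unfilled parts
    red_mask = (r > 200) & (g < 80) & (b < 80)       # filled parts

    white_layer = np.zeros(frame.shape[:2], dtype=np.uint8)
    white_layer[white_mask] = 255
    red_layer = np.zeros(frame.shape[:2], dtype=np.uint8)
    red_layer[red_mask] = 255
    return white_layer, red_layer
```

Each returned layer is a binary image suitable for feeding to an OCR engine on its own.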

thkala

You may want to try the Tesseract OCR engine and work with character-level (symbol-level) confidence values (see examples). As the color/filling of the characters changes, it will likely affect the confidence as well.
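As one possible way to use those confidence values: pytesseract's `image_to_data(..., output_type=Output.DICT)` returns parallel `text` and `conf` lists, and a helper can flag entries whose confidence dips (which partially filled letters are likely to cause). The threshold of 80 is a guess to tune; the function below only assumes the dict shape, so it can be tested without Tesseract installed:

```python
def low_confidence_words(ocr_data, threshold=80):
    """Flag OCR entries whose confidence falls below `threshold`.

    `ocr_data` is assumed to have the shape returned by
    pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists under 'text' and 'conf'. A confidence dip can
    hint that a letter is mid-fill at that frame.
    """
    flagged = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # pytesseract reports conf as strings; empty slots use "-1"
        if text.strip() and int(conf) < threshold:
            flagged.append((text, int(conf)))
    return flagged
```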

nguyenq

I tested your image in a powerful commercial OCR application. For simplicity, I ran all three frames at once; it makes no difference whether they are read together or one at a time, since segmentation handles that automatically and reads each zone separately. The result looks like this (ignore the blue highlight): [Image: OCR output showing the recognized text for all three frames]

What you see are actual digital characters/strings. The software detected white text on a black background, i.e., inverted text.

My concern before the test, confirmed by it, was those partially filled characters. OCR will read text and anything that looks like text, so you may get partial character reads such as I (see the 2nd frame above), a semicolon (partial C), periods (partial L), V (partial W), etc. As long as you filter for those, I suppose....

I believe OCR is the easier option for a quick prototype or a one-off need, but it may not be precise to the millisecond and may produce raw results that need to be post-processed with additional decision making and filtering.

A completely reliable method would be image and pixel analysis. As you said, a few additional steps are needed before the actual pixel analysis begins.

So in the end, I think both are necessary for an elegant and reliable solution.

How about this:

  1. Use the first frame (the one with no red pixels) to get the whole string using OCR, plus bounding-box coordinates for each character. (You did not say, but it seems that the position of the characters stays exactly the same from frame to frame.) The OCR system I tested provides the exact coordinates of each character in XML. Other OCR engines should be able to do this as well.

  2. Starting from the left, analyze each character's bounding box (treating each like a small separate image a few pixels wide and tall, but using exact coordinates) for the presence of at least one red pixel. Boom - that's your fill start for that character.

  3. Analyze the same box for the last white pixel. Boom - that's your end of fill for that character.

Repeat for all characters.
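The per-character scan in steps 2 and 3 could be sketched as follows. This is a minimal sketch assuming frames are RGB NumPy arrays, boxes come from the OCR pass on frame 0, and the same pure-color thresholds as clean synthetic video; the function names and thresholds are illustrative, not from any particular library:

```python
import numpy as np

def fill_times(frames, boxes, fps):
    """Find fill start/end times (in ms) for each character box.

    frames: list of RGB uint8 arrays, one per video frame
    boxes:  {char_index: (x, y, w, h)} from OCR on the first frame
    fps:    frames per second, to convert frame indices to ms

    Fill start = first frame with any red pixel in the box;
    fill end   = first frame with no white pixels left in the box.
    """
    results = {}
    for idx, (x, y, w, h) in boxes.items():
        start = end = None
        for f, frame in enumerate(frames):
            roi = frame[y:y + h, x:x + w]
            r, g, b = roi[..., 0], roi[..., 1], roi[..., 2]
            has_red = np.any((r > 200) & (g < 80) & (b < 80))
            has_white = np.any((r > 200) & (g > 200) & (b > 200))
            if start is None and has_red:
                start = f                       # fill began here
            if start is not None and end is None and not has_white:
                end = f                         # fully red now
        results[idx] = (
            None if start is None else start * 1000 // fps,
            None if end is None else end * 1000 // fps,
        )
    return results
```

Pairing the returned times with the characters recognized in step 1 yields output in the `H,0ms,1000ms ...` form the question asks for.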

Each step uses relatively simple available tools, simple well-defined algorithms, and should produce high consistency and reliability.

Ilya Evdokimov