Is OCR no longer an issue?

Question

According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.

My question is: is this true? Is the current state-of-the-art so good that - for a good scan of English text - there aren't any major improvements left to be made?

Or, a less subjective form of this question is: how accurate are modern OCR systems at recognising English text for good quality scans?

I also can't see how this is programming related, but more importantly, I fail to see a real question here. "How accurate is(...)" is a highly subjective question to be honest... — Razzie, Oct 19 '09 at 09:52
Good question. Since the output of OCR is rarely useful in itself, but is usually an input to some text and/or layout extraction software, and often requires programmatic massaging, I count this as a programming-related question. — Charles Stewart, Jan 01 '10 at 09:20

score 5 · Answer 1 · answered Oct 20 '09 at 10:02

I think that it is indeed a solved problem. Just have a look at the plethora of OCR technology articles for C#, C++, Java, etc.

Of course the article does stress that the script needs to be typewritten and clear. This makes recognition a relatively trivial task, whereas if you need to OCR scanned pages (noise) or handwriting (diffusion), it can get trickier as there are more things to tune correctly.

score 3 · Accepted Answer · answered Jan 01 '10 at 09:36

Considered narrowly as breaking up a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from one of a set of well-behaved, prespecified fonts (cf. Omnifont), it is a solved problem.

Start to play about with those parameters, e.g., eccentric unknown fonts, noisy scans, or Asian characters, and it starts to become somewhat flaky or to require additional input. Many well-known Omnifont systems do not handle ligatures well.

And the main problem with OCR is making sense of the output. If this were a solved problem, Google Books would give flawless results.
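
To put a number on "how accurate", one common measure is the character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. Below is a minimal sketch in Python, assuming you already have both strings. The function names and the sample strings are made up for illustration and do not come from any particular OCR package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # Hypothetical sample: 'm' misread as 'rn', and an 'ff' pair collapsed to 'f'.
    truth = "modern office"
    ocr = "rnodern ofice"
    print(edit_distance(truth, ocr))                   # 3
    print(round(character_error_rate(truth, ocr), 3))  # 0.231
```

CER only measures character-level agreement. It says nothing about layout or reading order, which is where the "making sense of the output" problem lives.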
