According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.

My question is: is this true? Is the current state-of-the-art so good that - for a good scan of English text - there aren't any major improvements left to be made?

Or, a less subjective form of this question is: how accurate are modern OCR systems at recognising English text for good quality scans?
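
One way to make "how accurate" measurable is character accuracy: compare the OCR output against a hand-checked transcription and count the edits needed to turn one into the other. The following is a minimal sketch, assuming a plain Levenshtein edit distance; the function names `edit_distance` and `character_accuracy` are illustrative, not from any OCR library.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # previous[j] = distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference characters recovered, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # "rn" misread as "m" is a typical OCR confusion: 2 edits over 6 characters.
    print(character_accuracy("modern", "modem"))
```

Even a score that looks high can hide a lot of damage: at 99% character accuracy, a page of 2,000 characters still contains about 20 errors.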

Mathew Thompson
David Johnstone
  • Well, you read it on Wikipedia so it must be true. – cletus Oct 19 '09 at 09:39
  • How is this programming related? – Brian Rasmussen Oct 19 '09 at 09:40
  • Because it's a programming problem? – cletus Oct 19 '09 at 09:41
  • I also can't see how this is programming related, but more importantly, I fail to see a real question here. "How accurate is(...)" is a highly subjective question to be honest... – Razzie Oct 19 '09 at 09:52
  • Good question. Since the output of OCR is rarely useful in itself, but is usually an input to some text and/or layout extraction software, and often requires programmatic massaging, I count this as a programming-related question. – Charles Stewart Jan 01 '10 at 09:20

2 Answers

I think that it is indeed a solved problem. Just have a look at the plethora of OCR technology articles for C#, C++, Java, etc.

Of course the article does stress that the script needs to be typewritten and clear. This makes recognition a relatively trivial task, whereas if you need to OCR scanned pages (noise) or handwriting (diffusion), it can get trickier as there are more things to tune correctly.

NT_

Considered narrowly as breaking up a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from one of a set of well-behaved, prespecified fonts (cf. Omnifont), it is a solved problem.

Start to play about with those parameters, e.g., eccentric unknown fonts, noisy scans, or Asian characters, and it starts to become somewhat flaky or to require additional input. Many well-known Omnifont systems do not handle ligatures well.

And the main problem with OCR is making sense of the output. If this were a solved problem, Google Books would give flawless results.

Charles Stewart