I've used Tesseract a bit and its results leave much to be desired. I'm currently trying to recognize very small images (35x15 pixels, without a border, though I've tried adding one with ImageMagick with no OCR advantage); they range from 2 to 5 characters in a fairly consistent font, but the characters vary enough that simply using an image checksum or the like is not going to work.
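
For reference, the border experiment was along these lines (a rough sketch; the border size and color are arbitrary):

# Pad the tiny source image with a plain border before handing it to OCR.
`convert small.png -bordercolor white -border 10 small_bordered.png`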

What options exist for OCR besides sticking with Tesseract or doing a complete custom training of it? Also, it would be VERY helpful if this were compatible with Heroku-style hosting (at least where I can compile the binaries and push them over).


2 Answers

I have successfully used GOCR in the past for small-image OCR. On fairly regular fonts, accuracy was around 85% once the grayscale options were set properly. It fails miserably when the fonts get complicated, and it has trouble with multi-line layouts.
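
For example, a minimal invocation from Ruby (the values are illustrative; -l is the grey-level threshold and -C restricts recognition to a given character set, which may also help with a limited vocabulary):

# Run GOCR over one small image; tune -l (grey level) per image source.
# -C restricts output to the listed characters, useful for a known alphabet.
text = `gocr -l 160 -C "0-9A-Za-z" -i small.png`.strip
puts text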

Also have a look at Ocropus, which is maintained by Google. It's related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves near 99% accuracy on high-quality images, handles layout pretty well, and produces HTML output with information about formatting and lines. However, in my experience, its accuracy is very low when the image quality is not good enough. That said, training is relatively simple, and you might want to give it a try.

Both of them are easily callable from the command line. GOCR usage is very straightforward (as sketched above); gocr -h lists all the options you need. Ocropus is a bit trickier; here's a usage example, in Ruby:

require 'fileutils'

tmp  = 'directory'   # scratch directory for intermediate files
file = 'file.png'

# Ocropus pipeline: split the image into pages, segment pages into
# lines, recognize each line, then assemble the results as HTML.
`ocropus book2pages #{tmp}/out #{file}`
`ocropus pages2lines #{tmp}/out`
`ocropus lines2fsts #{tmp}/out`
`ocropus buildhtml #{tmp}/out > #{tmp}/output.html`

text = File.read("#{tmp}/output.html")
FileUtils.rm_rf(tmp)   # clean up intermediates once we have the HTML
  • Very interesting! Thanks a bunch. I would be particularly interested in training. I can limit the vocabulary to about 50 "words" if vocabulary training or limiting is possible so as to give it a defined set of boundaries. – ylluminate Mar 13 '12 at 19:46
  • I recommend you have a look at [this video](http://www.youtube.com/watch?v=x6qKa3St9S8), which gives a solid explanation of how to train Ocropus. Training for GOCR remains a mystery to me; I am not even sure it is possible, and the docs are unhelpful. – user2398029 Mar 13 '12 at 19:53
  • For `ocropus`, did you use the older codebase that hasn't been updated for a few years, or did you check out the repo and compile the newer updates in the works? – ylluminate Mar 13 '12 at 19:55
  • I used `port install` - not sure how old the port definitions are/were when I installed it. I don't know if it is still the case, but for a long time this was the only way to get it to compile on Mac OS X without hours of burning in dependency hell. But I'd definitely try compiling from source, if you can get it to work. – user2398029 Mar 13 '12 at 21:13
  • I'm considering working on a homebrew recipe, however it seems a bit involved. The new source release from just the past few days has an install script, but it needs some help for Mac OS X. `http://code.google.com/p/ocropus/source/list` and `http://code.google.com/p/ocropus/wiki/InstallTranscript` may provide some useful references. – ylluminate Mar 13 '12 at 21:18
  • I'm sure that would be welcomed by many - it's definitely a great tool, and should be made more accessible IMO. – user2398029 Mar 13 '12 at 21:21
  • We had some discussion about it on IRC, however it appears no one is really willing to tackle a HEAD-based formula for it. Any idea how close we are to a full release of 0.5? – ylluminate Mar 16 '12 at 00:24
  • I honestly have no idea. Sorry. – user2398029 Mar 18 '12 at 22:42

We use OCR XTR Lite from Vividata at my office. It uses the ScanSoft engine and is very accurate, but it isn't a free solution. We currently drive it from bash scripts and process 75,000 to 150,000 pages a day with it. Accuracy is almost perfect, and it auto-rotates the images to determine the correct OCR orientation.