21

I want to find an easy-to-use OCR Python module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it ships with a .exe executable file.

I tried changing the code to use Wine, and it does work, but it is too slow and not really a good idea.

Are there any Linux alternatives that are as easy to use?

Felix Yan
  • 1
    Why close the question? It surely fits "software tools commonly used by programmers" and "practical, answerable problems that are unique to the programming profession" as defined in http://stackoverflow.com/help/on-topic – Felix Yan Jul 26 '13 at 08:31

5 Answers

17

You can just wrap tesseract in a function:

import os
import tempfile
import subprocess

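def ocr(path):
    # Reserve a unique base name; tesseract writes its result to '<base>.txt'.
    temp = tempfile.NamedTemporaryFile(delete=False)
    temp.close()

    # Run tesseract and wait for it to finish, capturing its console output.
    process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    process.communicate()

    with open(temp.name + '.txt', 'r') as handle:
        contents = handle.read()

    # Clean up both the output file and the placeholder temp file.
    os.remove(temp.name + '.txt')
    os.remove(temp.name)

    return contents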

If you want document segmentation and more advanced features, try out OCRopus.
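For batches of images, the same idea can be run from several threads. The following is only a minimal sketch, not a tested drop-in: it assumes Python 3.7+ and a Tesseract build that accepts the literal output base `stdout` (older builds only write to `<outputbase>.txt`), which avoids temp files altogether. The names `ocr_stdout`, `ocr_many`, and the `command` parameter are hypothetical helpers introduced for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_stdout(path, command=("tesseract",)):
    """Run tesseract on one image and return the recognized text.

    Assumption: the installed tesseract accepts 'stdout' as its
    output base and prints the text instead of writing a file.
    """
    result = subprocess.run(
        [*command, path, "stdout"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # drop tesseract's status messages
        check=True,                 # raise CalledProcessError on failure
        text=True,                  # decode the output to str
    )
    return result.stdout


def ocr_many(paths, workers=4, command=("tesseract",)):
    """OCR several images concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_stdout(p, command), paths))
```

Each call starts its own tesseract process, so the threads only wait on subprocesses and do not share any files. Something like `ocr_many(['a.png', 'b.png'])` would return one string per image.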

Blender
  • That code is wrong, `handle.close()` calls `str.close()` which doesn't exist. – OneOfOne Apr 27 '11 at 05:59
  • Gotcha. I re-wrote it a little while I was writing this, as I had two `.close()` functions which were taking up space. Not sure if it's bad to omit them, but I've heard that Python cleans up all by itself. – Blender Apr 27 '11 at 06:01
  • Yes, pytesser is also something like your function: it shells out to the tesseract executable and gets stdout back. But tesseract.exe uses a temp file, which stops me from using multi-threading (file conflict). – Felix Yan Apr 27 '11 at 06:58
  • Hmm, how would you multi-thread? It's not supported in Tesseract (AFAICT, but the svn version of Tesseract worked wonders for me with its layout analysis), and since I use a tempfile, it's unique and won't conflict. – Blender Apr 27 '11 at 07:00
  • Maybe a different method would work? What are you trying to accomplish through this module? – Blender Apr 27 '11 at 07:02
  • I have lots of different image files and am using threading.Thread and Queue to multi-thread. Thank you, I'll try your function with the svn version of tesseract. – Felix Yan Apr 27 '11 at 07:07
  • Could you show an example of how to use the `ocr()` function, please? – david_adler Oct 14 '13 at 12:36
  • 1
    @david_adler: `ocr('path/to/your/image.png')`? – Blender Oct 14 '13 at 17:49
11

In addition to Blender's answer, which just runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, you need more sophisticated layout analysis, or you need to export to PDF, Word, and other formats.

Update: here is a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY

Tomato
  • I'd argue that Tesseract has better accuracy than ABBYY FineReader, as I've used both to digitize a few hundred books. – Blender Apr 27 '11 at 22:54
  • 1
    @Blender: Here's a comparison of several engines: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison You can see that ABBYY is far more accurate in general, giving 100% accuracy on most of the samples, but there are still areas where it is worse than Tesseract. My experience shows the same: ABBYY is in general way more accurate and (what is most important for me) works well even without training. Did you train Tesseract for a document? And did you also train ABBYY, or did you compare with it as it is? – Tomato Apr 28 '11 at 08:34
  • No training for Tesseract at all. But I'm using the `svn` version of Tesseract, which is much different from the normal stable build. – Blender May 02 '11 at 17:47
  • 2
    I tested `tesseract` with those images, and it has only 2 characters which differ from the original. ABBYY works for printed text well, but for cruddy typed text that I took a picture of, Tesseract works a bit better. Other than that, I don't have the money to buy ABBYY ;) – Blender May 02 '11 at 17:52
6

python tesseract

http://code.google.com/p/python-tesseract

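import cv2.cv as cv  # legacy OpenCV 2.x bindings
import tesseract

# Initialize the engine with the English language data.
api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image as grayscale and hand it to Tesseract.
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
text = api.GetUTF8Text()    # recognized text
conf = api.MeanTextConf()   # mean confidence of the recognized text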
FreeToGo
1

You should try the excellent scikits.learn libraries for machine learning. You can find two ready-to-run code examples here and here.

Jaime Ivan Cervantes
0

You have a bunch of options here.

One way, as others pointed out, is to use Tesseract. There are quite a few wrappers by now, so the best approach is to do a quick PyPI search for it. The most used ones these days are:

Another useful site for finding similar engines is alternative.to. According to it, a few Linux-based systems are:

  • ABBYY
  • Tesseract
  • CuneiForm
  • Ocropus
  • GOCR
Vajk Hermecz