10

I'm using Python 3.6 on Windows 10 and already have pytesseract installed, but I came across tesserocr in some code, which, by the way, I can't install. What is the difference between the two?

Soufiane S
  • In addition to [this answer](https://stackoverflow.com/a/56387215/11630056) in Tesserocr there is no support for Python 3.8 (April 2020). – Gustaw Solski Apr 01 '20 at 12:58

3 Answers

31

In my experience, tesserocr is much faster than pytesseract.

tesserocr is a Python wrapper around the Tesseract C++ API, whereas pytesseract is a wrapper around the tesseract-ocr CLI.

With tesserocr you can load the model once at the beginning of your program and then reuse it for every image (for example, in a loop that processes video frames).

With pytesseract, each call to `image_to_string` starts a new tesseract process, which loads the model and then processes the image, so repeated calls are slower.
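The reuse pattern described above can be sketched roughly as follows. This is a minimal illustration, not a benchmark: `ocr_frames`, `demo_tesserocr_loop`, and the frame file names are hypothetical, and the helper only assumes an object exposing `SetImage` and `GetUTF8Text`, as `tesserocr.PyTessBaseAPI` does.

```python
def ocr_frames(api, frames):
    """Run OCR on each frame using one already-initialized API object."""
    texts = []
    for frame in frames:
        api.SetImage(frame)  # swap in the next image; the model stays loaded
        texts.append(api.GetUTF8Text())
    return texts


def demo_tesserocr_loop(paths=("frame_001.png", "frame_002.png")):
    # Hypothetical file names. Imports live here so that ocr_frames can be
    # used (and tested) without tesserocr installed.
    import tesserocr
    from PIL import Image

    # The model is loaded once, when the API object is created.
    with tesserocr.PyTessBaseAPI() as api:
        images = (Image.open(p) for p in paths)
        return dict(zip(paths, ocr_frames(api, images)))
```

Calling `demo_tesserocr_loop()` requires tesserocr, Pillow, and the Tesseract language data; the point is only that the API object is created once, outside the loop.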

To install tesserocr, I just ran `pip install tesserocr` in the terminal.

To use tesserocr:

import tesserocr
from PIL import Image

# The model is loaded once here; the context manager releases it on exit
with tesserocr.PyTessBaseAPI() as api:
    pil_image = Image.open('sample.jpg')
    api.SetImage(pil_image)
    text = api.GetUTF8Text()

To install pytesseract, run `pip install pytesseract`.

To run it:

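import cv2
import pytesseract

# cv2.imread returns a BGR NumPy array; convert it to RGB before OCR
image = cv2.imread('sample.jpg')
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
text = pytesseract.image_to_string(rgb_image)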
mirekphd
Houssam ASSANY
  • 1
    For pytesseract no PIL needed? – Timo Nov 11 '20 at 19:14
  • `cv2 has no imread member`, but I get info when hovering over imread() in VSCode. Maybe cv2 is not installed, but I `pip`ed it and python -c 'import cv2' shows no error, just nothing. – Timo Nov 11 '20 at 19:52
  • another speed impact is that pytesseract (as of the time writing this comment) always writes images to disk instead of directly piping to tesseract, see https://github.com/madmaze/pytesseract/issues/172 – j-hap Mar 12 '21 at 07:34
4

pytesseract is a Python "wrapper" for the tesseract binary. It offers only the following functions, along with the ability to pass command-line flags (see the tesseract man page):

  • `get_tesseract_version`: Returns the Tesseract version installed on the system.
  • `image_to_string`: Returns the result of a Tesseract OCR run on the image as a string.
  • `image_to_boxes`: Returns the recognized characters and their box boundaries.
  • `image_to_data`: Returns box boundaries, confidences, and other information. Requires Tesseract 3.05+. For more information, see the Tesseract TSV documentation.
  • `image_to_osd`: Returns information about orientation and script detection.

See the project description for more information.
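As a rough illustration of how the richer `image_to_data` output might be used, the sketch below filters recognized words by confidence. The helper names `confident_words` and `demo_image_to_data` and the threshold of 60 are assumptions made for this example; it relies on `image_to_data` being called with `output_type=pytesseract.Output.DICT`, which yields a dict containing parallel `text` and `conf` lists.

```python
def confident_words(data, min_conf=60.0):
    """Return (word, confidence) pairs at or above min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # cast defensively in case confidences arrive as strings
        # Rows that are not words (blocks, lines) carry empty text and are skipped.
        if text.strip() and conf >= min_conf:
            words.append((text, conf))
    return words


def demo_image_to_data(path="sample.jpg"):
    # Needs pytesseract, Pillow, and the tesseract binary on the PATH.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return confident_words(data)
```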

tesserocr, on the other hand, interfaces directly with Tesseract's C++ API (see APIExample), which is more flexible, though also more complex, and offers advanced features.
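One example of what the direct API makes possible is recognizing a page line by line and reading a confidence value for each line. The sketch below follows the pattern shown in the tesserocr README; the function names `ocr_text_lines` and `demo_tesserocr_lines` are hypothetical, and the helper takes the API object and the iterator level as arguments so it does not depend on tesserocr being importable.

```python
def ocr_text_lines(api, line_level):
    """Return (box, text, mean_confidence) for each detected text line."""
    results = []
    # Each component is (image, box, block_id, paragraph_id);
    # box is a dict with x, y, w and h keys.
    for _image, box, _block_id, _para_id in api.GetComponentImages(line_level, True):
        # Restrict recognition to this line's bounding box.
        api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
        results.append((box, api.GetUTF8Text().strip(), api.MeanTextConf()))
    return results


def demo_tesserocr_lines(path="sample.jpg"):
    # Needs tesserocr, Pillow, and the Tesseract language data.
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    with PyTessBaseAPI() as api:
        api.SetImage(Image.open(path))
        return ocr_text_lines(api, RIL.TEXTLINE)
```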

qwr
3

pytesseract is only a wrapper around tesseract-ocr for Python. So, if you want to use tesseract-ocr from Python code without calling the `subprocess` or `os` modules yourself to run command-line tesseract-ocr commands, you use pytesseract. To use it, however, you must have tesseract-ocr installed.

You can think of it this way: you need tesseract-ocr installed because it is the program that actually runs and does the OCR. If you want to call it from Python code as a function, you install the pytesseract package, which lets you do that. So when you run `pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra')`, it calls tesseract-ocr with the provided arguments. The results are the same as running `tesseract test-european.jpg stdout -l fra`. You gain the ability to call it from code, but in the end it still has to run tesseract-ocr to do the actual OCR.
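To make the "wrapper around the command line" idea concrete, here is a minimal sketch of calling the tesseract binary through `subprocess` by hand. It is not pytesseract's actual implementation, which differs in its details; `build_tesseract_command` and `run_tesseract` are hypothetical names, and the sketch assumes the `tesseract` executable is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, lang=None):
    """Build the argument list for: tesseract <image> stdout [-l <lang>]."""
    cmd = ["tesseract", image_path, "stdout"]  # "stdout" prints the text instead of writing a file
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, lang=None):
    """Run the tesseract CLI and return the recognized text."""
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return completed.stdout
```

With this sketch, `run_tesseract('test-european.jpg', lang='fra')` would play the same role as the `image_to_string` call above; the wrapper saves you from writing and maintaining this plumbing yourself.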

Novak
  • Thanks a lot, now I understand... Do you have any idea on how to install tesserocr? If you have it installed what are the steps you followed and what version of Visual Studio you are using. Thank you again! – Soufiane S Feb 19 '19 at 09:25
  • I have already installed tesseract for Windows, I need to install [tesserocr](https://pypi.org/project/tesserocr/) for python but it fails... – Soufiane S Feb 19 '19 at 09:36
  • 1
    Then download the desired version from [here](https://github.com/simonflueckiger/tesserocr-windows_build/releases) and then just run `pip install .whl` – Novak Feb 19 '19 at 09:38
  • 6
    This does not answer what is tesserocr, which is different from tesseract-ocr, as explained in https://stackoverflow.com/a/56387215/4974791 – Guillermo González de Garibay Apr 22 '20 at 11:22