
I am using Python 3.5.2 and pytesseract. When I run my code I get the error TypeError: a bytes-like object is required, not 'str' (details below).

Code (file "D:/test.py"):

# -*- coding: utf-8 -*-

try:
    import Image
except ImportError:
    from PIL import Image

import pytesseract


print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
print(pytesseract.image_to_string(Image.open('d:/testimages/mobile.gif')))

Error:

Traceback (most recent call last):
  File "D:/test.py", line 11, in <module>
    print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 164, in image_to_string
    errors = get_errors(error_string)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 112, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 112, in <genexpr>
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
TypeError: a bytes-like object is required, not 'str'

What should I do?

Edit:

I have downloaded the training data into C:\Program Files (x86)\Tesseract-OCR\tessdata, like this:

[screenshot]

I also inserted the line error_string = error_string.decode("utf-8") into get_errors(). Now the error is:

Traceback (most recent call last):
  File "D:/test.py", line 11, in <module>
    print(pytesseract.image_to_string(Image.open('d:/testimages/name.gif'), lang='chi_sim'))
  File "C:\Users\dell\AppData\Local\Programs\Python\Python35\lib\site-packages\pytesseract\pytesseract.py", line 165, in image_to_string
    raise TesseractError(status, errors)
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\tessdata/chi_sim.traineddata')
zwl1619

1 Answer

This is a known bug in pytesseract; see issue #32:

Error parsing of tesseract output is brittle: a bytes-like object is required, not 'str'

and

There actually is an error in tesseract. But on the Python end the error occurs because error_string is returning a byte-literal, and the geterrors call appears to have trouble with it
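
To illustrate the Python side of this, here is a minimal sketch. The byte string is a made-up stand-in for Tesseract's stderr output, not real captured output. Under Python 3, calling find() with a str argument on a bytes object raises a TypeError, while decoding first lets the same search work:

```python
# Hypothetical stand-in for the raw stderr bytes that pytesseract reads.
error_string = b"Error opening data file chi_sim.traineddata"

try:
    # Searching bytes for a str pattern is a TypeError under Python 3.
    error_string.find('Error')
    message = None
except TypeError as exc:
    message = str(exc)

# Decoding to str first lets the same search succeed.
decoded = error_string.decode("utf-8")
position = decoded.find('Error')

print(message)
print(position)
```

The exact wording of the TypeError message differs between Python versions, so the sketch prints it rather than assuming it.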

There are two workarounds. Either install the training data for the language in question (see "Tesseract running error"), or edit site-packages\pytesseract\pytesseract.py and insert an extra line at the top of the get_errors() function (at line 109):

error_string = error_string.decode("utf-8")

The function then reads:

def get_errors(error_string):
    '''
    returns all lines in the error_string that start with the string "error"
    '''

    error_string = error_string.decode("utf-8")
    lines = error_string.splitlines()
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    if len(error_lines) > 0:
        return '\n'.join(error_lines)
    else:
        return error_string.strip()
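
If you would rather not assume the type of error_string, a slightly more defensive variant could decode only when it receives bytes. This is a sketch, not the library's code: the name get_errors_tolerant and the errors="replace" choice are my own assumptions.

```python
def get_errors_tolerant(error_string):
    """Hypothetical variant of get_errors() that accepts bytes or str."""
    if isinstance(error_string, bytes):
        # errors="replace" avoids a UnicodeDecodeError on non-UTF-8 output.
        error_string = error_string.decode("utf-8", errors="replace")
    lines = error_string.splitlines()
    # Keep only the lines that mention 'Error', as the original function does.
    error_lines = tuple(line for line in lines if 'Error' in line)
    if error_lines:
        return '\n'.join(error_lines)
    return error_string.strip()
```

With this version the function behaves the same whether the caller passes the raw stderr bytes or an already decoded string.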
Martijn Pieters
  • @zwl1619: I'm not *that* familiar with how pytesseract works. Fixing the encoding error shows that the training data is not installed the way it is expected to be. That error was being raised before too, but because of the encoding issue you never saw it. Perhaps it's some kind of permission issue? – Martijn Pieters Dec 29 '16 at 13:46
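
One way to narrow this down is to check whether the file named in the second traceback actually exists from the script's point of view. The sketch below is only an assumption about how to investigate, and check_traineddata is a hypothetical helper, not part of pytesseract. Note that the path in the error message starts with \Program Files (x86) and has no drive letter, so on Windows it may be resolved against the current drive (D: for this script) rather than C:. That is a guess from the message, not a confirmed cause.

```python
import os


def check_traineddata(tessdata_dir, lang):
    """Hypothetical helper: return (absolute_path, exists) for <lang>.traineddata."""
    path = os.path.join(tessdata_dir, lang + ".traineddata")
    # abspath() resolves a path without a drive letter against the current
    # drive and directory, which shows where the file is really looked up.
    path = os.path.abspath(path)
    return path, os.path.isfile(path)


# Example call, using the directory exactly as it appears in the error message:
# print(check_traineddata(r"\Program Files (x86)\Tesseract-OCR\tessdata", "chi_sim"))
```

If the printed path points at a different drive or folder than the one holding the training data, that would explain the "Error opening data file" message.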