I'm trying to get pytesseract to preserve interword spacing when it reads an image. This is especially important when scanning poetry.

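 from PIL import Image
 import pytesseract
 # `file` is the path to the scanned image
 img1 = Image.open(file)
 # ask Tesseract to keep runs of spaces between words; psm 4 = single column of text
 custom_config = r'-c preserve_interword_spaces=1 --psm 4'
 str4 = pytesseract.image_to_string(img1, config=custom_config)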

I have also tried every psm mode and various other config options. I'm using the most up-to-date version of pytesseract, which is 0.3.7.

This question has already been asked many times, most notably here: Preserving Spaces in Tesseract. However, the solution there is not satisfactory. It recommends the following page: https://github.com/tesseract-ocr/tesseract/issues/781 but on that page they assert that the problem was solved in this commit: https://github.com/tesseract-ocr/tesseract/commit/e62e8f5f802c0d8f3dd67da993327cdafaee9763 and that fix seems to require upgrading to Tesseract 5.0. I can't figure out how to do that on a Mac, since brew install only installs Tesseract 4.0.

I think if I could install tesseract 5.0 then that might solve the problem.

##################

UPDATE

OK, I have confirmation on another site that I do have to upgrade to Tesseract 5.0. brew install does not offer that version on a Mac, so I guess I have to learn how to pull Tesseract 5.0 straight from GitHub, which I'm not very good at doing.

bobsmith76

1 Answer

You probably will have to clone the repository and build it.

https://github.com/tesseract-ocr/tesseract

https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos

Btw, preserve_interword_spaces works in Tesseract 4.1.1 also, if you can install that version.
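
Before building from source, it may be worth checking which Tesseract version pytesseract is actually calling. Here is a minimal sketch, assuming the `tesseract` binary is on your PATH and that `tesseract --version` prints a line such as `tesseract 4.1.1`. The helper names `parse_tesseract_version` and `installed_tesseract_version` are my own, not part of pytesseract.

```python
import re
import subprocess


def parse_tesseract_version(text):
    """Return (major, minor, patch) from the first x.y.z number in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in tesseract output")
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Run `tesseract --version` and parse its output.

    Some builds print the banner to stderr rather than stdout,
    so both streams are searched (stdout first).
    """
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    return parse_tesseract_version(result.stdout + result.stderr)
```

If `installed_tesseract_version()` returns `(4, 1, 1)` or later, the `preserve_interword_spaces=1` option from the question should be worth retrying before going through a full build.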

nguyenq
Actually, it turns out that it can preserve interword spaces but not the space at the beginning of a line, i.e. the indentation. I'm working on a workaround now, but it is not easy. – bobsmith76 Dec 26 '20 at 03:32
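
One possible workaround for the missing indentation, offered as a rough sketch rather than a tested solution: `pytesseract.image_to_data` returns the pixel position of every word, so the leading spaces can be estimated from each word's `left` coordinate. The functions `rebuild_lines` and `ocr_with_layout` below are hypothetical helpers, and the approach assumes a roughly uniform character width, so with proportional fonts the columns will only be approximate.

```python
def rebuild_lines(data, char_width=None):
    """Rebuild text from image_to_data output (Output.DICT form),
    placing each word at a column estimated from its left pixel edge."""
    # Group word indices by the line they belong to, in reading order.
    lines = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip page/block/line rows and empty words
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines.setdefault(key, []).append(i)
    if not lines:
        return ""

    words = [i for idxs in lines.values() for i in idxs]
    if char_width is None:
        # Assumption: average pixels per character across all words.
        total_px = sum(data["width"][i] for i in words)
        total_chars = sum(len(data["text"][i]) for i in words)
        char_width = total_px / total_chars

    # The leftmost word on the page defines column 0.
    margin = min(data["left"][i] for i in words)

    out = []
    for idxs in lines.values():
        line = ""
        for i in idxs:
            col = round((data["left"][i] - margin) / char_width)
            if line:
                col = max(col, len(line) + 1)  # keep at least one space
            line += " " * (col - len(line)) + data["text"][i]
        out.append(line)
    return "\n".join(out)


def ocr_with_layout(image_path):
    """Run Tesseract on an image and return text with estimated indentation."""
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return rebuild_lines(data)
```

Calling `ocr_with_layout(file)` in place of `image_to_string` would then give one string per page. Because every word is positioned by column, this also approximates wide gaps inside a line, not just the indentation at its start.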