I'm trying to get pytesseract to preserve interword spacing when reading an image. This is especially important when scanning poetry.
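from PIL import Image
import pytesseract

# `file` is the path to the scanned image
img1 = Image.open(file)
# Ask Tesseract to keep runs of spaces; --psm 4 assumes a single column of variable-size text
custom_config = r'-c preserve_interword_spaces=1 --psm 4'
str4 = pytesseract.image_to_string(img1, config=custom_config)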
I have also tried every psm mode and other config options. I'm also using the most up-to-date version of pytesseract, which is 0.3.7.
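As a possible fallback that does not depend on the flag, the spacing can be approximated from word positions instead. The sketch below assumes a dict shaped like the output of `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, with the keys `text`, `left`, `width`, `block_num`, `par_num` and `line_num`. The function name `rebuild_spacing` and the `char_width` parameter are made up for illustration, and the pixels-per-character estimate is a rough heuristic that only really suits evenly spaced type.

```python
from collections import defaultdict


def rebuild_spacing(data, char_width=None):
    """Rebuild text lines, placing each word at a column derived from its left pixel offset."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip the empty entries that mark blocks, paragraphs and lines
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], data["width"][i], word))

    words = [w for ws in lines.values() for w in ws]
    if not words:
        return ""
    if char_width is None:
        # Heuristic: total word width in pixels divided by total characters
        char_width = sum(w for _, w, _ in words) / sum(len(t) for _, _, t in words)
    margin = min(left for left, _, _ in words)  # leftmost word defines column 0

    out = []
    for key in sorted(lines):
        text = ""
        for left, _, word in sorted(lines[key]):
            col = round((left - margin) / char_width)
            # Keep at least one space between neighbouring words
            pad = max(col - len(text), 1 if text else 0)
            text += " " * pad + word
        out.append(text)
    return "\n".join(out)
```

With pytesseract installed, the input would come from something like `rebuild_spacing(pytesseract.image_to_data(img1, output_type=pytesseract.Output.DICT))`. Indentation and wide gaps are only as accurate as the character-width estimate.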
This question has already been asked many times. Most notably here:
Preserving Spaces in Tesseract
However, the solution there is not satisfactory. It recommends looking at the following page:
https://github.com/tesseract-ocr/tesseract/issues/781
But on that page they assert that the problem has been solved by this commit:
https://github.com/tesseract-ocr/tesseract/commit/e62e8f5f802c0d8f3dd67da993327cdafaee9763
But from that page it seems that you have to upgrade to Tesseract 5.0, and I can't figure out how to do that on a Mac, since brew install only installs Tesseract 4.0.
I think if I could install tesseract 5.0 then that might solve the problem.
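One thing worth checking first: 0.3.7 is the version of the pytesseract wrapper, not of the Tesseract engine it calls. A minimal sketch for finding out which engine is actually on the PATH is below. The helper names `parse_major_version` and `installed_tesseract_major` are made up for illustration, and it assumes the first line of `tesseract --version` looks like `tesseract 4.1.1`.

```python
import re
import shutil
import subprocess


def parse_major_version(version_output):
    """Return the major version from the first line of `tesseract --version`, or None."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"(\d+)\.\d+", first_line)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Run the `tesseract` found on PATH and report its major version, or None if absent."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds print the version banner on stderr, so check both streams
    return parse_major_version(result.stdout or result.stderr)
```

If `installed_tesseract_major()` returns 4, the engine is still the 4.x build regardless of the pytesseract version. pytesseract also provides `pytesseract.get_tesseract_version()`, which should report the same engine version.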
##################
UPDATE
OK, I have confirmation from another site that I do have to upgrade to Tesseract 5.0. brew install does not provide that on a Mac. So I guess I have to learn how to pull Tesseract 5.0 straight from GitHub, which I'm not very good at doing.