pytesseract: good OCR or good Lines - never both

Question

I'm using pytesseract (Tesseract version 3.05) to OCR (Optical Character Recognition) a printed PDF bill that was digitally created. I pre-process it to remove any color, converting it to pure black and white at 600 DPI. It contains proprietary information so I can't post it here, but trust me when I say it is perfectly straight and very clear.

When processing the images, I've been playing with various Page Segmentation Modes (PSM).

A few PSMs (e.g. 11 and 12) recognize the characters brilliantly, nearly perfectly, but a single line becomes multiple lines that often get shuffled, making data parsing functionally impossible.

Other PSMs (e.g. 3 and 4) keep the lines perfectly intact (which is helpful for data parsing), but the OCR is terrible: spaces are inserted, dashes become apostrophes, an 'l' becomes a '1' or even an 'i', etc.

I've tried all PSMs and can't find one that lets me keep both the line structure and the OCR quality.

Are there additional dials I can turn to allow me to do both, and maybe further increase the quality of the resultant text?

Code:

import pytesseract

psm_version = 3  # page segmentation mode; Tesseract 3.x takes -psm, 4.x and later take --psm
text = pytesseract.image_to_string(b_w_file, lang='eng', config='-psm {}'.format(psm_version))

Not impossible, but admittedly more difficult. I was hoping someone had some experience with py/tesseract, knew of this issue, and could provide guidance without an image. I'll try to post an image that redacts the relevant data soon. – elPastor, Jun 08 '19 at 20:20

score 1 · Answer 1 · answered Jun 13 '19 at 20:14

I'm not familiar with pytesseract but I have messed around with the C# port pretty extensively. I am feeding it .tiffs, and the irony is that the higher I make the DPI of the .tiff, the worse Tesseract seems to perform. I found the sweet spot at around 119 DPI. The solution that works for me is to create two .tiffs: one at high DPI for my output and one at low DPI that I feed to Tesseract. I have the Tesseract iterator pass me the coordinates of the bounding boxes it finds, and then I use those coordinates on the higher-DPI .tiff to do what I am trying to do. It's not the most efficient process, so I have since moved on to other options and do not have the code anymore. Hope this helps!
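
Since the original code is gone, here is only a rough Python sketch of the two-resolution idea described above, not the C# implementation. It assumes pytesseract's image_to_data (which needs Tesseract 3.05 or later) and Pillow images; the helper names scale_box, word_boxes and crop_words are made up for illustration, and the DPI values are whatever you rendered the two images at.

```python
def scale_box(box, low_dpi, high_dpi):
    """Map a (left, top, width, height) box from low-DPI pixels to high-DPI pixels."""
    factor = high_dpi / low_dpi
    return tuple(round(value * factor) for value in box)


def word_boxes(low_dpi_image, psm=3, psm_flag="-psm"):
    """Run Tesseract on the low-DPI image and return (text, box) pairs, one per word.

    psm_flag is "-psm" for Tesseract 3.x; Tesseract 4 and later expect "--psm".
    """
    # Imported lazily so scale_box can be used without Tesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        low_dpi_image,
        lang="eng",
        config="{} {}".format(psm_flag, psm),
        output_type=Output.DICT,
    )
    results = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip the empty entries Tesseract emits for blocks and lines
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            results.append((text, box))
    return results


def crop_words(high_dpi_image, low_dpi_image, low_dpi, high_dpi):
    """Locate words on the low-DPI image, then crop the same regions from the high-DPI one."""
    crops = []
    for text, box in word_boxes(low_dpi_image):
        left, top, width, height = scale_box(box, low_dpi, high_dpi)
        # Pillow's crop takes (left, upper, right, lower).
        crops.append((text, high_dpi_image.crop((left, top, left + width, top + height))))
    return crops
```

Both images have to be rendered from the same page so that the coordinates differ only by the DPI ratio; scale_box does nothing more than multiply by that ratio and round to whole pixels.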

Will Jackson


Wow, 119, very odd. Do you happen to remember which PSM you use? – elPastor Jun 13 '19 at 20:16
I had to go back and look over the GitHub page for the PSM descriptions, but it's probably 3 or 2. I think lowering the DPI helped it identify things more easily, while the PSM maintained my formatting as you described above. – Will Jackson Jun 13 '19 at 20:36
Well, thank you for the suggestion, but unfortunately it's not helping. The font at 119 DPI is nearly illegible, and the OCR seems to improve linearly with an increase in DPI. Unfortunately my computer doesn't have the memory to handle anything above 600 so that dial is fully turned. – elPastor Jun 13 '19 at 21:39
