7

I tried to force Tesseract to use only my word list when performing OCR. First, I copied the bazaar file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file:

load_system_dawg F
load_freq_dawg F
user_words_suffix user-words

Then I created eng.user-words in /usr/share/tesseract-ocr/5/tessdata. This is my user-words file:

Items
VAT
included
CASH

Then I ran OCR on this image with the command: tesseract -l eng --oem 2 test_small.jpg stdout bazaar

test_img

This is my result:

2 Item(s) (VAT includsd) 36,000
casH 40,000
CHANGE 4. 000

As you can see, includsd is not in my user-words file; it should be 'included'. Besides, I get the same result even without passing the bazaar config on the command line. It looks like my bazaar and eng.user-words files have no effect on the OCR output. How can I use the bazaar and user-words configs to get the desired result?

voxter
  • Did you ever find a solution to this? My interpretation of the documentation is the same as yours, in that you should be able to provide a 'whitelist' of words. – atlas_scoffed May 19 '22 at 10:53
  • At that time, I didn't find any solution and gave up. But it has been 2 years, so you should check the documentation. It may be supported now @comfytoday – voxter May 20 '22 at 01:51
  • This document is somewhat relevant: https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html But in terms of using only the user words supplied, I can't find anything like that. I don't think it's possible without compiling your own dictionary. – atlas_scoffed May 20 '22 at 03:36

2 Answers

0

All you need to do is up-sample the image.

If you up-sample it by a factor of two, Tesseract now reads:

2 Item(s) (VAT included) 36,000
CASH 40,000
CHANGE 4,000

Code:

import cv2
import pytesseract

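# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread("4nGXo.jpg")
if img is None:
    raise FileNotFoundError("Could not read 4nGXo.jpg")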

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Up-sample
gry = cv2.resize(gry, (0, 0), fx=2, fy=2)

# OCR
print(pytesseract.image_to_string(gry))

# Display
cv2.imshow("", gry)
cv2.waitKey(0)
Ahmet
  • Your approach is about resizing the image to get a better result. But I want to configure Tesseract to use only the words in my dictionary; my question is not about getting a better result. – voxter Mar 29 '21 at 04:47
-1

user_words_suffix does not seem to work with --oem 2. A workaround is to use user_words_file, set to the path of your user-words file.
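
As a rough illustration of the workaround, the settings can also be passed with Tesseract's `-c VAR=VALUE` command-line option instead of a config file. The sketch below only builds the argument list and does not run Tesseract. `build_tesseract_command` is a hypothetical helper name, and the paths are the ones from the question. Whether these settings actually change the output under --oem 2 is not confirmed in this thread.

```python
def build_tesseract_command(image_path, user_words_path, lang="eng", oem=2):
    """Build a tesseract argument list that points at an explicit user-words file.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        # Same settings as the bazaar config, passed as -c VAR=VALUE pairs.
        "-c", "load_system_dawg=F",
        "-c", "load_freq_dawg=F",
        # Full path to the word list instead of user_words_suffix.
        "-c", f"user_words_file={user_words_path}",
    ]


cmd = build_tesseract_command(
    "test_small.jpg",
    "/usr/share/tesseract-ocr/5/tessdata/eng.user-words",
)
print(" ".join(cmd))
# To actually run it (requires tesseract on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

With pytesseract, the same options would go into the `config` string of `image_to_string`. Either way, the word list should be treated as a hint to the recognizer, not as a whitelist.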

jtmayer
  • Can you post an example command for me? I cannot find your option in the Tesseract manual: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc – voxter Dec 12 '19 at 15:08
  • I added this option to my bazaar file, but it doesn't work: `user_words_file /usr/share/tesseract-ocr/5/tessdata/eng.user-words` – voxter Dec 12 '19 at 15:17
  • @voxter What do you mean by "it doesn't work"? Do you get an error message? A user-words file is not a word whitelist; Tesseract does not use only the words from the user-words file. – jtmayer Dec 13 '19 at 05:12
  • "It doesn't work" means that I got the same result as before using this option – voxter Dec 13 '19 at 10:49
  • Do we have any other ideas? – voxter Dec 16 '19 at 01:20
  • @voxter As I mentioned in my comment above, the user-words file is only a hint for Tesseract; it does not mean that only words from this list are recognized. I think the only thing you could do is provide a better image. – jtmayer Dec 16 '19 at 07:20
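
Since the user-words file is only a hint, one possible alternative is to correct the OCR output afterwards by snapping each token to its closest entry in the word list. This is a post-processing step, not a Tesseract feature, and nobody in this thread proposed it. The sketch below uses Python's standard `difflib`. The helper name `snap_to_word_list` and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Word list from the question's eng.user-words file.
USER_WORDS = ["Items", "VAT", "included", "CASH"]


def snap_to_word_list(token, words=USER_WORDS, cutoff=0.8):
    """Return the closest word-list entry (case-insensitive), or the token unchanged.

    Hypothetical helper: the cutoff is an assumed similarity threshold.
    """
    # Map lowercase forms back to the spelling used in the word list.
    lookup = {w.lower(): w for w in words}
    matches = difflib.get_close_matches(
        token.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else token


for token in ["includsd", "casH", "CHANGE"]:
    print(token, "->", snap_to_word_list(token))
```

Tokens with no sufficiently close entry, such as CHANGE here, are left as they are. A lower cutoff corrects more misreads but also risks rewriting words that were recognized correctly.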