
I'm using pytesseract to detect strings that follow a certain pattern in images. As far as I understand, user patterns should help Tesseract scan more accurately for strings of a given form, but I can't figure out how to put that to work. This question clarifies that I have to use the config argument (pytesseract.pytesseract.image_to_string(image, config='...')), but I didn't understand how to apply that to my case.

I'm trying to find strings that match this regex pattern: \d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}. How should I express that in user patterns to help Tesseract produce a better OCR result?

aabujamra

3 Answers


This is a little hard to find. Yes, user patterns did not work well in older versions of Tesseract.

Refer to this pull request on GitHub.

I eventually found an example of how to use user patterns in Tesseract. In your case, you could try:

  1. First, make sure your Tesseract version is >= 4.0. (I recommend installing Tesseract 5.x, because that is what I used on my PC.)

  2. Create a file called xxx.patterns with the following content (use UNIX line endings, i.e. the line-feed character, and leave a blank line at the end):

\d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}
 
  3. Then try:
pytesseract.image_to_string("test.png", config="--user-patterns yourpath/xxx.patterns")

With that, it worked for me (this is an example from the documentation).


Also you could refer to this documentation.
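
The steps above can be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the helper names write_patterns_file, build_config and ocr_with_patterns are hypothetical, xxx.patterns is the placeholder file name from the answer, and whether Tesseract accepts a given pattern line still depends on its own pattern syntax.

```python
from pathlib import Path


def write_patterns_file(path, patterns):
    """Write one pattern per line with LF endings and a trailing blank line."""
    text = "".join(pattern + "\n" for pattern in patterns) + "\n"
    # newline="\n" stops Python from translating LF to CRLF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)
    return Path(path)


def build_config(patterns_path):
    """Build the config string passed to pytesseract (hypothetical helper)."""
    return "--user-patterns {}".format(patterns_path)


def ocr_with_patterns(image_path, patterns_path):
    """Run OCR with the patterns file; needs pytesseract and Tesseract installed."""
    import pytesseract  # imported lazily so the helpers above work without it

    return pytesseract.image_to_string(image_path, config=build_config(patterns_path))
```

A call such as ocr_with_patterns("test.png", write_patterns_file("xxx.patterns", [...])) would then mirror step 3. A path that contains spaces may need quoting inside the config string.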

jizhihaoSAMA

This might not be the answer you are looking for, but I faced a similar problem with Tesseract a few months ago. You might want to take a look at whitelisting, more specifically whitelisting all digits, like this:

pytesseract.image_to_string(question_img, config="-c tessedit_char_whitelist=0123456789. --psm 6")
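
If the whitelist or page segmentation mode needs to vary, the config string can be assembled by a small helper. This is just a sketch: build_whitelist_config is a hypothetical name, and it assumes the --psm spelling used by Tesseract 4 and later.

```python
def build_whitelist_config(allowed_chars="0123456789.", psm=6):
    """Return a Tesseract config string restricting output to allowed_chars."""
    # -c sets a config variable; --psm selects the page segmentation mode.
    return "-c tessedit_char_whitelist={} --psm {}".format(allowed_chars, psm)


print(build_whitelist_config())
```

The result can be passed as the config argument of pytesseract.image_to_string.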

Whitelisting, however, did not work for me, so I ended up using OpenCV's k-nearest neighbours (KNN) classifier. This does mean you need to know where each character is located. First I stored some images of the characters I wanted to recognize, adding each detection to a temporary file:

frame[y:y + h, x:x + w].copy().flatten()

After labeling all those detections, I trained a model on them using the previously mentioned KNN:

network = cv2.ml.KNearest_create()
network.train(data, cv2.ml.ROW_SAMPLE, labels)
network.save('pattern')

Now all characters can be classified using:

chars = [
    frame[y1:y1 + h, x1:x1 + w].copy().flatten(), #char 1
    frame[y2:y2 + h, x2:x2 + w].copy().flatten(), #char 2
    frame[yn:yn + h, xn:xn + w].copy().flatten(), #char n
]

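import cv2
import numpy as np

output = ''
# Load the model saved above (KNearest_load is available in recent OpenCV releases)
network = cv2.ml.KNearest_load('pattern')
for char in chars:
    sample = np.array([char], dtype=np.float32)  # findNearest expects a 2-D float32 array
    ret, results, neighbours, dist = network.findNearest(sample, 3)
    # Append instead of overwriting; assumes the training labels were the digit values
    output += '{0}'.format(int(results[0][0]))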
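
To make clearer what a nearest-neighbour lookup with k = 3 does, here is a small pure-Python illustration. It is not the OpenCV implementation: knn_predict is a hypothetical helper, and the tiny 2x2 "images" and their labels are made-up data standing in for the flattened character crops.

```python
from collections import Counter


def knn_predict(train_samples, train_labels, sample, k=3):
    """Classify sample by majority vote among its k nearest training samples."""
    distances = []
    for features, label in zip(train_samples, train_labels):
        # Squared Euclidean distance between flattened pixel vectors.
        dist = sum((a - b) ** 2 for a, b in zip(features, sample))
        distances.append((dist, label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2x2 "images", flattened to 4 values each (made-up data for illustration).
train_samples = [
    [0, 0, 0, 0], [10, 0, 0, 5], [0, 5, 5, 0],
    [255, 255, 255, 255], [250, 240, 255, 250], [245, 255, 250, 240],
]
train_labels = ["0", "0", "0", "1", "1", "1"]

# Three unlabeled "characters" to classify, one output character each.
chars = [[3, 2, 0, 1], [251, 249, 252, 250], [0, 0, 4, 4]]
output = "".join(knn_predict(train_samples, train_labels, c) for c in chars)
print(output)
```

Each character crop is compared with every labeled sample, and the most common label among the three closest ones is appended to the output string.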

After this you can just run your regex on the resulting string. Training and labeling took me only about 2 hours in total, so it should be quite doable.
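
Running the regex on the recognized string could look like the sketch below. The pattern is copied verbatim from the question; find_codes and the sample string are hypothetical and only there for illustration.

```python
import re

# Pattern taken verbatim from the question.
CODE_PATTERN = re.compile(r"\d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}")


def find_codes(ocr_text):
    """Return every substring of ocr_text that matches the question's pattern."""
    return CODE_PATTERN.findall(ocr_text)


# Made-up OCR output with one matching code surrounded by noise.
sample = "noise 12345.67890 .123456 12345.678901 2 12345678901234 tail"
print(find_codes(sample))
```

Because the pattern has no capture groups, findall returns the full matched strings, so an empty list means nothing in the OCR output had the expected shape.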

Jop Knoppers

Tesseract OCR doesn't support regular expressions. User patterns use Tesseract's own format for pattern recognition, which is a much-reduced subset of regex syntax. See the comment above 'read_pattern_list()' in https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h