Applying user patterns in pytesseract

Question

I'm using pytesseract to detect strings that follow a certain pattern in images. As far as I understand, using user patterns correctly should help pytesseract scan more accurately for that kind of string. However, I can't figure out how to put that to work. This question clarifies that I must use the config argument (pytesseract.pytesseract.image_to_string(image, config='), but I couldn't work out how to apply that to my case.

I'm trying to find this regex pattern: \d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}. How should I express that as a user pattern so that Tesseract does a better job of recognizing it?

jizhihaoSAMA · Answer 1 · 2020-07-03T08:42:15.720

4

This is a little hard to find. Yes, user patterns did not work well in older versions of Tesseract.

Refer to this Pull Request on github.

Eventually I found an example of how to use user patterns in Tesseract. In your case, you could try the following:

First, make sure your Tesseract version is >= 4.0. (I recommend installing Tesseract 5.x, because that is what I used on my PC.)
Create a file called xxx.patterns with the following content (use UNIX line endings, i.e. the line-feed character, and leave a blank line at the end):

\d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}

Then try to use:

pytesseract.image_to_string("test.png", config="--user-patterns yourpath/xxx.patterns")

Finally, it worked for me (this is an example from the documentation).

You could also refer to this documentation.
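
The file-creation step above can be sketched in Python. This is only a sketch under a few assumptions: the helper names (write_patterns_file, build_config, ocr_with_patterns) are made up for illustration, the path contains no spaces, and pytesseract is imported lazily so the file helpers can be tried without Tesseract installed. The one detail that matters is newline="\n", which keeps the LF line endings mentioned above even on Windows.

```python
from pathlib import Path


def write_patterns_file(path, patterns):
    """Write one pattern per line, with LF line endings and a trailing newline."""
    text = "\n".join(patterns) + "\n"
    # newline="\n" stops Python from translating "\n" into "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)
    return Path(path)


def build_config(patterns_path):
    """Build the config string from the answer (assumes a path without spaces)."""
    return f"--user-patterns {patterns_path}"


def ocr_with_patterns(image_path, patterns_path):
    """Run OCR with the user-patterns file; needs pytesseract and Tesseract."""
    # Imported here so the two helpers above work without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image_path, config=build_config(patterns_path))
```

Typical use would be write_patterns_file("xxx.patterns", [your_pattern]) followed by ocr_with_patterns("test.png", "xxx.patterns").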

edited Jul 03 '20 at 08:42

answered Jul 02 '20 at 16:30

jizhihaoSAMA


Thanks @jizhihaoSAMA. Do you know whether applying user patterns actually helps the pytesseract AI find those patterns, or is it simply an embedded regex? – aabujamra Jul 06 '20 at 10:30
@abutremutante Not sure, but I think it should. – jizhihaoSAMA Jul 06 '20 at 12:11
What format should the my.patterns file be? txt? – aabujamra Jul 06 '20 at 21:24
@abutremutante On my PC, I created a txt file and renamed it to `xx.patterns`. – jizhihaoSAMA Jul 07 '20 at 00:11

score 1 · Answer 2 · answered Jul 02 '20 at 07:30

This might not be the answer you are looking for, but I faced a similar problem with Tesseract a few months ago. You might want to take a look at whitelisting, more specifically whitelisting all digits, like this:

pytesseract.image_to_string(question_img, config="-c tessedit_char_whitelist=0123456789. --psm 6")

This, however, did not work for me, so I ended up using OpenCV's k-nearest neighbours (KNN) classifier. This does mean you need to know where each character is located, though. First I stored some images of the characters I wanted to recognize, and added those detections to a temporary file:

frame[y:y + h, x:x + w].copy().flatten()

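After labeling all those detections, I trained a model on them using the KNN mentioned above.

import cv2

# data: float32 array, one flattened character image per row; labels: one label per row
network = cv2.ml.KNearest_create()
network.train(data, cv2.ml.ROW_SAMPLE, labels)
network.save('pattern')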

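Now all characters can be classified with:

import cv2
import numpy as np

chars = [
    frame[y1:y1 + h, x1:x1 + w].copy().flatten(),  # char 1
    frame[y2:y2 + h, x2:x2 + w].copy().flatten(),  # char 2
    frame[yn:yn + h, xn:xn + w].copy().flatten(),  # char n
]

output = ''
# KNearest_load is a static loader that returns the model saved earlier.
network = cv2.ml.KNearest_load('pattern')
for char in chars:
    # findNearest expects a 2-D float32 array with one sample per row.
    sample = char.astype(np.float32).reshape(1, -1)
    ret, results, neighbours, dist = network.findNearest(sample, 3)
    # Append the predicted label instead of overwriting the output;
    # this assumes each label is the digit value itself.
    output += '{0}'.format(int(results[0][0]))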
After this you can just run your regex on the resulting string. Training and labeling took me only about 2 hours in total, so it should be quite doable.
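
That last regex step could look roughly like the sketch below. It uses the pattern from the question as-is; the names TARGET_RE and find_codes are made up for illustration, and collapsing whitespace before matching is my own assumption about what OCR output tends to look like.

```python
import re

# The pattern from the question, compiled once.
TARGET_RE = re.compile(r"\d{5}\.?\d{5} \.?\d{6} ?\d{5}\.?\d{6} ?\d ?\d{14}")


def find_codes(ocr_text):
    """Return every substring of the OCR output that matches the pattern."""
    # Collapse line breaks and repeated spaces into single spaces first,
    # since the pattern only allows single spaces between groups.
    normalised = " ".join(ocr_text.split())
    return [match.group(0) for match in TARGET_RE.finditer(normalised)]
```

You would call find_codes(output) on the string produced by the classifier (or on whatever pytesseract.image_to_string returns) and get back a list of matches, which is empty if nothing fits.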

score 0 · Answer 3 · answered Jun 16 '23 at 15:12

Tesseract OCR doesn't support regular expressions. User patterns use Tesseract's own format, which is a much reduced subset of regex syntax. See the comment above 'read_pattern_list()' in https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h
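
If that is right, a regex like the one in the question would have to be rewritten without {n} counts and ? optionals before it goes into a patterns file. Below is a rough sketch of such a rewrite. It is plain string manipulation, not a Tesseract API: the function name expand_regex is made up, it assumes that a bare dot is a literal in the patterns format and that \d is kept as-is, and I have not verified that Tesseract accepts patterns containing spaces.

```python
import itertools
import re

# One token: an escaped character (such as \d or \.) or a single literal
# character, optionally followed by a {n} count and/or a ? marker.
TOKEN = re.compile(r"(\\.|[^\\])(?:\{(\d+)\})?(\?)?")


def expand_regex(pattern):
    """Expand {n} counts and ? optionals into a list of flat patterns."""
    choices = []
    pos = 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError(f"cannot parse pattern at offset {pos}")
        atom, count, optional = match.groups()
        if atom == r"\.":
            atom = "."  # assumption: a bare dot is a literal in a patterns file
        text = atom * int(count or 1)
        # An optional token contributes two alternatives: absent or present.
        choices.append(("", text) if optional else (text,))
        pos = match.end()
    # One output line per combination of the optional tokens.
    return ["".join(parts) for parts in itertools.product(*choices)]
```

For the pattern in the question this yields one line per combination of the six optional tokens, so the resulting file gets long quickly; it may be more practical to list only the variants you actually expect to see.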
