OCR detecting E as £

Question

I am using pytesseract (with Tesseract version 5) to scan an image. I converted the image to black and white to remove noise, but E is still being detected as £196893 .
I also tried setting the language, DPI and PSM values, as most people suggest. Below are the settings I am using now. Please suggest a fix.

pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")

One of the sample pictures is shown below. It works fine for some samples, but for others it produces strange characters like this.

score 0 · Answer 1 · answered Feb 02 '20 at 17:45

A solution to overcome this issue is to limit the characters that Tesseract looks for. To do so you must:

Create a file with an arbitrary name (e.g. "whitelist") in the tesseract config directory. On Linux that directory is usually /usr/share/tesseract/tessdata/configs.
Add a line to that file containing only the characters that you want to search for in the text: tessedit_char_whitelist *list_of_characters*
Then call tesseract using the whitelist config:
tesseract input.tif output nobatch whitelist

In this case the parameters must be set in your Python script as:

pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng nobatch whitelist")
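
If you would rather not create a file in the tessdata configs directory, the same variable can usually be passed inline with tesseract's -c option. The following is only a sketch under some assumptions: the helper names build_whitelist_config and ocr_with_whitelist are made up for illustration, the allowed set (uppercase letters and digits) is a guess at what the samples contain, and whether the whitelist is honoured can depend on the Tesseract version and OCR engine in use.

```python
import string

# Assumed character set for illustration: uppercase letters and digits only,
# so "£" can never be returned. Adjust to whatever your images contain.
ALLOWED_CHARS = string.ascii_uppercase + string.digits


def build_whitelist_config(allowed_chars, psm=6, dpi=120):
    """Build a pytesseract config string that sets tessedit_char_whitelist inline."""
    # pytesseract splits the config string on whitespace, so a whitelist
    # containing spaces would be cut off; reject it rather than fail silently.
    if any(ch.isspace() for ch in allowed_chars):
        raise ValueError("whitelist must not contain whitespace")
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={allowed_chars}"


def ocr_with_whitelist(image_path, allowed_chars=ALLOWED_CHARS):
    """Run OCR on image_path, restricted to allowed_chars (hypothetical helper)."""
    # Imported here so the config helper above works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang="eng",
        config=build_whitelist_config(allowed_chars),
    )
```

Calling ocr_with_whitelist(impath) then takes the place of the image_to_string call above. Passing the language through the lang argument keeps it separate from the whitelist setting.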

Wouldn't setting -l eng be equivalent to this (which I have already done)? £ doesn't seem to be part of the English alphabet. — Sandeep Bhutani, Feb 03 '20 at 06:56
I think it is not equivalent. £ is not a letter of the alphabet but a symbol, and it can be present in the English character set. It is also the symbol for the English pound! — Ivan, Feb 03 '20 at 15:26
