I am using pytesseract (with Tesseract version 5) to scan an image. I have converted the image to black and white to remove noise, but the letter E is still being detected as £ (the output is £196893).
I have also tried setting the language, DPI, and PSM values, as most people suggest. Below are the settings I am using now. Please suggest a fix.

pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")

One of the sample pictures is shown below. It works fine for some samples, but for others it produces strange characters like this.

sample picture

Sandeep Bhutani

1 Answer

A solution to overcome this issue is to limit the characters that Tesseract looks for. To do so you must:

  1. Create a file with an arbitrary name (e.g. "whitelist") in the Tesseract config directory. On Linux that directory is usually /usr/share/tesseract/tessdata/configs.
  2. Add a line to that file listing only the characters you want Tesseract to recognize: tessedit_char_whitelist *list_of_characters*
  3. Then call tesseract with the whitelist config file:
    tesseract input.tif output nobatch whitelist
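
Steps 1 and 2 can also be scripted. Here is a minimal sketch, assuming the config file only needs the single tessedit_char_whitelist line; the helper name write_whitelist_config is hypothetical, and the configs directory you pass in depends on your installation (writing there may require root).

```python
from pathlib import Path


def write_whitelist_config(configs_dir, chars, name="whitelist"):
    """Write a Tesseract config file that restricts recognition to `chars`.

    `write_whitelist_config` is a hypothetical helper, not part of pytesseract.
    """
    path = Path(configs_dir) / name
    # A Tesseract config file holds one "variable value" pair per line.
    path.write_text(f"tessedit_char_whitelist {chars}\n", encoding="utf-8")
    return path
```

For example, write_whitelist_config("/usr/share/tesseract/tessdata/configs", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789") would create the "whitelist" file used in step 3, provided that path matches your system.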

In this case the parameters must be set in your Python script as:

pytesseract.image_to_string(Image.open(impath), config="--dpi 120 --psm 6 -l eng nobatch whitelist")
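
If you would rather not create a config file, the same variable can be passed inline with Tesseract's -c option. The following is only a sketch: build_config and ocr_with_whitelist are hypothetical helper names, and the default whitelist of uppercase letters and digits is an assumption about what your images contain.

```python
import string


def build_config(whitelist, dpi=120, psm=6, lang="eng"):
    """Build a Tesseract option string with an inline character whitelist."""
    # The option string is split on whitespace before it reaches tesseract,
    # so a whitelist containing whitespace would be broken apart.
    if any(ch.isspace() for ch in whitelist):
        raise ValueError("whitelist must not contain whitespace")
    return (
        f"--dpi {dpi} --psm {psm} -l {lang} "
        f"-c tessedit_char_whitelist={whitelist}"
    )


def ocr_with_whitelist(impath, whitelist=string.ascii_uppercase + string.digits):
    """Run OCR on `impath`, recognizing only the characters in `whitelist`."""
    # Imported here so build_config can be used without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(impath) as img:
        return pytesseract.image_to_string(img, config=build_config(whitelist))
```

Since £ is not in the whitelist, Tesseract has to pick one of the allowed characters for that glyph instead.
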
Ivan
  • Wouldn't setting -l eng be equivalent to this (which I have already done)? £ doesn't seem to be part of the English alphabet. – Sandeep Bhutani Feb 03 '20 at 06:56
  • I don't think it is equivalent. £ is not a letter of the alphabet but a symbol, and it can be present in the English character set. It is also the symbol for the British pound! – Ivan Feb 03 '20 at 15:26