I'm trying to convert scanned images to text with Tesseract OCR, and it works great except that my images contain two languages and Tesseract is unable to detect both at once. I can either convert all the images as English (with the Arabic showing up as garbage, not even romanized Arabic) or convert them as Arabic (in which case I get all the Arabic text, with the English parts as garbage).

I have tried to detect the language of the exported text with langDetect, but because the garbage characters are ASCII English letters, it cannot tell the two apart.

I am sharing a sample image here; it would be great if someone could help me find a better solution to this issue.

  • https://stackoverflow.com/questions/24379781/tesseract-how-to-run-tesseract-with-multiple-languages-one-time/27602888 – Seb Nov 17 '19 at 00:33
  • Possible duplicate of [Tesseract: How to run tesseract with multiple languages one time](https://stackoverflow.com/questions/24379781/tesseract-how-to-run-tesseract-with-multiple-languages-one-time) – Marina Aguilar Nov 17 '19 at 00:35

1 Answer

Just update your code to pass both languages at once:

lang = 'eng+ara'

Here ara refers to ara.traineddata, the Arabic language data file; the + joins it with eng so both are used in the same pass.
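In case it helps, here is a minimal sketch of how that lang value might be passed along. The question does not say which wrapper is in use, so this assumes pytesseract with Pillow; join_langs and ocr_mixed are hypothetical helper names, and "scan.png" is only a placeholder.

```python
def join_langs(*codes):
    """Join Tesseract language codes into the 'eng+ara' form."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_mixed(image_path, *codes):
    """Run OCR once with every listed language enabled.

    Assumption: pytesseract and Pillow are installed. They are imported
    lazily so join_langs stays usable without them.
    """
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=join_langs(*codes))


# Example call (placeholder file name, not from the question):
# text = ocr_mixed("scan.png", "eng", "ara")
```

The order of the codes can influence the result, so if the output looks off it may be worth trying 'ara+eng' as well.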

One more thing: the Arabic trained data might not be included in your Tesseract installation, so download ara.traineddata from GitHub and paste it into the tessdata folder of Tesseract OCR.

Here is the link to this traineddata file: link.
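To confirm the file landed in the right place, a small check like the one below may help. It is only a sketch: missing_traineddata is a hypothetical helper, and the tessdata path in the example is an assumed Windows default that will differ per installation.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, lang):
    """Return the codes in an 'eng+ara' style string that have no
    matching <code>.traineddata file in the given tessdata folder."""
    folder = Path(tessdata_dir)
    return [
        code
        for code in lang.split("+")
        if not (folder / f"{code}.traineddata").is_file()
    ]


# Example call (assumed path, adjust to your own installation):
# print(missing_traineddata(r"C:\Program Files\Tesseract-OCR\tessdata", "eng+ara"))
```

An empty list means every requested language file is present. Running `tesseract --list-langs` on the command line should show the same information from Tesseract's own point of view.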

Roberto Caboni