0

I was using tesseract-ocr (pytesseract) for Spanish, and it achieves very high accuracy when the language is set to Spanish and the text is, of course, in Spanish. If the language is not set to Spanish, it does not perform nearly as well. So I'm assuming that Tesseract applies post-processing models for spellchecking to improve performance. Does anybody know which of those models (e.g. edit distance, noisy channel modeling) Tesseract applies? Thanks in advance!

Tomas -
  • Pytesseract is open source and on GitHub. Had you checked that, you would have read that it's a wrapper around [Google Tesseract](https://github.com/tesseract-ocr/tesseract) which is *also* open source and on GitHub. – Jongware Jan 20 '20 at 14:48
  • I already read their wiki on github and did not find what I was looking for. Thank you anyways! – Tomas - Jan 20 '20 at 15:02

1 Answer

0

Your assumption is wrong: if you do not specify a language, Tesseract uses the English model as the default for OCR. That is why you got wrong results for Spanish input text. There is no spellchecking post-processing.
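As a minimal sketch of selecting the language model explicitly, assuming `pytesseract` and Pillow are installed and the Spanish data file (`spa.traineddata`) is available to Tesseract. The helper names `ocr_text` and `tesseract_cli_args` and the file name `scan.png` are hypothetical, chosen only for illustration:

```python
def ocr_text(image_path, lang="spa"):
    """Run Tesseract on an image with an explicit language model."""
    # Imported lazily so this module still loads where the OCR
    # dependencies are not installed.
    import pytesseract
    from PIL import Image

    # lang corresponds to Tesseract's -l option; without it the
    # English model ("eng") is used.
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def tesseract_cli_args(image_path, lang="spa"):
    """Build the roughly equivalent command line (text goes to stdout)."""
    return ["tesseract", image_path, "stdout", "-l", lang]
```

Calling `ocr_text("scan.png")` and `ocr_text("scan.png", lang="eng")` on the same Spanish page is a quick way to compare the two models side by side.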

user898678
  • Sorry, I should've mentioned that I knew the default language is English. Actually, there is spellchecking post-processing, since you're using the default language (English). You can tell because all accents are removed, among many other things. For instance, "oración" is turned into "oracion", and the same goes for every word with " ´ ", since that is not part of the English language. What I have read so far is that Tesseract chooses the best available word based on: top frequent word, ..., top classifier choice word (see "An Overview of the Tesseract OCR Engine", Ray Smith) – Tomas - Jan 22 '20 at 13:06
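
One way to probe how much word lists influence the output is to switch Tesseract's dictionaries off and compare results. The sketch below assumes the `load_system_dawg` and `load_freq_dawg` config variables are honored by the installed Tesseract version and engine mode; how much they change the output may vary. The helper names are hypothetical:

```python
def dictionary_config(use_dictionaries):
    """Build a config string that toggles Tesseract's word-list dictionaries."""
    flag = 1 if use_dictionaries else 0
    # load_system_dawg: main word list; load_freq_dawg: frequent-word list.
    return f"-c load_system_dawg={flag} -c load_freq_dawg={flag}"


def ocr_with_dictionaries(image_path, lang="spa", use_dictionaries=True):
    """OCR an image with the dictionaries switched on or off."""
    # Lazy imports: only needed when OCR is actually run.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        config=dictionary_config(use_dictionaries),
    )
```

If accents such as the one in "oración" are still dropped with the English model when the dictionaries are off, that would suggest the effect comes from the model's character set rather than from a dictionary step.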