How to change the tesseract config to recognize § and apply it with pdftools::pdf_ocr_text in R?

Question

I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is the § character, which tesseract does not recognize.

I looked at the CRAN tesseract package vignette, an SO question about a similar problem, and a GitHub page.

And I tried the following:

I found the configuration files using tesseract_info() and edited the digits file under configs. The digits file content was like this:

tessedit_char_whitelist 0123456789.

After editing it looks like this:

tessedit_char_whitelist 0123456789-$§.

This did not change anything: I am still not able to extract §. It still appears as 8.

After this first attempt failed, I tried the following:

filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)

specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@ß€!$%&§/()=?+"))

text <- tesseract::ocr(filepng, engine = specs)

This failed too. I am by no means an OCR expert, and the tesseract documentation has room for improvement.

How can I correctly add § to the list of recognized characters so that the setting actually takes effect?

Update

The following recognizes §, but only when I remove the language from the argument list:

charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@ß€!$%&§/()=?+"))

text <- tesseract::ocr(filepng, engine = charlist)

But this time I lose the German umlauts. I cannot figure out how to specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts a language argument and an options argument, but the combination does not seem to work. Any ideas?

Update: I tried using tesseract on the command line (macOS Catalina 10.15.7).

I first converted a scanned PDF file to an image, then ran this:

tesseract fileConverted.tiff fileToText

This creates fileToText.txt, and every § is recognized correctly. German umlauts, however, are not recognized correctly, since I did not specify a language. When I use the same command with the language argument

tesseract fileConverted.tiff fileToText -l deu

the German umlauts are recognized properly but § is not.

The digits config file I changed is in this directory:

/usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs

My understanding is that this is not a problem specific to R; it occurs with tesseract itself. Either it is not possible to set tessedit_char_whitelist and the language at the same time, or I am missing something obvious.

It seems that tesseract 4 [does not support setting a whitelist](https://github.com/tesseract-ocr/tesseract/issues/751). Maybe it helps to downgrade to tesseract 3? Or maybe something else changed in the meantime, as the issue is already quite old? — starja, Dec 10 '20 at 20:42
I had a similar problem where it did not recognize the £ symbol. I retrained a model, starting from an existing model and some gold data with image lines: https://github.com/ropensci/tesseract/issues/50 — , Dec 11 '20 at 15:33
`eng.traineddata` seems to have the character § but `deu.traineddata` does not. Can you try with both languages? `-l eng+deu` or `-l deu+eng` — nguyenq, Dec 20 '20 at 18:11
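
The two-language suggestion in the last comment can be tried directly on the command line. The following is only a minimal sketch: it reuses the file names from the question, the helper name `ocr_two_languages` is hypothetical, and it assumes that both `deu.traineddata` and `eng.traineddata` are installed in the tessdata directory. Whether § is then recognized depends on the models in use.

```shell
#!/bin/sh
# Sketch: load the German and English models together, so that characters
# known to either model (umlauts from deu, possibly § from eng) are available.
LANGS='deu+eng'

# Hypothetical helper: OCR one image into <output_base>.txt.
ocr_two_languages() {
    tesseract "$1" "$2" -l "$LANGS"
}

# Only run when tesseract is installed and the input image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f fileConverted.tiff ]; then
    ocr_two_languages fileConverted.tiff fileToText
fi
```

Swapping the order (`eng+deu`) is the other variant mentioned in the comment and may be worth comparing on the same page.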

score 1 · Accepted Answer · answered Dec 14 '20 at 12:15

As noted in the comments, tesseract 4 does not support setting a whitelist. To work around that, you can use command-line options: set the OCR engine mode to "Original Tesseract only" with --oem 0, then pass your whitelist directly with -c tessedit_char_whitelist=abc...

Overall, it should look something like this:

tesseract fileConverted.tiff fileToText --oem 0 -l deu -c tessedit_char_whitelist=0123456789-$§
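
A slightly more defensive version of that command, as a minimal sketch only. The helper name `ocr_legacy_whitelist` is hypothetical, and the file names are the ones from the question. One assumption to check: --oem 0 needs a `deu.traineddata` that contains the legacy model. As far as I know, the tessdata_fast and tessdata_best model sets are LSTM-only, so the file may have to come from the main tessdata repository.

```shell
#!/bin/sh
# Sketch: run Tesseract with the legacy engine (--oem 0) and a whitelist.
# Single quotes keep "$" literal so the shell never tries to expand it.
WHITELIST='0123456789-$§'

# Hypothetical helper: OCR one image into <output_base>.txt.
ocr_legacy_whitelist() {
    tesseract "$1" "$2" --oem 0 -l deu \
        -c "tessedit_char_whitelist=$WHITELIST"
}

# Only run when tesseract is installed and the input image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f fileConverted.tiff ]; then
    ocr_legacy_whitelist fileConverted.tiff fileToText
fi
```

Note that this whitelist restricts output to digits and a few symbols, as in the edited digits config; for full German text, the letters and umlauts would have to be added to `WHITELIST` as well.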
