I would like to read a scanned PDF document into R using tesseract. In general this already works quite well, but I have problems when the documents have a table structure. After some research I found out that there is a parameter that sets the Page Segmentation Method (PSM). The default assumes a page of running text, like a book page, so changing this parameter should improve the results for tables.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#page-segmentation-method

Now I would like to set this PSM parameter, but I don't know what it is called or where to set it. Most instructions and tutorials are for Python, but my project uses R. I have read that you can pass a named list to the options parameter, but I can't find the right entry to put in it.

Your help would be greatly appreciated, I don't know where else to look.

Thanks in advance!

RKF

1 Answer

As far as I understand, you can customize the engine as you see fit by passing parameters through the options argument of the tesseract() function. The page segmentation mode is the tessedit_pageseg_mode parameter. Something like this:

my_engine <- tesseract(options = list(tessedit_pageseg_mode = 1))

Or pass it directly as the engine argument of the ocr() or ocr_data() functions:

library(tesseract)
library(magick)  # provides image_read() and the %>% pipe
text <- image_read("your_image.png") %>%
  ocr(engine = tesseract(options = list(tessedit_pageseg_mode = 1)))
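
Since the question is about a scanned PDF rather than a single image, here is a minimal sketch of how the same idea might be applied page by page. It assumes the pdftools package is installed for rendering pages to PNG. The helper names psm_options and ocr_scanned_pdf, the file name "scan.pdf", the 300 dpi setting, and the choice of mode 6 (a single uniform block of text) are all illustrative assumptions, not something the tesseract package prescribes; it is worth trying a few modes against your own tables.

```r
# Hypothetical helper: validate a PSM value and wrap it as a tesseract option list.
psm_options <- function(psm = 6L) {
  psm <- as.integer(psm)
  # Tesseract defines page segmentation modes 0 through 13.
  stopifnot(length(psm) == 1L, !is.na(psm), psm >= 0L, psm <= 13L)
  list(tessedit_pageseg_mode = psm)
}

# Hypothetical helper: OCR every page of a scanned PDF with the chosen PSM.
ocr_scanned_pdf <- function(pdf_path, psm = 6L, dpi = 300) {
  engine <- tesseract::tesseract(language = "eng", options = psm_options(psm))
  # Render each page to a PNG file in the working directory.
  pngs <- pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi)
  # Remove the temporary PNG files when the function exits.
  on.exit(unlink(pngs), add = TRUE)
  # Return one character string per page.
  vapply(pngs, function(p) tesseract::ocr(p, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage (not run here; "scan.pdf" is a placeholder path):
# tesseract::tesseract_params("pageseg")  # look up the parameter name and default
# pages <- ocr_scanned_pdf("scan.pdf", psm = 6)
# cat(pages[1])
```

The tesseract_params() call in the usage comment is how you can search the available engine parameters from R, which is one way to find names such as tessedit_pageseg_mode without relying on the Python tutorials.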