
I have a pdf that's about 50 pages of scanned tables. I need to eventually scrape it into R so I can clean the data and export it as a .csv. I have experience scraping readable pdfs with tabulizer, but I've never really worked with scanned pdfs before, and tabulizer can't read them.

Looking around online, the farthest I've been able to get is reading the scanned pdf into R as a single character object, but this shifts the formatting around a lot, so the columns of the table are all misaligned and out of order. Even if it were still nicely formatted, I don't know how to then get the character object into a final df.
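
For context, what I've been trying looks roughly like this (a sketch only; pdftools::pdf_ocr_text is one way to end up with the text as a single character object, and the file name is just a placeholder):

library(pdftools)

# OCR each page of the scanned pdf; returns one character string per page
pages <- pdf_ocr_text("scanned_tables.pdf")
# Collapse everything into a single character object -- this is where the
# table columns get scrambled
all_text <- paste(pages, collapse = "\n")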

Is there a way to convert the scanned pdf into a readable pdf so I can scrape it in tabulizer? Or another method for scraping scanned pdfs into tables?

1 Answer


The tesseract::ocr function can read PDF files and convert them to text. You can then process that as an R Markdown document and produce a (probably pretty ugly!) PDF document:

library(tesseract)

eng <- tesseract("eng")

# Minimal YAML header so the output can be rendered with R Markdown
yaml <- '
---
output: pdf_document
---'
# OCR the scanned PDF; the result is one character string per page
text <- tesseract::ocr("scanned.pdf", engine = eng)
# Split the OCR text into lines and prepend the YAML header
lines <- unlist(strsplit(text, "\n"))
lines2 <- c(yaml, lines)
writeLines(lines2, "ocr.Rmd")

Then run R Markdown on that document. You'll get lots of OCR errors, so edit the .Rmd file to fix them, and do it again (and again...).
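
For example, rendering the .Rmd back to a PDF could look like this (a minimal sketch; the output file name is just illustrative):

library(rmarkdown)

# Render the OCR'd text into a new, machine-readable PDF that tabulizer can parse
render("ocr.Rmd", output_file = "ocr.pdf")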

user2554330