I'm using `extract_tables` from the `tabulizer` package to extract tables from a PDF file. Everything works fine, except that a table with fewer than 4 rows (including the header) is not extracted at all. Tables with more than 4 rows are extracted properly.

This is the code I use:

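library(tabulizer)

# Extract every table that tabulizer detects (returns a list of matrices)
text <- extract_tables("file path, file name")
# Stack the matrices into one table; this assumes they all have the same number of columns
table <- do.call(rbind, text)
table <- as.data.frame(table)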

I also tried specifying a fixed area:

text <- extract_tables("file path, file name", area = c(0,0,595,842))

But in this case some columns are missing and others are merged.

Has anyone faced the same issue and found a way to solve it?

zx8754
IKostow
  • Have you tried the `columns` argument? Anyway, in my experience `tabulizer` is never 100% reliable... – Scarabee Apr 07 '17 at 12:17
  • I think you should add the `tabula` tag to your question, and you will be a bit more likely to get answers (`tabula` is the Java library used by `tabulizer`). – Scarabee Apr 07 '17 at 12:19
  • This is an old question, but I was recently struggling with a similar issue and found that the pdftools package can be used to locate the area input to `extract_tables`, which improves reliability considerably. I wrote a blog post walkthrough: https://redwallanalytics.com/2020/04/07/tabulizer-and-pdftools-r-libraries-as-super-powers-part-2/ – David Lucey Apr 08 '20 at 12:43

0 Answers