Tabulizer extraction misses small tables

Asked Apr 07 '17 at 11:13

Active Jan 13 '21 at 08:42

Viewed 655 times

I'm using extract_tables from the tabulizer package to extract tables from a PDF file. Most tables are extracted correctly, but a table with fewer than 4 lines (including headers) is not extracted at all. Tables with more than 4 lines are extracted properly.

This is the code I use:

text <- extract_tables("file path, file name")
table <- do.call(rbind, text)
table <- as.data.frame(table)

I also tried specifying a fixed area:

text <- extract_tables("file path, file name", area = c(0,0,595,842))

But in this case some columns are missing and others are merged.

Has anyone faced the same issue and knows how to solve it?
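One thing worth checking, offered as a sketch rather than a confirmed fix: the tabulizer documentation describes `area` as a list of `c(top, left, bottom, right)` vectors in points, normally used together with `guess = FALSE`. If the page is A4 portrait (595 x 842 pt, an assumption here), the vector in the call above reads as bottom = 595 and right = 842, which would leave out the lower part of the page. The path `report.pdf` below is hypothetical.

```r
# Sketch only: "report.pdf" is a hypothetical path, and an A4 portrait page
# (595 pt wide, 842 pt high) is assumed.
pdf_path <- "report.pdf"

# area is ordered c(top, left, bottom, right) and wrapped in a list;
# a list of length 1 is reused for every page.
full_page <- list(c(0, 0, 842, 595))

# Only run the extraction when tabulizer is installed and the file exists.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_path)) {
  tables <- tabulizer::extract_tables(
    pdf_path,
    area  = full_page,
    guess = FALSE  # use the supplied area instead of automatic table detection
  )
  str(tables)
}
```

A whole-page area is only a starting point; a tighter box around the small table is more likely to help.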

edited Jan 13 '21 at 08:42

zx8754


asked Apr 07 '17 at 11:13

IKostow

Have you tried the `columns` argument? Anyway, in my experience `tabulizer` is never 100% reliable... – Scarabee Apr 07 '17 at 12:17

I think you should add the `tabula` tag to your question, and you will be a bit more likely to get answers (`tabula` is the Java library used by `tabulizer`). – Scarabee Apr 07 '17 at 12:19

This is an old question, but I recently struggled with a similar issue and found that the pdftools package can be used to locate the `area` input to `extract_tables`, which improves reliability considerably. I wrote a blog post walkthrough: https://redwallanalytics.com/2020/04/07/tabulizer-and-pdftools-r-libraries-as-super-powers-part-2/ – David Lucey Apr 08 '20 at 12:43
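A minimal sketch of that idea, not the code from the blog post: `pdftools::pdf_data()` returns one data frame per page with `x`, `y`, `width`, `height` and `text` columns for each word, and a bounding box around the words of interest can be passed to `extract_tables()` as `area`. The helper `words_to_area()`, the padding value and the toy data are all hypothetical, and it is assumed here that both packages measure in points from the top-left corner of the page.

```r
# Hypothetical helper: turn word boxes (shaped like pdftools::pdf_data() output)
# into a c(top, left, bottom, right) vector for extract_tables().
words_to_area <- function(words, pad = 2) {
  stopifnot(
    all(c("x", "y", "width", "height") %in% names(words)),
    nrow(words) > 0
  )
  c(
    top    = min(words$y) - pad,
    left   = min(words$x) - pad,
    bottom = max(words$y + words$height) + pad,
    right  = max(words$x + words$width) + pad
  )
}

# Toy word boxes standing in for the rows of one small table.
toy <- data.frame(
  x      = c(50, 120, 50),
  y      = c(100, 100, 115),
  width  = c(40, 60, 40),
  height = c(10, 10, 10),
  text   = c("id", "value", "1")
)
area <- words_to_area(toy)
print(area)

# With a real file (path is hypothetical), the same helper would be used as:
# words  <- pdftools::pdf_data("report.pdf")[[1]]
# tables <- tabulizer::extract_tables("report.pdf", pages = 1,
#                                     area = list(unname(words_to_area(words))),
#                                     guess = FALSE)
```

In practice the rows of `words` would first be filtered to the words belonging to the small table, for example by `y` range or by matching header text, before computing the box.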


0 Answers