
I am extracting tables from a PDF and writing them to a CSV file. When I run the code, the tables are not written to the CSV file properly.

Here is my code:

library(tabulizer)

location <- 'http://keic.mica-pps.net/wwwisis/ET_Annual_Reports/Religare_Enterprises_Ltd/RELIGARE-2017-2018.pdf'

out <- extract_tables(location)

for (i in seq_along(out)) {
    # out is a list of matrices, so each table must be extracted with [[ ]]
    write.table(out[[i]], file = 'Output.csv', append = TRUE, sep = ",",
                quote = FALSE)
}

I have enclosed a screenshot of the output file, in which you can see that the tables are incomplete.

Any help would be appreciated.

stefan
Sri Priya
  • Here's another SO Q&A with an alternate approach to extraction after Tabula failures: https://stackoverflow.com/questions/67489987/pdf-scraping-get-company-and-subsidiaries-tables/67658530#67658530 – IRTFM Jun 21 '21 at 22:33

2 Answers


Dealing with PDFs can be very hard and is very specific to the files you have at hand. You will probably need to do a lot of tweaking to get the data into a usable format.

Have a look at this script for an example (https://github.com/b-rodrigues/stats_historiques/blob/master/stats_historiques.R) and at the results (https://twitter.com/brodriguesco/status/1405995811863945223).
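With tabulizer specifically, the usual knobs to tweak are the extraction method and, for stubborn pages, a hand-picked region. A rough sketch of that kind of tweaking (the page number below is a placeholder, not taken from this PDF):

```r
library(tabulizer)

location <- 'http://keic.mica-pps.net/wwwisis/ET_Annual_Reports/Religare_Enterprises_Ltd/RELIGARE-2017-2018.pdf'

# Force the extraction algorithm instead of letting tabulizer guess:
# "lattice" tends to work better for ruled tables,
# "stream" for whitespace-separated ones.
out_lattice <- extract_tables(location, method = "lattice")
out_stream  <- extract_tables(location, method = "stream")

# For pages that still come out mangled, restrict extraction to a region.
# locate_areas() opens an interactive widget for clicking out coordinates.
# areas    <- locate_areas(location, pages = 5)
# out_page <- extract_tables(location, pages = 5, area = areas, guess = FALSE)
```

Comparing the "lattice" and "stream" output side by side usually makes it clear which algorithm suits a given report.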

Marcelo Avila

A common pattern I have seen for this is to bind the list into a single data frame first (adapting it to your variable names):

out_df <- do.call("rbind", lapply(out, as.data.frame))

So it is better to first build a data frame and then use write.csv().
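Putting that together, a minimal runnable sketch (the small `out` list below is a stand-in for the list of matrices that `extract_tables()` returns):

```r
# Stand-in for extract_tables() output: a list of character matrices,
# one per extracted table.
out <- list(
  matrix(c("a", "b", "1", "2"), nrow = 2),
  matrix(c("c", "d", "3", "4"), nrow = 2)
)

# Convert each matrix to a data frame, stack them, and write once.
# Note: rbind() requires every table to have the same number of columns,
# which is often not true for tables pulled from different pages.
out_df <- do.call("rbind", lapply(out, as.data.frame))
write.csv(out_df, file = "Output.csv", row.names = FALSE)
```

If the column counts differ between tables, writing each table to its own file (e.g. `sprintf("Output_%02d.csv", i)` inside a loop) avoids the `rbind()` failure.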

Ethan