I have a PDF file containing text, images, and tables. I want to extract just the tables from that PDF file using either Python or R.
-3
- [Okay](https://stackoverflow.com/help/how-to-ask). – BruceWayne Jan 28 '18 at 06:38
- Welcome to Stack Overflow! You seem to be asking for someone to write some code for you. Stack Overflow is a question and answer site, not a code-writing service. Please [see here](http://stackoverflow.com/help/how-to-ask) to learn how to write effective questions. – Sudheesh Singanamalla Jan 28 '18 at 07:09
- Asking auntie [Google](https://www.google.de/search?q=R+extract+pdf+data&ie=utf-8&oe=utf-8&client=firefox-b&gfe_rd=cr&dcr=0&ei=kn1tWvTjF46g4gTm06q4Bw) helps. – vaettchen Jan 28 '18 at 07:37
2 Answers
2
If you are considering R, I would recommend the tabulizer package.
It is available here and is very easy to use.
To install it, run the following commands:
install.packages("devtools")
devtools::install_github("ropensci/tabulizer")
And using one of their examples:
library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
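# f is the path to your PDF file (here, an example bundled with the package).
out1 <- extract_tables(f)
# Or, even better, say which page the tables are on.
# In current tabulizer releases the return format is set with `output`;
# the original answer passed it as `method`, which now selects the extraction algorithm.
out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")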

Dror Bogin
1
You'll probably find PyPI useful: you can search it for terms such as 'PDF', and it will give you a list of modules relating to PDFs (here). You'll probably want PDF 1.0, judging from its weight on PyPI. This should help you get started!
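
This answer does not name a specific extraction call, so here is a rough sketch of what the Python route might look like. It assumes the third-party `pdfplumber` package from PyPI (my pick for illustration, not the module recommended above) and a hypothetical input file `report.pdf`. The helper names `extract_tables`, `clean_rows`, and `save_tables` are made up for this example. Detection quality depends heavily on how the PDF was produced, so treat it as a starting point.

```python
import csv
from pathlib import Path


def extract_tables(pdf_path):
    """Return a list of (page_number, table) pairs found in the PDF.

    Each table is a list of rows; each row is a list of cell strings (or None).
    """
    # Imported here so the helpers below work without pdfplumber installed.
    import pdfplumber  # assumption: installed with `pip install pdfplumber`

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                tables.append((page_number, table))
    return tables


def clean_rows(table):
    """Replace empty (None) cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]


def save_tables(tables, out_dir):
    """Write each extracted table to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (page_number, table) in enumerate(tables, start=1):
        path = out_dir / f"page{page_number}_table{index}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(clean_rows(table))
        paths.append(path)
    return paths


# Example usage (hypothetical file name):
# save_tables(extract_tables("report.pdf"), "tables_out")
```

Text and images are skipped because only `page.extract_tables()` is called, and writing one CSV per table keeps the results easy to load into pandas or R afterwards.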

Jamie Crosby