-3

I have a PDF file containing text, images, and tables. I want to extract just the tables from that PDF file using either Python or R.

Has QUIT--Anony-Mousse
  • 2
    [Okay](https://stackoverflow.com/help/how-to-ask). – BruceWayne Jan 28 '18 at 06:38
  • 2
    Welcome to Stack Overflow! You seem to be asking for someone to write some code for you. Stack Overflow is a question and answer site, not a code-writing service. Please [see here](http://stackoverflow.com/help/how-to-ask) to learn how to write effective questions. – Sudheesh Singanamalla Jan 28 '18 at 07:09
  • 2
    Asking auntie [Google](https://www.google.de/search?q=R+extract+pdf+data&ie=utf-8&oe=utf-8&client=firefox-b&gfe_rd=cr&dcr=0&ei=kn1tWvTjF46g4gTm06q4Bw) helps – vaettchen Jan 28 '18 at 07:37

2 Answers

2

If you are considering R, I would recommend the tabulizer package.
It is hosted on GitHub under ropensci/tabulizer and is very easy to use. To install it, run the following commands:

install.packages("devtools")
devtools::install_github("ropensci/tabulizer")

And using one of their examples:

library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
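# f is the path to your PDF file; here it points to the example bundled with the package.
out1 <- extract_tables(f)
# Or even better, say which page the tables are on.
# Note: recent tabulizer releases select the return format with `output`;
# older releases used `method = "data.frame"` for this.
out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")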
Dror Bogin
1

You'll probably find PyPI useful: you can search it for terms such as 'PDF', and it will give you a list of modules relating to PDFs. Judging from its weight on PyPI, you'll probably want PDF 1.0. This should help you get started!
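As a rough illustration of what one of those PyPI modules can look like in practice, here is a minimal sketch using pdfplumber. That package is my own pick for the example and is not the one named above; `extract_pdf_tables`, `rows_to_records`, and the `report.pdf` file name are hypothetical names made up for this sketch. It assumes pdfplumber is installed and that the first row of each detected table is a header row, which will not hold for every PDF.

```python
from typing import Dict, List, Optional

# pdfplumber returns each table as a list of rows; a cell is a string or None.
Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the header and return one dict per data row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_pdf_tables(path: str) -> List[List[Row]]:
    """Return every table that pdfplumber detects, page by page."""
    # Imported here so the helper above works without the package installed.
    import pdfplumber  # third-party: pip install pdfplumber

    tables: List[List[Row]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


# Example usage (assumes a file named report.pdf exists):
# for table in extract_pdf_tables("report.pdf"):
#     print(rows_to_records(table))
```

Table detection is heuristic, so expect to check the results by eye and possibly tune the extraction settings for tables without ruling lines.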

Jamie Crosby