Extraction of tables from PDF

Question

I have a PDF file containing text, images and tables. I want to extract just the tables from that PDF file using either Python or R.

Welcome to Stack Overflow! You seem to be asking for someone to write some code for you. Stack Overflow is a question and answer site, not a code-writing service. Please [see here](http://stackoverflow.com/help/how-to-ask) to learn how to write effective questions. — Sudheesh Singanamalla, Jan 28 '18 at 07:09
Asking auntie [Google](https://www.google.de/search?q=R+extract+pdf+data&ie=utf-8&oe=utf-8&client=firefox-b&gfe_rd=cr&dcr=0&ei=kn1tWvTjF46g4gTm06q4Bw) helps — vaettchen, Jan 28 '18 at 07:37

score 2 · Answer 1 · answered Jan 28 '18 at 08:10

If you are considering using R, I would recommend the tabulizer package.
It is available on GitHub (ropensci/tabulizer) and is very easy to use. To install it, run the following commands:

install.packages("devtools")
devtools::install_github("ropensci/tabulizer")

And using one of their examples:

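library("tabulizer")
# Path to the example PDF shipped with the package; replace f with the path to your own file.
f <- system.file("examples", "data.pdf", package = "tabulizer")
# Extract every table tabulizer can detect in the document.
out1 <- extract_tables(f)
# Or, even better, say which page the tables are on and ask for data frames.
# (Recent tabulizer releases name this argument `output`; older ones used `method`.)
out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")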

score 1 · Answer 2 · answered Jan 28 '18 at 06:58

You'll probably find PyPI useful: you can search there for terms like 'PDF', and it will give you a list of modules relating to PDFs. You'll probably want PDF 1.0, judging from its weight on PyPI. This should help you get started!
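
As one possible starting point on the Python side, here is a minimal sketch that assumes the third-party pdfplumber package from PyPI (my choice for illustration, not the module named above). It collects the tables pdfplumber detects on each page and turns each one into CSV text. The helper names `extract_tables` and `table_to_csv` are hypothetical, and detection quality will depend on how the tables are drawn in your PDF.

```python
import csv
import io


def table_to_csv(rows):
    """Serialise one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # pdfplumber reports empty cells as None; write them as empty strings.
        writer.writerow("" if cell is None else cell for cell in row)
    return buffer.getvalue()


def extract_tables(pdf_path):
    """Return the tables pdfplumber detects, as (page_number, rows) pairs."""
    # Assumption: pdfplumber is installed (pip install pdfplumber).
    # Imported here so table_to_csv stays usable without it.
    import pdfplumber

    found = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # Each table is a list of rows; each row is a list of cell strings.
            for rows in page.extract_tables():
                found.append((page_number, rows))
    return found
```

To use it, call `extract_tables("your_file.pdf")` and pass each `rows` value to `table_to_csv`, then write the result to a file or load it into whatever you use for further processing.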
