Just like scraping data off the web, from either HTML or JSON, can the same be done with PDFs using R?

Asked Nov 14 '14 at 04:23

Active Nov 14 '14 at 04:23

Viewed 232 times

I would like to import tables and table-like data from research articles (PDF files) into R.

That's the PDF taken as an example here. Simple tables to start with. I have taken a screenshot of page 6 of the PDF file to illustrate the scenario.

How do I extract that table?


user3563667

1

"How do I extract that table?" With difficulty. 1. Not all text in all PDFs is *extractable* "by default". Some of it may not be text, some of it may not be properly encoded, some of it may be badly encoded. 2. The text ordering may not be the one you expect. 3. There is no concept of 'table' and 'tabs' in a PDF. (FYI, this table in your sample PDF extracts just fine. But that's just a coincidence.) – Jongware Nov 14 '14 at 09:57
1

... 4. Since there is no concept of a 'table' in the PDF specification, how would your imagined scenario recognize that particular sequence of texts on page 6 as a table to extract? – Jongware Nov 14 '14 at 10:05
1

Here is a suggestion to pursue, from sketchy notes I have. "one could use tabula[1] to extract tables from pdfs as csv. [1] http://tabula.nerdpower.org" – lawyeR Nov 14 '14 at 16:42
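
As a rough illustration of the points raised in the comments: since a PDF has no notion of a table, one workable heuristic is to extract the page as plain text and treat runs of two or more spaces as column separators. The sketch below is only that, a sketch. The helper name `parse_text_table`, its `n_cols` argument, and the sample text are all made up for illustration, and it assumes the page text has already been extracted with columns separated by multiple spaces, which will not hold for every PDF.

```r
# Hypothetical helper (not from any package): turn the text of one PDF page
# into a data frame by splitting each line on runs of two or more spaces.
# Assumption: `page_text` is one string with newline-separated lines, such as
# one element of the character vector returned by pdftools::pdf_text().
parse_text_table <- function(page_text, n_cols) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop empty lines

  # Split every line into cells wherever two or more spaces occur
  cells <- strsplit(lines, " {2,}")

  # Keep only lines with the expected number of columns; captions and
  # running text usually split into a different number of pieces
  cells <- cells[lengths(cells) == n_cols]
  if (length(cells) < 2) stop("no table-like rows found")

  mat <- do.call(rbind, cells)           # character matrix, first row = header
  df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- mat[1, ]

  # Convert columns that look numeric; leave the rest as character
  df[] <- lapply(df, function(x) utils::type.convert(x, as.is = TRUE))
  df
}

# Made-up page text standing in for real extracted text
sample_page <- paste(
  "Table 1. Illustrative values",
  "Group      N     Mean",
  "Control    12    3.4",
  "Treated    15    4.1",
  sep = "\n"
)

result <- parse_text_table(sample_page, n_cols = 3)
print(result)
```

With a real file, the input could come from something like `pdftools::pdf_text("article.pdf")[6]` for page 6, assuming the pdftools package is installed; `article.pdf` is a placeholder name. The fixed column count is a deliberate simplification: tables with empty cells, wrapped cells, or single-space column gaps will need a different rule, or a dedicated tool such as the Tabula suggestion above.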

0 Answers