I would like to import tables and table-like data from research articles (PDF files) into R.

Example: http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

That is the PDF used as an example here, with simple tables to start with. I have taken a screenshot of page 6 of the PDF to illustrate the scenario.

How do I extract that table?

user3563667
  • 1
    "How do I extract that table?" With difficulty. 1. Not all text in all PDFs is *extractable* "by default". Some of it may not be text, some of it may not be properly encoded, some of it may be badly encoded. 2. The text ordering may not be the one you expect. 3. There is no concept of 'table' and 'tabs' in a PDF. (FYI, this table in your sample PDF extracts just fine. But that's just a coincidence.) – Jongware Nov 14 '14 at 09:57
  • 1
    ... 4. Since there is no concept of a 'table' in the PDF specification, how would your imagined scenario recognize that particular sequence of texts on page 6 as a table to extract? – Jongware Nov 14 '14 at 10:05
  • 1
    Here is a suggestion to pursue, from sketchy notes I have. "one could use tabula[1] to extract tables from pdfs as csv. [1] http://tabula.nerdpower.org" – lawyeR Nov 14 '14 at 16:42
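
The points raised in the comments (a PDF carries no notion of a table, so rows and columns have to be inferred from the extracted text) can be illustrated with a minimal sketch. It assumes the page text has already been pulled out of the PDF by some extractor and that columns are separated by runs of two or more spaces. The sample text, the function names, and the column count are hypothetical and are not taken from the DESeq vignette; real PDFs will often break these assumptions.

```python
import csv
import io
import re

# Hypothetical text standing in for what a PDF text extractor might return
# for one page. It is NOT copied from the DESeq vignette.
PAGE_TEXT = """\
Some paragraph text before the table.
id      baseMean    log2FoldChange    pval
geneA   10.5        1.20              0.03
geneB   200.0       -0.75             0.40
A caption or paragraph after the table.
"""


def split_columns(line):
    """Split one line into cells on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


def extract_table(text, n_cols):
    """Keep only the lines that split into exactly n_cols cells.

    This is a heuristic: the PDF itself does not mark where a table
    starts or ends, so the expected column count is an assumption.
    """
    rows = [split_columns(line) for line in text.splitlines()]
    return [row for row in rows if len(row) == n_cols]


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = extract_table(PAGE_TEXT, n_cols=4)
csv_text = to_csv(rows)
print(csv_text)
```

Written to a file, the resulting CSV could then be loaded in R with `read.csv`. Prose lines that happen to split into the expected number of cells would be picked up by mistake, which is one reason a dedicated tool such as tabula may be the more practical route.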
