0

I have a PDF document.

It contains data in tabular format. I want to extract the data into a comma-delimited text file (CSV), with commas as the column delimiters.

Any suggestions?

Artjom B.
user3079559

1 Answer

4

Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction the syntax makes is between vector elements (lines, fills, ...), images, and text.

Whether a given character is part of a table, part of a line of text, or just a lone character in an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.

For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

Why Updating Dollars for Docs Was So Difficult (ProPublica website)

Having said the above, let me now add this:

Tabula is written in Ruby.


Update

Here is an asciinema screencast (which you can also download and replay locally in your Linux/Mac OS X/Unix terminal with the help of the asciinema command-line tool), starring tabula-extractor:

asciicast

Kurt Pfeifle