Extract data from PDF document

Question

I have a PDF document.

It contains data in tabular format. I want to extract that data into a text file that uses commas as column delimiters (CSV).

Any suggestions?

You can try Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs. Or you can explore more like itextpdf, etc. — Abubakkar, Apr 15 '15 at 07:42
You may try Algodocs: https://algodocs.com. It works perfectly for PDFs with tables that even span to hundreds of pages. — Zhavat, Feb 10 '21 at 23:21

Kurt Pfeifle · Accepted Answer · 2015-07-01T03:50:43.517

Standard PDFs do not provide any hints about the semantics of what they draw on a page: the only distinction the syntax makes is between vector elements (lines, fills, ...), images, and text.

Whether any character is part of a table or part of a line or just a lonely, single character within an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
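To illustrate why this is guesswork, here is a minimal Python sketch. It assumes some PDF parser has already handed you text fragments with page coordinates; the sample `FRAGMENTS` data, the helper names `fragments_to_rows` and `rows_to_csv`, and the `y_tolerance` value are all hypothetical and not taken from any particular library. It guesses rows by grouping fragments with similar y positions, orders cells by x, and writes CSV with proper quoting.

```python
import csv
import io

# Hypothetical fragments as (x, y, text); a real PDF parser would supply these.
# PDF coordinates usually have y growing upward, so a larger y is higher on the page.
FRAGMENTS = [
    (50.0, 700.2, "Name"),
    (200.0, 700.0, "Amount"),
    (50.0, 680.1, "Smith, John"),
    (200.0, 680.0, "1200"),
    (200.0, 660.0, "85"),
    (50.0, 660.3, "Doe, Jane"),
]


def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows by similar y, then order each row by x."""
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))  # close enough: same row
        else:
            rows.append([y, [(x, text)]])  # start a new row
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_to_csv(rows):
    """Serialize rows as CSV; the csv module quotes cells containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(fragments_to_rows(FRAGMENTS)), end="")
```

Even on this tidy input the result depends on a hand-picked tolerance. An empty cell would silently shift the following values one column to the left, and a cell whose text wraps onto a second line would be split into two rows, because nothing in the page content says where a cell begins or ends.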

For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:

Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)

Having said the above, let me add this:

For an amazing family of open-source tools for extracting tabular data from PDFs (unless they are scanned pages), one that gets better and better from week to week -- contradicting what I said in my introductory paragraphs! -- check out TabulaPDF. See these links:

Tabula is written in Ruby.

Update

Here is an asciinema screencast (which you can also download and replay locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command-line tool), starring tabula-extractor:
