I am trying to extract tabular data from text-based PDFs. The PDFs come in different formats, so I need a generalised solution. I came across a library named "pdftabextract" for this task, but it was designed for scanned documents.

I want to use it for my text-based PDFs, but I don't know how to do that.

Article link: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/

The above article shows a step-by-step approach, but I don't know how to apply it to text-based PDFs. Please help.

  • Take a look at [tabula-py](https://pypi.org/project/tabula-py/). – RJ Adriaansen Jul 12 '21 at 09:59
  • As @RJAdriaansen says, tabula-py or pdfminer. Both have positives and negatives, and might work depending on the PDF, so a generalized solution may be a challenge. The trouble is that the PDF language allows the creation of what look like tables to the human eye, but the data is not arranged internally in a tabular fashion. Another approach might be opening each PDF in Word and then saving as Text: sometimes the Word convertor does a better job. I've asked a similar question (with some code): https://stackoverflow.com/questions/68228128/handling-blank-cells-when-importing-pdf-tables-as-text – DS_London Jul 12 '21 at 10:49
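
To illustrate the point in the comments that PDF text is positioned on the page rather than stored as a table, here is a minimal Python sketch, not a definitive solution. The function `group_words_into_rows`, the `y_tolerance` parameter and the sample coordinates are all hypothetical; it assumes you have already obtained `(x, y, text)` word positions from some extractor (pdfminer, for example) and that y grows upward, as in PDF page coordinates. `read_tables_with_tabula` shows the tabula-py route suggested above; it needs tabula-py and Java installed and may or may not detect the tables in a given PDF.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) word positions into table rows.

    Words whose y differs from the first word of the current row by at most
    y_tolerance are treated as the same row; each row is then ordered by x.
    """
    rows = []  # each entry: (y of the row's first word, [(x, text), ...])
    # Top of the page first (largest y), then left to right.
    for x, y, text in sorted(words, key=lambda w: (-w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def read_tables_with_tabula(pdf_path):
    """Let tabula-py detect tables; returns a list of pandas DataFrames."""
    import tabula  # third-party: pip install tabula-py (requires Java)
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


# Hypothetical word positions, as an extractor might report them.
sample_words = [
    (50, 700.0, "Name"), (200, 700.5, "Qty"),
    (50, 680.0, "Apple"), (200, 679.2, "3"),
    (200, 660.0, "7"), (50, 660.4, "Pear"),
]
table = group_words_into_rows(sample_words)
for row in table:
    print(row)
```

Row grouping alone does not handle blank cells: if a cell is empty, the remaining words in that row shift left. Keeping the x positions and assigning each word to the nearest column boundary is one way to deal with that, and it is the kind of per-layout tuning that makes a fully general solution hard.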

0 Answers