
I'm trying to extract all the tables contained in a PDF document (about 250 pages). Extraction itself is not the problem; identifying the tables is. My current approach also picks up junk such as the table of contents and sometimes bullet points, which I don't want. I specifically want only tables with grid lines.

from PyPDF2 import PdfFileReader
from tabula import read_pdf

pages_required = []
# Count the pages, then probe each page separately with tabula.
with open("input.pdf", mode="rb") as f:
    n = PdfFileReader(f).getNumPages()
for page in [str(i + 1) for i in range(n)]:
    df = read_pdf("input.pdf", pages=page)
    # Keep the page if tabula returned anything at all.
    if df is not None:
        pages_required.append(page)
print(pages_required)

This filters out pages to an extent, but not completely. I need a list of only those page numbers that contain tables with grid lines. Is there a way to do this?
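One direction that might narrow the results is tabula's lattice mode, which uses ruling lines to delimit cells. The sketch below assumes a tabula-py version where `read_pdf` accepts `lattice=True` and `multiple_tables=True` and returns a list of DataFrames. The function name `pages_with_ruled_tables`, the injectable `extract` parameter, and the `min_rows`/`min_cols` size filter are hypothetical, added only for illustration; the thresholds would need tuning against the actual PDF.

```python
def pages_with_ruled_tables(pdf_path, n_pages, extract=None, min_rows=2, min_cols=2):
    """Return 1-based page numbers where lattice-mode extraction finds a table."""
    if extract is None:
        # Imported lazily so the helper can be exercised without tabula/Java.
        from tabula import read_pdf

        def extract(path, page):
            # lattice=True asks tabula to rely on ruling lines between cells.
            return read_pdf(path, pages=page, lattice=True, multiple_tables=True)

    pages = []
    for page in range(1, n_pages + 1):
        tables = extract(pdf_path, page) or []
        # Assumed heuristic: ignore empty or single-cell results.
        if any(t.shape[0] >= min_rows and t.shape[1] >= min_cols for t in tables):
            pages.append(page)
    return pages
```

Called as `pages_with_ruled_tables("input.pdf", n)`, with `n` taken from the page count above, this would return integers rather than strings. Whether lattice mode rejects every contents page or bullet list depends on how the PDF draws its lines, so the result is worth spot-checking.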

Mehul Verma
  • Difficult to help without seeing an example of your target PDFs. – Peter Leimbigler Sep 28 '18 at 15:12
  • https://on.tcs.com/AnnualReport2018 – Mehul Verma Sep 29 '18 at 04:57
  • Have you tried https://camelot-py.readthedocs.io/en/master/user/advanced.html? It looks like it would fit your case, @MehulVerma. – Arpit Solanki Nov 09 '18 at 16:03
  • [This](https://medium.com/analytics-vidhya/how-to-extract-multiple-tables-from-a-pdf-through-python-and-tabula-py-6f642a9ee673) and/or [this](https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754) may be of help. – Gonçalo Peres May 27 '21 at 16:01
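Following the Camelot suggestion in the comments, a minimal sketch of page filtering with its lattice flavor might look like this. It assumes `camelot.read_pdf(path, pages=..., flavor="lattice")` returns an iterable of tables, each with a `parsing_report` dict containing `page` and `accuracy`. The function name, the injectable `read` parameter, and the `min_accuracy` threshold of 80 are hypothetical choices, not values from the Camelot documentation.

```python
def pages_with_lattice_tables(pdf_path, page_spec="1-end", min_accuracy=80.0, read=None):
    """Return sorted page numbers on which Camelot's lattice flavor finds a table."""
    if read is None:
        # Imported lazily so the helper can be exercised without Camelot installed.
        import camelot
        read = camelot.read_pdf

    # The lattice flavor detects tables from drawn lines rather than whitespace.
    tables = read(pdf_path, pages=page_spec, flavor="lattice")

    # Assumed heuristic: drop low-accuracy detections, then collect unique pages.
    pages = {
        t.parsing_report["page"]
        for t in tables
        if t.parsing_report["accuracy"] >= min_accuracy
    }
    return sorted(pages)
```

Processing all 250 pages in one call may be slow, so passing a narrower `page_spec` such as `"1-20"` could be a reasonable way to try it out first.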

0 Answers