Extracting Tables from PDFs Using Tabula

Question

I came across a great library called Tabula and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to the documentation, you can specify the page area you want to extract from. However, the useless area is only on the first page of my PDF file, so if I apply that area to every page, Tabula will miss the top section of all subsequent pages. Is there a way to make the area condition apply only to the first page of the PDF?

from tabula import read_pdf

df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')

score 4 · Answer 1 · edited Jan 18 '18 at 16:19

I'm trying to work on something similar (parsing bank statements) and had the same issue. The only way to solve this I have found so far is to parse each page individually.

The only problem is that this requires knowing in advance how many pages your file has. So far I have not found a way to do this directly with Tabula, so I've decided to use the pyPdf module to get the number of pages.

# The original answer used pyPdf, which is Python 2 only; pypdf is its
# maintained successor (pip install pypdf).
from pypdf import PdfReader
from tabula import read_pdf

path = r"C:\Users\riley\Desktop\Bank Statements\50340.pdf"
n = len(PdfReader(path).pages)  # number of pages in the file

df = []
for page in range(1, n + 1):
    if page == 1:
        # Restrict extraction to the table area on the first page only
        df.append(read_pdf(path, area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf(path, pages=page))

Note that there are some known, still-open issues where reading each page individually gives different results from reading all pages at once.
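
The per-page loop above can also be split so that the page-specific options are built separately from the call to Tabula. This is only a minimal sketch, not part of the original answer: `build_page_options` and `read_pages` are hypothetical helper names, and the `reader` argument is assumed to be `tabula.read_pdf` (or anything accepting the same `pages` and `area` keyword arguments).

```python
def build_page_options(n_pages, first_page_area):
    """Return one dict of keyword arguments per page (hypothetical helper).

    Only page 1 gets the `area` restriction; later pages are read in full.
    """
    options = []
    for page in range(1, n_pages + 1):
        kwargs = {"pages": page}
        if page == 1:
            kwargs["area"] = first_page_area
        options.append(kwargs)
    return options


def read_pages(path, n_pages, first_page_area, reader):
    """Call reader(path, **kwargs) once per page and collect the results.

    `reader` is assumed to behave like tabula.read_pdf.
    """
    return [
        reader(path, **kwargs)
        for kwargs in build_page_options(n_pages, first_page_area)
    ]
```

With tabula-py installed, this would be called as read_pages(path, n, (530, 12.75, 790.5, 561), read_pdf). Passing the reader in as an argument keeps the page logic easy to check without a PDF at hand.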

Good luck!

08/03/2017 EDIT:

Found a simpler way to count the pages of the PDF without going through pyPdf:

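import re

def count_pdf_pages(file_path):
    # The file is read as bytes, so the pattern must be a bytes pattern too.
    # Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
    rxcountpages = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))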

where file_path is the path to your PDF file.
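
To see what the pattern does and does not count, here is a small sketch that applies the same regular expression to an in-memory byte string instead of a file. `count_pages_in_bytes` and the sample bytes are made up for illustration. One caveat, as far as I know: PDFs that store their page objects inside compressed object streams will not expose `/Type /Page` as plain text, so this kind of scan can undercount on such files.

```python
import re

# Same pattern as above, as a bytes pattern because PDF data is binary.
PAGE_RE = re.compile(rb"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)


def count_pages_in_bytes(data):
    """Count '/Type /Page' entries in raw PDF bytes (hypothetical helper)."""
    return len(PAGE_RE.findall(data))


# Hand-written fragment, not a valid PDF: one page-tree node, two pages.
sample = (
    b"<< /Type /Pages /Count 2 >>\n"
    b"<< /Type /Page /Parent 1 0 R >>\n"
    b"<</Type/Page>>\n"
)
page_count = count_pages_in_bytes(sample)
```

The `/Type /Pages` entry is skipped because the character after `/Page` is an `s`, which the `[^s]` part of the pattern rejects.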

Getting error `ModuleNotFoundError: No module named 'pdf'`. – Piyush S. Wanare, Jul 24 '18 at 14:29

score 2 · Answer 2 · answered Mar 16 '19 at 21:14

The code below may help. It reads every table in the PDF and writes each one to its own Excel file.

from tabula import read_pdf

# multiple_tables=True returns a list of DataFrames, one per detected table
tables = read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# Write each table to its own file: output1.xlsx, output2.xlsx, ...
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)

score 1 · Answer 3 · edited Dec 23 '19 at 17:18


The parameter `guess=False` will solve the problem.

answered Dec 23 '19 at 16:29 by mikhael · edited Dec 23 '19 at 17:18 by double-beep

dataninsight · Answer 4 · 2021-11-24T17:36:54.693


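# pip install tabula-py tabulate pypdf
# (pypdf is an added assumption, used here only to get the page count)
from pypdf import PdfReader
from tabula import read_pdf
from tabulate import tabulate

n_pages = len(PdfReader("abc.pdf").pages)  # "abc.pdf" is the path to the PDF
# Skip page 1: read pages 2 through the last page
tables = read_pdf("abc.pdf", pages=list(range(2, n_pages + 1)), multiple_tables=True)
for df in tables:
    print(tabulate(df))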

Parameters:

pages (str, int, list of int, optional): The pages to extract from. Accepts a str, an int, or a list of int. Default: 1

Examples

'1-2,3', 'all', [1,2]

Since the first page is useless, drop it and read from the second page up to the last page.
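
Because `pages` also accepts range strings such as '1-2,3', the "everything after the first page" selection can be written as a single string once the page count is known. The helper below is a hypothetical sketch (the name `pages_from` is made up here), and it assumes the page count has already been obtained by some other means.

```python
def pages_from(start, n_pages):
    """Build a page-range string such as '2-5' (hypothetical helper).

    `start` and `n_pages` are 1-based; the range includes both ends.
    """
    if start < 1 or start > n_pages:
        raise ValueError("start must be between 1 and n_pages")
    if start == n_pages:
        # A single page needs no range syntax
        return str(start)
    return f"{start}-{n_pages}"


# For a five-page file, skip page 1 and read the rest
page_spec = pages_from(2, 5)
```

The resulting string would then be passed as read_pdf("abc.pdf", pages=page_spec).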
