I'm working on something similar (parsing bank statements) and had the same issue. The only solution I have found so far is to parse each page individually.
The only problem is that this requires knowing in advance how many pages your file has. So far I have not found a way to do this directly with Tabula, so I decided to use the pyPdf module to get the number of pages:
    import pyPdf
    from tabula import read_pdf

    # Raw string, so the backslashes in the Windows path are not read as escapes.
    pdf_path = r"C:\Users\riley\Desktop\Bank Statements\50340.pdf"

    # Count the pages with pyPdf.
    with open(pdf_path, mode="rb") as pdf_file:
        n = pyPdf.PdfFileReader(pdf_file).getNumPages()

    df = []
    for page in range(1, n + 1):
        if page == 1:
            # On the first page, only parse the given area (top, left, bottom, right).
            df.append(read_pdf(pdf_path, area=(530, 12.75, 790.5, 561), pages=page))
        else:
            df.append(read_pdf(pdf_path, pages=page))
Note that there are some known, still-open issues both when reading each page individually and when reading all pages at once.
Good luck!
08/03/2017 EDIT:
I found a simpler way to count the pages of the PDF without going through pyPdf:
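    import re

    def count_pdf_pages(file_path):
        # Count "/Type /Page" objects; "[^s]" skips the "/Type /Pages" tree nodes.
        # The pattern is bytes because the file is read in binary mode.
        rxcountpages = re.compile(br"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)
        with open(file_path, "rb") as temp_file:
            return len(rxcountpages.findall(temp_file.read()))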
Here file_path is the path to your PDF file.
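To tie the two parts of this answer together, here is a minimal sketch of the per-page loop driven by the regex page count instead of pyPdf. The names read_pages, read_page and first_page_area are hypothetical helpers I made up for illustration; read_page is meant to be tabula's read_pdf, passed in as an argument so the loop itself does not depend on Tabula.

```python
import re

# Same pattern as above: bytes pattern, matches "/Type /Page" but not "/Type /Pages".
PAGE_RE = re.compile(br"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)


def count_pdf_pages(file_path):
    """Heuristic page count based on a regex scan of the raw file."""
    with open(file_path, "rb") as pdf_file:
        return len(PAGE_RE.findall(pdf_file.read()))


def read_pages(file_path, read_page, first_page_area=None):
    """Call read_page once per page and collect the results in a list.

    read_page is assumed to accept (path, pages=..., area=...) keyword
    arguments the way tabula's read_pdf does.
    """
    tables = []
    for page in range(1, count_pdf_pages(file_path) + 1):
        kwargs = {"pages": page}
        # Only the first page gets the restricted area, as in the loop above.
        if page == 1 and first_page_area is not None:
            kwargs["area"] = first_page_area
        tables.append(read_page(file_path, **kwargs))
    return tables
```

With Tabula installed, this would be called roughly as read_pages(pdf_path, read_pdf, first_page_area=(530, 12.75, 790.5, 561)). Keep in mind that the regex count is only a heuristic: in PDFs that store their objects in compressed object streams the page objects are not visible as plain text, so the count can come out too low, and a real PDF library is the safer choice there.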