I am currently using tabula.read_pdf() to extract tables from a PDF. However, the result carries no information about which page each table comes from. One workaround is to get the total number of pages and iterate over them, passing each page number in the pages argument of tabula.read_pdf(). This, however, is extremely inefficient. The timings below illustrate the problem, using this example PDF: http://www.annualreports.com/HostedData/AnnualReports/PDF/NASDAQ_AMZN_2019.pdf
%%time
for i in range(1,88):
    tables = read_pdf(pdf_path, pages=i, stream=True)
# CPU times: user 803 ms, sys: 686 ms, total: 1.49 s
# Wall time: 3min 4s
%%time
tables = read_pdf(pdf_path, pages='all', stream=True)
# CPU times: user 402 ms, sys: 171 ms, total: 573 ms
# Wall time: 21.2 s
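
One possible way to keep the single pages='all' call and still recover page numbers is to request JSON output instead of DataFrames. The sketch below is unverified against this PDF. It assumes that the installed tabula-java emits a page_number field in each JSON table, which may depend on the version and is worth checking on your own output. The helper names (rows_from_json_table, tables_with_pages, read_tables_with_pages) are made up for illustration; only read_pdf and its output_format="json" option come from tabula-py.

```python
def rows_from_json_table(table):
    """Flatten one JSON table into a list of rows of cell text."""
    return [
        [cell.get("text", "") for cell in row]
        for row in table.get("data", [])
    ]


def tables_with_pages(json_tables):
    """Pair each table with its page number.

    ASSUMPTION: each table dict has a "page_number" key. If the key
    is missing, None is stored instead of raising an error.
    """
    return [
        (table.get("page_number"), rows_from_json_table(table))
        for table in json_tables
    ]


def read_tables_with_pages(pdf_path):
    """Read every table in a single call and tag each with its page."""
    # Imported here so the helpers above work without tabula installed.
    from tabula import read_pdf

    json_tables = read_pdf(
        pdf_path, pages="all", stream=True, output_format="json"
    )
    return tables_with_pages(json_tables)
```

Calling read_tables_with_pages(pdf_path) would return a list of (page, rows) pairs, and each rows value can be passed to pandas.DataFrame if a DataFrame is needed. If page comes back as None for every table, the assumption about page_number does not hold for that installation and the per-page loop remains the fallback.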