
I am currently using tabula.read_pdf() to extract tables from a PDF. However, there is no information about which page each table comes from. One workaround is to get the total number of pages and iterate over them one at a time, passing each page number to the pages argument of tabula.read_pdf(). This, however, is extremely inefficient. Below is a timing comparison of the two approaches, using this example PDF: http://www.annualreports.com/HostedData/AnnualReports/PDF/NASDAQ_AMZN_2019.pdf

from tabula import read_pdf
pdf_path = "NASDAQ_AMZN_2019.pdf"  # the example PDF, downloaded locally

%%time
for i in range(1, 88):  # iterate pages 1-87 one at a time
    tables = read_pdf(pdf_path, pages=i, stream=True)
# CPU times: user 803 ms, sys: 686 ms, total: 1.49 s
# Wall time: 3min 4s

%%time
tables = read_pdf(pdf_path, pages='all', stream=True)
# CPU times: user 402 ms, sys: 171 ms, total: 573 ms
# Wall time: 21.2 s
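
For reference, a minimal sketch of that per-page workaround (reusing pdf_path from above); it makes the page association explicit by keying the results on the page number, but still pays the per-page cost shown in the first timing:

tables_by_page = {}
for page in range(1, 88):  # pages 1-87 of the example PDF
    page_tables = read_pdf(pdf_path, pages=page, stream=True)
    if page_tables:  # skip pages with no detected tables
        tables_by_page[page] = page_tables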
Stanley Gan
  • Could you please explain more why it is inefficient? – Prefect May 14 '20 at 19:57
  • Hi lammuratc, I just edited my question and added more details about the time taken for iterating each page vs. using pages='all'. – Stanley Gan May 14 '20 at 20:11
  • I see your point. I don't know much about libraries for pdf files. Can't you just iterate the pdf once and save all the tables? Or maybe you can use another language, you know Python is not the fastest :) – Prefect May 14 '20 at 20:45

1 Answer


You can use camelot instead of tabula.

One cool feature of Camelot is that you also get a “parsing report” for each table giving an accuracy metric, the page the table was found on, and the percentage of whitespace present in the table.

import camelot

file = "your_file_path"
tables = camelot.read_pdf(file, pages="1-end")
# get the table at index 3 (i.e. the fourth table found)
tables[3].df
# the parsing report includes the page the table was found on
tables[3].parsing_report
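
Building on that, here is a minimal sketch (assuming the tables object from above and the "page" key that Camelot's parsing_report exposes) that groups every extracted table by the page it was found on:

from collections import defaultdict

tables_by_page = defaultdict(list)
for table in tables:
    page = table.parsing_report["page"]  # page the table was found on
    tables_by_page[page].append(table.df)

for page, dfs in sorted(tables_by_page.items()):
    print(f"page {page}: {len(dfs)} table(s)")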

Reference: http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/

marc_s