I have a PDF file with 40 tables spread across different pages. I want to extract each table along with the number of the page it came from.

I have tried to use this code:

import camelot

tables = camelot.read_pdf('2003.pdf', flavor='stream', pages='8,9,10,14,15,18,24...', edge_tol=500, flag_size=True)
for i in range(tables.n):
    tables[i].to_csv(f"2003/Report2003tab{i+1}_page.csv")

The output is:

Report2003tab1_page.csv
Report2003tab2_page.csv
.
.

But I want to have output like this:

Report2003tab1_page8.csv
Report2003tab2_page9.csv
Report2003tab3_page10.csv
Report2003tab4_page14.csv

How can I also include the page numbers in the output?

mkrieger1

1 Answer

As mentioned in the Camelot Quickstart guide:

Now, we have a TableList object called tables, which is a list of Table objects. We can get everything we need from this object.

[…]

Let's print the parsing report.

print(tables[0].parsing_report)
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}

So you should be able to use:

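for i in range(tables.n):
    table = tables[i]
    # The parsing report records the page the table was found on
    page = table.parsing_report['page']
    # A forward slash avoids the invalid "\R" escape and also works on Windows
    table.to_csv(f"2003/Report2003tab{i+1}_page{page}.csv")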
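If it helps, the same idea can be wrapped in a small helper. This is only a sketch: `csv_path` and `export_tables` are hypothetical names (not part of Camelot), the file naming follows the pattern from the question, and it assumes each table exposes `parsing_report['page']` and `to_csv(path)` as shown in the Quickstart excerpt. It also assumes the table collection can be iterated with `enumerate`.

```python
from pathlib import Path


def csv_path(out_dir, table_number, page):
    """Build a name such as <out_dir>/Report2003tab1_page8.csv."""
    return Path(out_dir) / f"Report2003tab{table_number}_page{page}.csv"


def export_tables(tables, out_dir="2003"):
    """Write every table to its own CSV file and return the paths written.

    `tables` is assumed to be an iterable of objects that provide
    `parsing_report['page']` and `to_csv(path)`.
    """
    # Create the output folder if it does not exist yet
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for number, table in enumerate(tables, start=1):
        # The parsing report carries the source page of the table
        page = table.parsing_report["page"]
        path = csv_path(out_dir, number, page)
        table.to_csv(str(path))
        written.append(path)
    return written
```

Here `tables` would be the object returned by `camelot.read_pdf(...)` in the question, so `export_tables(tables)` should produce files named like `Report2003tab1_page8.csv` in the `2003` folder. Building the path with `pathlib` sidesteps the backslash issue entirely.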
mkrieger1