I would like to extract tables from a multi-page PDF. Because of how the table is laid out, I need to pass flavor='stream' and table_areas to read_pdf for the table to be detected properly. My problem is that the position of the table differs from page to page (the first page has an address header and the others do not).

I have tried passing several areas to the read_pdf function as follows:

camelot.read_pdf(file, pages='all', flavor='stream', table_areas=['60, 740, 580, 50','60, 470, 580, 50'])

but this results in two tables per page. How can I specify the table_areas for each page separately?

I have also tried calling read_pdf several times with different pages/table_areas; however, I then cannot append the results together into a single object:

tables = camelot.read_pdf(file, pages='1', flavor='stream', table_areas=['60, 470, 580, 50'])
tables.append(camelot.read_pdf(file, pages='2-end', flavor='stream', table_areas=['60, 740, 580, 50']))

This raises an error because append is not a method of the object returned by read_pdf.

Is there a way to concatenate the results of several calls to the read_pdf function?

Oneira

2 Answers

1

Actually, as you noticed, you can't add items directly to the TableList object.

Instead, you can manipulate the TableList's _tables attribute (_tables is a plain list) in the following way:

my_tables = camelot.read_pdf(file, pages='1', flavor='stream', table_areas=['60, 470, 580, 50'])
my_tables._tables.append(camelot.read_pdf(file, pages='2-end', flavor='stream', table_areas=['60, 740, 580, 50']))

Now my_tables should consist of two tables.
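
If a plain Python list of tables is enough, a similar result may be possible without touching the private _tables attribute. The sketch below is only an illustration: read_tables_per_page and its areas_by_pages argument are hypothetical names, not part of camelot, and it assumes the object returned by read_pdf can be iterated table by table. The reader is injectable so the helper can be tried out without a PDF.

```python
from itertools import chain


def read_tables_per_page(path, areas_by_pages, read_pdf=None, **kwargs):
    """Call the reader once per page spec and flatten the results into one list.

    `areas_by_pages` maps a pages string (e.g. '1', '2-end') to the list of
    table_areas to use for those pages. Hypothetical helper, not a camelot API.
    """
    if read_pdf is None:
        # Imported lazily so the helper can be exercised with a fake reader.
        import camelot
        read_pdf = camelot.read_pdf

    results = [
        read_pdf(path, pages=pages, table_areas=areas, **kwargs)
        for pages, areas in areas_by_pages.items()
    ]
    # Assumption: each result yields its tables one by one when iterated.
    return list(chain.from_iterable(results))
```

With the values from the question, a call could look like read_tables_per_page(file, {'1': ['60, 470, 580, 50'], '2-end': ['60, 740, 580, 50']}, flavor='stream'). The return value is an ordinary list, so TableList-specific methods are not available on it.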

0

The answer above works, but because read_pdf returns a TableList rather than a single table, you need

my_tables._tables.extend(...)

instead of

my_tables._tables.append(...)
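
The difference can be seen with ordinary lists. The sketch below uses FakeTableList, a hypothetical stand-in that only mimics the relevant shape of a TableList (a wrapper around a list stored in _tables); it is not camelot's actual class, and the table names are placeholders.

```python
class FakeTableList:
    """Hypothetical stand-in for a TableList: wraps a list in `_tables`."""

    def __init__(self, tables):
        self._tables = tables

    def __len__(self):
        return len(self._tables)

    def __getitem__(self, idx):
        return self._tables[idx]


first = FakeTableList(["page1_table"])
rest = FakeTableList(["page2_table", "page3_table"])

# append adds the whole container as ONE element, giving a nested result.
appended = list(first._tables)
appended.append(rest)

# extend iterates over the container and adds each table individually.
extended = list(first._tables)
extended.extend(rest)

print(len(appended))  # 2: one table plus one nested container
print(extended)       # three tables in a flat list
```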

PSAfrance