3

I am trying to build an app that reads arbitrary PDFs and extracts tables from them, and I am using Camelot for the extraction. This works fine for tables whose cells hold single-line values. For cells with multi-line values, however, Camelot splits the text of a single cell across multiple cells. Since Camelot is built on top of pdfminer, I tried tweaking the layout analysis parameters (specifically line_margin) to stop Camelot from splitting the lines, but the issue remains.

What other parameters can I tweak to handle this issue? Here is an example of a table that has this issue: [image]

I do not want to use the 'lattice' flavor as most of the tables that I expect to see do not have demarcating lines.

Rohit Gavval

1 Answer

2

If your PDF tables have lines that are brighter than the cells, as in your example, then you might try the lattice flavour with process_background=True.

import camelot
tables = camelot.read_pdf('background_lines.pdf', process_background=True)

See https://camelot-py.readthedocs.io/en/master/user/advanced.html
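If the stream flavour has to stay, the same advanced-usage page describes a row_tol parameter that groups vertically close text into one row, which may be worth trying. Failing that, a different technique is to merge the split rows after extraction. The following is only a minimal sketch of that post-processing idea, not something Camelot provides: merge_continuation_rows is a hypothetical helper, the sample rows are made up, and it assumes a continuation row can be recognised by an empty first column, which will not hold for every table. The rows are plain lists, as you might get from something like tables[0].df.values.tolist().

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold rows whose key column is empty into the row above them.

    Hypothetical helper: assumes every real row has a non-empty value in
    key_col, so a row with an empty key_col is the continuation of a
    multi-line cell from the previous row.
    """
    merged = []
    for row in rows:
        cells = [str(c).strip() for c in row]
        if merged and not cells[key_col]:
            prev = merged[-1]
            # Append each non-empty fragment to the matching cell above.
            for i, text in enumerate(cells):
                if text:
                    prev[i] = f"{prev[i]}{sep}{text}" if prev[i] else text
        else:
            merged.append(cells)
    return merged


# Made-up sample data imitating a table whose second row was split in two.
rows = [
    ["ID", "Description", "Qty"],
    ["1", "Stainless steel", "4"],
    ["", "mounting bracket", ""],
    ["2", "Hex bolt", "12"],
]

result = merge_continuation_rows(rows)
for row in result:
    print(row)
```

If the key column is not the first one in your tables, pass a different key_col. Tables where every column can legitimately be empty would need a different heuristic.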

Angoose