How can I stop camelot-py from splitting multi-line text in a single cell into multiple cells?

Question

I am trying to build an app that reads arbitrary PDFs and extracts tables from them, and I am using Camelot for the extraction. This works fine for tables whose cells hold single-line values. However, for tables whose cells hold multi-line values, Camelot splits the multi-line text of a single cell into multiple cells. Since Camelot is built on top of pdfminer, I tried tweaking the layout analysis parameters (specifically line_margin) to stop Camelot from splitting the lines, but the issue remains.

What other parameters can I tweak to handle this issue? Here is an example of a table that has this issue.

I do not want to use the 'lattice' flavor as most of the tables that I expect to see do not have demarcating lines.

In my experience, with 'stream' flavor, each line becomes a row. — Stefano Fiorucci - anakin87, May 11 '20 at 07:05
Yes, that behavior is causing the problem. Is there a way to override the behavior? — Rohit Gavval, May 12 '20 at 07:23
@RohitGavval any luck on this? I am having the same problem. — Pramesh Bajracharya, Jan 05 '21 at 08:07

score 2 · Answer 1 · answered Mar 29 '21 at 09:29

If the tables in your PDF have lines that are brighter than the cells, as in your example, then you might try the lattice flavour with process_background=True.

import camelot
tables = camelot.read_pdf('background_lines.pdf', process_background=True)

See https://camelot-py.readthedocs.io/en/master/user/advanced.html
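If lattice is not an option, one possible workaround (not from the Camelot docs, just a sketch) is to accept that the stream flavor emits one row per text line and merge the continuation rows afterwards. Camelot's stream flavor also takes a row_tol argument for grouping text vertically into rows, which may be worth trying first. The helper below is hypothetical and assumes that the first line of every logical row has a non-empty value in a key column (here column 0), so any row with a blank key is treated as a continuation of the row above it. It works on a plain list of rows, which could be obtained from a Camelot table with something like tables[0].df.values.tolist().

```python
def merge_continuation_rows(rows, key_col=0):
    """Fold rows whose key column is blank into the preceding row.

    Assumes all rows have the same number of cells (as rows taken from a
    DataFrame do) and that a blank key cell marks a continuation line.
    """
    merged = []
    for row in rows:
        cells = [str(c).strip() for c in row]
        if merged and not cells[key_col]:
            # Continuation line: append each non-empty cell to the row above.
            prev = merged[-1]
            for i, text in enumerate(cells):
                if text:
                    prev[i] = f"{prev[i]} {text}".strip()
        else:
            # New logical row (or the very first row, even if its key is blank).
            merged.append(cells)
    return merged


# Made-up sample data, shaped like stream output for a two-line cell.
rows = [
    ["1", "Widget", "Small part used"],
    ["", "", "in assembly A"],
    ["2", "Gadget", "Large part"],
]
result = merge_continuation_rows(rows)
print(result)
```

This heuristic breaks down when the key column itself can be legitimately empty or can wrap onto a second line, so the choice of key_col needs checking against the actual tables.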
