
I am trying to load a large table (an example is attached) from a Form 10-K into Python using tabula-py. The table does not have clear borders and has a lot of blank cells, which causes several issues.

My code is

import tabula
df = tabula.read_pdf("firm_xxx_10K.pdf", pages='100-101', guess=True, stream=True, columns=(144, 210, 300, 340, 380, 420, 450))

With stream=True, I get all the data, but information that spans multiple rows within a single cell is recognized as separate entries. With lattice=True, the cells with multiple rows are correctly recognized as one cell, but the results miss a lot of observations.
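To make the stream=True symptom concrete, here is a minimal pandas sketch of how the split rows could be folded back together after extraction. It rests on an assumption that may not hold for this table: that a continuation line leaves one anchor column empty (the first column by default). The names merge_continuation_rows and key_col, and the sample data, are made up for illustration and are not part of tabula-py.

```python
import pandas as pd


def merge_continuation_rows(df, key_col=0):
    """Fold rows whose anchor cell is empty into the preceding row.

    Assumption (hypothetical): a row starts a new record only when the
    cell in position `key_col` is non-empty.
    """
    key = df.iloc[:, key_col]
    # True where the anchor cell holds real text, i.e. a new record starts.
    is_start = key.notna() & (key.astype(str).str.strip() != "")
    # Running count of record starts gives one group id per logical row.
    group_id = is_start.cumsum().to_numpy()

    def join_cells(cells):
        # Join the non-empty fragments of a column within one record.
        parts = [str(v).strip() for v in cells if pd.notna(v) and str(v).strip()]
        return " ".join(parts)

    return df.groupby(group_id).agg(join_cells).reset_index(drop=True)


# Made-up data that mimics a multi-line cell split across two rows.
raw = pd.DataFrame({
    "name": ["Alpha Corp", None, "Beta LLC"],
    "desc": ["Senior notes", "due 2030", "Term loan"],
    "amount": ["100", None, "250"],
})
out = merge_continuation_rows(raw, key_col=0)
print(out)
```

If the table has no column that is reliably filled on the first line of every record, this heuristic would merge rows incorrectly, so it is only a starting point.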

Is there a better way to set the options? I tried many options, but now I am stuck. Any help is much appreciated. Best,

Example of the Table I am Trying to Read

