I am trying to load a large table (an example is attached) from a Form 10-K into Python using tabula-py. The table does not have clear borders and has a lot of blank cells, which causes several issues.
My code is:
import tabula
df = tabula.read_pdf("firm_xxx_10K.pdf", pages='100-101', guess=True, stream=True, columns=(144, 210, 300, 340, 380, 420, 450))
With stream=True, I get all the data, but text that spans multiple rows within a single cell is recognized as separate entries. With lattice=True, the cells that span multiple rows are correctly recognized as one cell, but the results are missing a lot of observations.
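To make the stream=True symptom concrete, here is a minimal sketch of the kind of post-processing that might repair it. It assumes a wrapped cell shows up as an extra row whose first column is empty, which may not hold for every table. The function name `merge_continuation_rows` and the sample rows are made up for illustration and are not taken from the filing or from tabula-py.

```python
def merge_continuation_rows(rows):
    """Fold rows with an empty first cell into the row above them.

    Assumption (not tabula-py behavior): a multi-line cell appears as an
    extra row with an empty first column. Cells are strings or None, and
    all rows have the same number of columns.
    """
    merged = []
    for row in rows:
        first_cell = (row[0] or "").strip()
        if merged and not first_cell:
            # Continuation row: append each non-empty cell to the row above.
            previous = merged[-1]
            for i, cell in enumerate(row):
                text = (cell or "").strip()
                if text:
                    previous[i] = f"{previous[i]} {text}".strip()
        else:
            # Regular row: keep it, with whitespace trimmed.
            merged.append([(cell or "").strip() for cell in row])
    return merged


# Hypothetical rows imitating a stream-mode result (not real 10-K data).
sample = [
    ["Item", "Description", "Amount"],
    ["Lease", "Office space in", "1,200"],
    ["", "two locations", ""],
    ["Debt", "Senior notes", "3,400"],
]
result = merge_continuation_rows(sample)
```

If the extracted table is a pandas DataFrame, something like `df.fillna("").astype(str).values.tolist()` should produce rows in the shape this sketch expects, though I have not checked that against the actual output.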
Is there a better way to set these options? I have tried many combinations and am now stuck. Any help is much appreciated.