I'm using tabula-py to extract a table from a PDF file. This kind of PDF (which I need to parse every month) has around 40 pages, though the count varies. My code works fine for the first 20 pages, which follow a consistent layout. However, from around page 30 onward the output isn't what I want. Here's an example image of the table:
What happens is that the second column, B, contains a line break within a cell, which gives the following output:
My code writes each extracted table to a CSV file, which I then open in Excel.
import tabula

# read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf(arquivo_nome, pages="all")
for i, df in enumerate(dfs):
    df.to_csv(f"page_{i+1}.csv", index=False)
I've tried both lattice=True and lattice=False, but neither fixes it. I would like to know what I can do to make the output match the table (I don't need to keep the line break).
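For reference, one possible workaround is to post-process the DataFrames with pandas instead of changing the extraction. The sketch below is not part of tabula-py and rests on an assumption that has not been confirmed against the real PDF: that a wrapped cell shows up as an extra row whose first column is empty (NaN) and whose column B holds the rest of the text. The helper name `merge_wrapped_rows` and the sample data are hypothetical.

```python
import pandas as pd


def merge_wrapped_rows(df, key_col=None, sep=" "):
    """Fold continuation rows (empty key_col) into the row above them.

    Assumption: a wrapped cell appears as an extra row whose key column
    is NaN. Merged cells are returned as strings.
    """
    if key_col is None:
        key_col = df.columns[0]

    # Every non-empty key starts a new logical row; continuation rows
    # inherit the group number of the row above them.
    group_ids = df[key_col].notna().cumsum().to_numpy()

    def join_cells(cells):
        # Keep only the non-empty fragments and join them with `sep`.
        parts = [str(c).strip() for c in cells if pd.notna(c)]
        return sep.join(parts) if parts else None

    merged = df.groupby(group_ids, sort=False).agg(join_cells)
    return merged.reset_index(drop=True)


# Hypothetical sample that mimics the broken output described above.
sample = pd.DataFrame(
    {
        "A": ["1", None, "2"],
        "B": ["foo", "bar", "baz"],
        "C": ["x", None, "y"],
    }
)
merged = merge_wrapped_rows(sample)
print(merged)
```

If that assumption holds, the helper could be applied to each `df` in the loop before calling `to_csv`. If the continuation rows look different in the real output (for example, the key column is an empty string rather than NaN), the grouping condition would need to be adjusted.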