Line break in a PDF table breaking tabula-py

Question

I'm using tabula-py to extract a table from a PDF file. This kind of PDF (which I need to parse every month) has around 40 pages, though the count varies. My code works fine for the first 20 pages, which follow a consistent layout. However, around page 30 the output isn't what I want. Here's an image example of the table:

What happens is that the second column, B, contains a line break, which gives the following output:

My code turns each table into a CSV, which I then open in Excel.

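import tabula

# Read every table tabula-py detects in the PDF (returns a list of DataFrames)
dfs = tabula.read_pdf(arquivo_nome, pages="all")
# Write each extracted table to its own CSV file
for i, df in enumerate(dfs):
    df.to_csv(f"page_{i+1}.csv", index=False)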

I've tried both lattice=True and lattice=False, but neither works. I would like to know what I can do to make the output look like the table (I don't need the line break).

I have no experience with tabula, but I solved a similar problem using PyMuPDF here: https://stackoverflow.com/questions/75112240/convert-pdf-tables-to-csv/75115409#75115409 - Jorj McKie, Jan 14 '23 at 02:48

score 0 · Answer 1 · edited Feb 27 '23 at 12:01


Try the following. If it doesn't work, please share a sample PDF so I can sort it out for you here. For example, for the first table:

dfs = tabula.read_pdf(arquivo_nome, pages="all")
df = dfs[0]  # first extracted table only
# Replace the carriage return inside column B's cells with a space
df['B'] = df['B'].str.replace('\r', ' ')
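The snippet above only touches column B of the first table. A minimal sketch of how the same idea might be applied to every column of every extracted table follows. It assumes the line break arrives in the cell text as \r, \n or \r\n. The helper names flatten_cell, flatten_table and export_pages are hypothetical and not part of tabula-py or pandas.

```python
import re

# Assumption: in-cell line breaks show up as \r, \n or \r\n in the extracted text.
_LINE_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def flatten_cell(value):
    """Collapse line breaks inside a string into single spaces.

    Non-string values (numbers, NaN, None) are returned unchanged.
    """
    if isinstance(value, str):
        return _LINE_BREAKS.sub(" ", value).strip()
    return value


def flatten_table(df):
    """Return a copy of a pandas DataFrame with line breaks removed
    from both the cell values and the column headers."""
    cleaned = df.apply(lambda col: col.map(flatten_cell))
    cleaned.columns = [flatten_cell(name) for name in df.columns]
    return cleaned


def export_pages(pdf_path):
    """Hypothetical driver: read all tables, clean them, write one CSV each."""
    import tabula  # imported lazily so the helpers above work without tabula-py

    dfs = tabula.read_pdf(pdf_path, pages="all")
    for i, df in enumerate(dfs):
        flatten_table(df).to_csv(f"page_{i+1}.csv", index=False)
```

Calling export_pages(arquivo_nome) would replace the loop from the question. Note that this only helps if tabula keeps the wrapped text in a single cell. If it splits the wrapped text into separate rows, the rows would need to be merged instead.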

edited Feb 27 '23 at 12:01 by Nonlinear

answered Feb 22 '23 at 10:15 by RonyRyan
