I'm using tabula-py to extract a table from a PDF file. This kind of PDF (which I need to parse every month) has around 40 pages (but it varies). My code works just fine for the first 20 pages, which follow a consistent layout. However, by page 30 the output isn't what I want. Here's an example image of the table:

enter image description here

What happens is that the second column, B, has a line break inside the cell, which gives the following output:

enter image description here

My code writes each table to a CSV file, which I then open in Excel.

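import tabula

# Read every table from every page of the PDF
dfs = tabula.read_pdf(arquivo_nome, pages="all")

# Write each extracted table to its own CSV file
for i, df in enumerate(dfs):
    df.to_csv(f"page_{i+1}.csv", index=False)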

I've tried both lattice=True and lattice=False, but neither works. I would like to know what I can do to make the output look like the original table (I don't need the line break).

viniwata1
  • I have no experience with tabula, but I solved a similar problem using PyMuPDF here: https://stackoverflow.com/questions/75112240/convert-pdf-tables-to-csv/75115409#75115409 – Jorj McKie Jan 14 '23 at 02:48

1 Answer

Try the following, or else share a sample PDF so I can help you sort it out here. For example, for the first table:

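import tabula

dfs = tabula.read_pdf(arquivo_nome, pages="all")
df = dfs[0]  # first extracted table
# Replace the in-cell line breaks (returned as '\r') in column B with spaces
df['B'] = df['B'].str.replace('\r', ' ')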
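
Since the question loops over every table rather than just the first, the same replacement can be applied to all of them before writing the CSV files. The following is only a sketch: flatten_line_breaks is a hypothetical helper name, and it assumes (as the snippet above does) that the break arrives as a '\r' or '\n' character inside a single cell. If tabula has instead split the wrapped text into separate rows, this will not help. The sample DataFrame is hand-made and stands in for one table returned by tabula.read_pdf.

```python
import re

import pandas as pd

# One or more line-break characters, plus any whitespace around them
_BREAKS = re.compile(r"\s*[\r\n]+\s*")


def _flatten(value):
    """Replace in-cell line breaks with one space; leave non-strings alone."""
    if isinstance(value, str):
        return _BREAKS.sub(" ", value).strip()
    return value


def flatten_line_breaks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df without in-cell line breaks."""
    cleaned = df.copy()
    # Header cells can wrap too, so clean the column names as well
    cleaned.columns = [_flatten(col) for col in cleaned.columns]
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].map(_flatten)
    return cleaned


# Hand-made stand-in for one table extracted from the PDF
sample = pd.DataFrame({"A": [1, 2], "B": ["first\rline", "no break"]})
cleaned_sample = flatten_line_breaks(sample)
print(cleaned_sample)
```

In the loop from the question, this would mean calling flatten_line_breaks(df).to_csv(f"page_{i+1}.csv", index=False) for each df in dfs.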
Nonlinear
RonyRyan