I am parsing a bank statement with tabula-py. The columns are separated by vertical rules, but the rows are not separated, so I use stream mode. However, if a column has no entries on a page, tabula merges it with its neighbour into a single column. This is the code:

tables=tabula.read_pdf("pdfname.pdf",pages='all')

So I used the columns option to set the column boundaries manually:

tables=tabula.read_pdf("pdfname.pdf",pages='all',columns= ['27.0,68.0,272.0,357.5,397.0,474.5,553.0,631.0'])

But it has no effect: tabula does not seem to read the option at all, and the output is the same as before. Sorry, I cannot post the table for privacy reasons.

[My table looks roughly like the one in this image: https://i.stack.imgur.com/f40V0.png]

2 Answers

The columns keyword argument should be a list of numbers, not a list containing a single comma-separated string:

import tabula

# Pass each x coordinate as its own float, not as one quoted string.
tables = tabula.read_pdf("pdfname.pdf",
                         pages='all',
                         columns=[27.0, 68.0, 272.0, 357.5, 397.0, 474.5, 553.0, 631.0])
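If the coordinates already exist as one comma-separated string, as in the question, they can be converted before the call. This is only a sketch: parse_column_boundaries and read_with_columns are hypothetical helper names, and passing stream=True and guess=False is an assumption on my part (the question mentions stream mode, and disabling guessing is commonly suggested when boundaries are given by hand), not something confirmed for this PDF.

```python
def parse_column_boundaries(spec):
    """Turn a string like '27.0,68.0' into a list of floats: [27.0, 68.0]."""
    return [float(part) for part in spec.split(",") if part.strip()]


def read_with_columns(pdf_path, boundaries):
    """Hypothetical wrapper around tabula.read_pdf with explicit columns."""
    # Imported lazily so parse_column_boundaries works without tabula-py.
    import tabula

    # Assumption: stream mode with table guessing turned off, so that the
    # manually supplied x coordinates are the only column boundaries used.
    return tabula.read_pdf(
        pdf_path,
        pages="all",
        stream=True,
        guess=False,
        columns=boundaries,
    )


boundaries = parse_column_boundaries(
    "27.0,68.0,272.0,357.5,397.0,474.5,553.0,631.0"
)
# tables = read_with_columns("pdfname.pdf", boundaries)
```

The last line is commented out because it needs the actual PDF; boundaries is a plain list of eight floats at that point.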

As far as I know, tabula-py is just a wrapper around tabula-java, so its extraction accuracy is the same as that of the Tabula app. Try pdfplumber instead.
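A rough sketch of what that could look like with pdfplumber, using its explicit vertical lines setting so that empty columns are not merged. The helper names explicit_column_settings and extract_rows are made up for illustration, and reusing the x coordinates from the question assumes they are measured in PDF points from the left page edge, which I have not checked against the actual file.

```python
def explicit_column_settings(x_boundaries):
    """Build pdfplumber table settings with fixed vertical column lines."""
    return {
        # Columns come only from the x coordinates we supply.
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": list(x_boundaries),
        # Rows are inferred from text alignment, since the statement
        # has no horizontal rules between rows.
        "horizontal_strategy": "text",
    }


def extract_rows(pdf_path, x_boundaries):
    """Hypothetical helper: collect table rows from every page of the PDF."""
    # Imported lazily so the settings helper works without pdfplumber.
    import pdfplumber

    settings = explicit_column_settings(x_boundaries)
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=settings)
            if table:  # extract_table returns None when nothing is found
                rows.extend(table)
    return rows


settings = explicit_column_settings(
    [27.0, 68.0, 272.0, 357.5, 397.0, 474.5, 553.0, 631.0]
)
# rows = extract_rows("pdfname.pdf", settings["explicit_vertical_lines"])
```

Each row comes back as a list of cell strings (or None for empty cells), so a column with no entries on a page should still keep its own position.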

blackbrandt
chezou