reading pdf file using tabula

Question

I have a PDF file with tables in it and would like to read it into a DataFrame using tabula. Only the first page has a column header. When I read it with

tabula.read_pdf(pdf_file, pages='all', lattice=True)

the data comes out in the desired format and all the pages are extracted properly. However, when I wrap the call in pd.DataFrame,

pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', lattice=True))

only some of the rows are shown.

Welcome to SO! You will need to provide some data in order for anyone to help you. You could, for instance, post a part of the output of ```tabula.read_pdf()``` where you know the ```pd.DataFrame()``` part misses rows. – Serge de Gosson de Varennes, Nov 30 '22 at 10:04

score 0 · Answer 1 · answered Nov 30 '22 at 10:08

You should actually do it this way (assuming your PDF doesn't contain both text and tables):

table = tabula.read_pdf(pdf_file, pages='all', output_format="dataframe", lattice=True)
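Since only the first page carries the column header, here is a minimal sketch of one way to stack the per-page tables into a single DataFrame. It assumes that `read_pdf` hands back a list of DataFrames (one per detected table, which as far as I know is the default in recent tabula-py versions), that every page has the same number of columns, and that on the header-less pages the first data row was parsed as the column names. The helper name `combine_pages` is made up for illustration and is not part of tabula.

```python
import pandas as pd


def combine_pages(tables):
    """Stack per-page tables, reusing the header of the first table.

    `tables` is assumed to be a list of DataFrames such as the one
    tabula.read_pdf(..., pages='all') is expected to return.
    """
    if not tables:
        return pd.DataFrame()

    header = list(tables[0].columns)
    frames = [tables[0]]

    for page in tables[1:]:
        # Assumption: this page had no header, so its first data row
        # ended up as the column names. Put it back as a regular row
        # (its values come back as strings).
        first_row = pd.DataFrame([list(page.columns)], columns=header)

        # Rename the remaining rows to the real header. This raises a
        # ValueError if the page has a different number of columns.
        body = page.copy()
        body.columns = header

        frames.extend([first_row, body])

    return pd.concat(frames, ignore_index=True)


# Hypothetical usage with tabula-py (needs Java and a real file):
# import tabula
# tables = tabula.read_pdf(pdf_file, pages='all', lattice=True)
# df = combine_pages(tables)
```

If the later pages are parsed with a proper header after all, a plain `pd.concat(tables, ignore_index=True)` should be enough.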

Serge de Gosson de Varennes

7,162
3
18
39

Thanks, mate. I can now extract the complete data, but I get an error while concatenating two DataFrames: TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid – arvin Nov 30 '22 at 10:29
What is your full code? Somewhere in the PDF, there must be text that is read as a list. As I mentioned in my comment on your question, it is hard to help without knowing what the PDF looks like. Can you add some info to your question? – Serge de Gosson de Varennes Nov 30 '22 at 10:41
https://ibb.co/gtYHt13 - link to a snapshot of the pdf – arvin Nov 30 '22 at 11:32
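
On the concatenation error mentioned in the comments: `pd.concat` only accepts Series and DataFrame objects, so passing it the lists returned by two separate `read_pdf` calls fails with that kind of TypeError. A rough sketch of one way around it, assuming each result is either a single DataFrame or a list of DataFrames (the helper name `flatten_tables` is hypothetical):

```python
import pandas as pd


def flatten_tables(*results):
    """Collect DataFrames from several read_pdf-style results into one flat list."""
    frames = []
    for result in results:
        if isinstance(result, pd.DataFrame):
            # A single table: keep it as it is.
            frames.append(result)
        else:
            # Assumed to be a list of tables: unpack it, checking each item.
            for item in result:
                if not isinstance(item, pd.DataFrame):
                    raise TypeError(f"expected a DataFrame, got {type(item).__name__}")
                frames.append(item)
    return frames


# Hypothetical usage, where first and second are two read_pdf results:
# combined = pd.concat(flatten_tables(first, second), ignore_index=True)
```

Printing `type(x)` for each object before calling `pd.concat` is a quick way to confirm whether a list slipped in.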
