
I have a PDF file with tables in it and would like to read it as a dataframe using tabula, but only the first page has a column header. When reading with

tabula.read_pdf(pdf_file, pages='all', lattice=True)

the data comes out in the desired format and all the pages are extracted properly. However, when using

pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', lattice=True))

only some of the rows are shown.

arvin
  • Welcome to SO! You will need to provide some data in order for anyone to help you. You could, for instance, post a part of the output of ```tabula.read_pdf()``` where you know the ```pd.DataFrame()``` part misses rows. – Serge de Gosson de Varennes Nov 30 '22 at 10:04

1 Answer


You should actually do it this way (assuming your PDF doesn't contain both text and tables):

table = tabula.read_pdf(pdf_file, pages='all', output_format="dataframe", lattice=True)
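As far as I know, recent tabula-py versions return a list of DataFrames from `read_pdf` (one per detected table) rather than a single DataFrame, so the result usually has to be combined with `pd.concat` instead of being wrapped in `pd.DataFrame(...)`. The sketch below is one possible way to do that when only the first page carries the header. It assumes every page is read with `pandas_options={"header": None}` so that no data row is mistaken for a header, and `combine_pages` is a hypothetical helper name, not part of tabula or pandas.

```python
import pandas as pd


def combine_pages(tables):
    """Stack per-page tables that were read without headers.

    Assumption: the first row of the first non-empty table holds the
    column names, and all tables have the same number of columns.
    """
    # Skip empty tables that the extractor may return for blank regions.
    frames = [t for t in tables if not t.empty]
    combined = pd.concat(frames, ignore_index=True)

    # Promote the first row to column names and drop it from the data.
    header = list(combined.iloc[0])
    body = combined.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body


# Small in-memory stand-in for what tabula might return for two pages.
page1 = pd.DataFrame([["id", "name"], [1, "a"]])
page2 = pd.DataFrame([[2, "b"]])
df = combine_pages([page1, page2])

# With tabula-py (pdf_file is assumed to be the path to your PDF):
# import tabula
# tables = tabula.read_pdf(pdf_file, pages="all", lattice=True,
#                          pandas_options={"header": None})
# df = combine_pages(tables)
```

If the pages do not all have the same number of columns, `pd.concat` will still run but will pad the missing cells with NaN, so it is worth checking the shape of each table first.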
  • Thanks, I am now able to extract the complete data, but I get an error while concatenating two dataframes: TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid – arvin Nov 30 '22 at 10:29
  • What is your full code? Somewhere in the PDF, there must be text that is viewed as a list. As I mentioned in my comment on your question, it is hard to help without knowing what the PDF looks like. Can you add some info to your question? – Serge de Gosson de Varennes Nov 30 '22 at 10:41
  • https://ibb.co/gtYHt13 - link to a snapshot of the pdf – arvin Nov 30 '22 at 11:32