I have a PDF file with tables in it and would like to read it as a DataFrame using tabula. However, only the first PDF page has a column header, so for every DataFrame after page 1 the first row of data is taken as the header. Is there any way to apply the header from the page 1 DataFrame to the rest of the DataFrames? Thanks in advance. Much appreciated!

Matthias Gallagher

1 Answer

You can solve this with the following steps:

  1. Read the PDF:

    tables = tabula.read_pdf(filename, pages='all', pandas_options={'header': None})

This creates a list of DataFrames, one per table that tabula finds; here that is one per page.

`pandas_options={'header': None}` tells pandas not to treat the first row of each table as its header.

As a result, the header of the first page becomes the first row of the first DataFrame in the `tables` list.

  2. Save the header in a variable:

    cols = tables[0].values.tolist()[0]

This creates a list named `cols` containing the first row of the first DataFrame in `tables`, which is our header.

  3. Remove the first row of the first page:

    tables[0] = tables[0].iloc[1:]

This line removes the first row of the first DataFrame (page) in `tables`. Since the header is already stored in a variable, the row is no longer needed.

  4. Assign the header to all the pages:

    for df in tables:
        df.columns = cols

This loop iterates over every DataFrame (page) and assigns it the header stored in `cols`.

The header from the page 1 DataFrame is thereby applied to the rest of the DataFrames (pages).

You can also concatenate them into a single DataFrame with

    import pandas as pd

and:

    df_Final = pd.concat(tables)
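
Taken together, the steps above can be sketched as one helper. This is only a sketch: the function name `apply_first_page_header` and the sample data are made up for illustration, and the small hand-built DataFrames stand in for the list that `tabula.read_pdf(..., pandas_options={'header': None})` would return. Passing `ignore_index=True` to `pd.concat` is an optional addition here so that the combined frame gets a fresh row index.

```python
import pandas as pd


def apply_first_page_header(tables):
    """Use the first row of the first table as the header of every table.

    Hypothetical helper: `tables` is a list of headerless DataFrames.
    """
    if not tables:
        return pd.DataFrame()
    # Step 2: the first row of page 1 holds the column names.
    cols = tables[0].iloc[0].tolist()
    # Step 3: drop that row from page 1; copy the rest so inputs stay untouched.
    pages = [tables[0].iloc[1:].copy()] + [df.copy() for df in tables[1:]]
    # Step 4: assign the header to every page.
    for df in pages:
        df.columns = cols  # raises ValueError if a page has a different width
    return pd.concat(pages, ignore_index=True)


# Stand-in data (assumed shape, not taken from a real PDF)
page1 = pd.DataFrame([["name", "country"], ["Ann", "CA"]])
page2 = pd.DataFrame([["Bob", "US"]])
combined = apply_first_page_header([page1, page2])
```

The helper assumes every page has the same number of columns as page 1; if tabula extracts a page with fewer columns, the assignment in step 4 fails with a length mismatch.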

Hope this helps you, thanks for this opportunity.

  • Thank you very much for your answer, Kathan! I really appreciate it. However, I am stuck at step 4 with a 'ValueError: Length mismatch: Expected axis has 6 elements, new values have 9 elements'. I assume this is due to the PDF's structure. Can we take this PDF as an example for this operation? https://www.international.gc.ca/world-monde/assets/pdfs/international_relations-relations_internationales/sanctions/sema-lmes.pdf – Matthias Gallagher Mar 09 '21 at 02:09
  • Yes, in our answer Tabula tries to extract the data on its own, but when a column is entirely empty on a page, it drops that column. That is why, on page 12 of the mentioned PDF, 3 empty columns are dropped and we get the error of expected 9 but has 6 elements. As a workaround, you can use the 'columns' parameter. Look into my answer at https://stackoverflow.com/questions/66703584/python-tabula-for-table-with-no-distinct-table-lines/67104421#67104421, or else I will provide proper coordinates in the near future. – Kathan Thakkar Apr 15 '21 at 08:56
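
To see which extracted tables would trigger the length mismatch described in the comments, a quick check of each DataFrame's column count before step 4 can help. The sketch below is an illustration only: `find_mismatched_pages` is a hypothetical helper name, and the sample frames are invented stand-ins for what tabula returns.

```python
import pandas as pd


def find_mismatched_pages(tables, expected_width):
    """Return (list_index, column_count) for each table whose width differs."""
    return [
        (i, df.shape[1])
        for i, df in enumerate(tables)
        if df.shape[1] != expected_width
    ]


# Stand-in data: the middle table is missing a column
tables = [
    pd.DataFrame([["a", "b", "c"]]),
    pd.DataFrame([["d", "e"]]),
    pd.DataFrame([["f", "g", "h"]]),
]
mismatched = find_mismatched_pages(tables, expected_width=tables[0].shape[1])
```

The indices are positions in the `tables` list, which match page numbers only if tabula returned exactly one table per page.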