How to extract multiple tables from one PDF file using Pandas and tabula-py

Question

Can someone help me extract multiple tables from ONE PDF file? I have 5 pages, and every page has a table with the same column headers, for example:

Example of the table on every page:

student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4

I want to extract all these tables into one dataframe. First I did:

df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)

But I got a messy output that looks like this:

[student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4 ,student  Score Rang
Maxim    43     34
Nourah   93     5]

So I edited my code like this:

    import pandas as pd
    import tabula

    file_path = "filePath.pdf"

    # read each page of the file separately
    df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
    df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
    df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
    df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
    df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)

It gives me a dataframe for each table, but I don't know how to combine them into one single dataframe. Is there also a way to avoid repeating the same line of code?

mozway · Accepted Answer · 2021-07-16T12:46:32.623

3

According to the tabula-py documentation, read_pdf returns a list when passed the multiple_tables=True option.

Thus, you can use pandas.concat on its output to concatenate the dataframes:

df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
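
As a minimal sketch of what the concatenation does, the list below stands in for the output of read_pdf (so no PDF is needed to run it); the values come from the sample tables in the question. The `ignore_index=True` argument is an optional addition, not part of the line above: it renumbers the rows so that each page's index does not restart at 0.

```python
import pandas as pd

# Stand-ins for the list returned by tabula.read_pdf(..., multiple_tables=True);
# the values are copied from the sample tables in the question.
tables = [
    pd.DataFrame({"student": ["Alex", "Julia", "Mariana"],
                  "Score": [50, 80, 94],
                  "Rang": [23, 12, 4]}),
    pd.DataFrame({"student": ["Maxim", "Nourah"],
                  "Score": [43, 93],
                  "Rang": [34, 5]}),
]

# Stack the per-page tables vertically; ignore_index=True gives a fresh
# 0..n-1 row index instead of repeating each page's own index.
combined = pd.concat(tables, ignore_index=True)
print(combined)
```

Without `ignore_index=True` the result keeps the original row labels (0, 1, 2, 0, 1), which is usually harmless but can be confusing when selecting rows by label.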

edited Jul 16 '21 at 12:46

answered Jul 16 '21 at 12:06


I tried this too, but I got an error: `TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid` – Learner Jul 16 '21 at 12:11
What does this command return: `type(tabula.read_pdf(file_path,pages=1,multiple_tables=True))`? I suspect it is a list because of the `multiple_tables=True` option, and that you need to take the first item. If it returns `list`, please also provide the result of: `type(tabula.read_pdf(file_path,pages=1,multiple_tables=True)[0])` – mozway Jul 16 '21 at 12:15
According to [tabula's documentation](https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf), read_pdf returns a list. See my updated answer – mozway Jul 16 '21 at 12:20
Please, can you also give me the output of `pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))`? – mozway Jul 16 '21 at 12:22
Yes, it returns a list. `pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))` works – Learner Jul 16 '21 at 12:43
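
One plausible cause of the TypeError mentioned in the comments, assuming the per-page results (df1 to df5 in the question) were passed to pd.concat together: each of them is itself a list, so pd.concat receives a list of lists. The sketch below reproduces that situation with hypothetical stand-in data (`page1`, `page2`) and flattens the lists before concatenating.

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the question: each read_pdf call
# with multiple_tables=True yields a list of DataFrames, not a DataFrame.
page1 = [pd.DataFrame({"student": ["Alex"], "Score": [50], "Rang": [23]})]
page2 = [pd.DataFrame({"student": ["Maxim"], "Score": [43], "Rang": [34]})]

# Passing a list of lists fails, because the items are lists, not DataFrames.
try:
    pd.concat([page1, page2])
    raised = False
except TypeError:
    raised = True

# Flatten the per-page lists into one list of DataFrames, then concatenate.
flat = list(chain.from_iterable([page1, page2]))
combined = pd.concat(flat, ignore_index=True)
print(combined)
```

Reading all pages in one call with `pages='all'`, as in the answer, avoids the nesting altogether.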
