
I'm using the tabula-py package in Python 3 to extract data from tables in PDFs.

I am trying to import tables from multiple PDFs online (e.g. http://trreb.ca/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf), but I am having trouble getting even one table imported properly.

Here is the code that I have run:

! pip install -q tabula-py
! pip install pandas

import pandas as pd
from tabula import read_pdf

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"

data = read_pdf(pdf, output_format='dataframe', pages="all")

data

which gives the following output:

[        Community  Sales Dollar Volume  ... Active Listings Avg. SP/LP  Avg. DOM
 0            Ajax    391  $265,999,351  ...              73       100%        21
 1    Central East     32   $21,177,488  ...               3        99%        26
 2  Northeast Ajax     70   $50,713,199  ...              18       100%        21
 3      South East    105   $68,203,487  ...              15       100%        20

 [4 rows x 9 columns]]

This seems to work, except that it has missed every other row after "Central East". Here is the actual table in question, from the PDF at the URL in the code above (screenshot: Ajax Q4 2019).

I have also tried fiddling with some of the options of the read_pdf function, with minimal results.

The end goal will be a script that loops through all these "Community Reports" (there are quite a few), pulling all such tables from the pdfs, and consolidating them into one dataframe in python for analysis.
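A sketch of what that loop might eventually look like (note: the URL pattern and the area/community names below are assumptions inferred from the one report linked above, and the tabula call requires tabula-py plus Java to be installed):

```python
import pandas as pd

def report_url(area, community, year, quarter):
    """Build a Community Report URL from the observed pattern (assumption)."""
    return (f"http://trebhome.com/files/market-stats/community-reports/"
            f"{year}/{quarter}/{area}/{community}{quarter}{year}.pdf")

def load_reports(specs):
    """Pull the table from each report and stack them into one DataFrame."""
    from tabula import read_pdf  # imported here; needs tabula-py and Java
    frames = []
    for area, community, year, quarter in specs:
        url = report_url(area, community, year, quarter)
        # lattice mode follows the table's ruled lines, which these reports have
        for df in read_pdf(url, pages="all", lattice=True):
            df["Source"] = community  # track which report each row came from
            frames.append(df)
    return pd.concat(frames, ignore_index=True)

# e.g. load_reports([("Durham", "Ajax", 2019, "Q4"),
#                    ("Durham", "Whitby", 2019, "Q4")])
```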

If the question isn't clear, or more info is needed, please let me know! I'm new to both python and stack exchange, so apologies if I'm not framing things correctly.

And of course any help would be greatly appreciated!

Bryn


1 Answer


The following code almost worked:

import pandas as pd
from tabula import convert_into

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"

# lattice mode follows the table's ruled cell borders, which this report has
convert_into(pdf, "test.csv", pages="all", lattice=True)

# tabula writes a stray first line; rewrite the CSV without it
with open("test.csv", 'r') as f:
    with open("updated_test.csv", 'w') as f1:
        next(f)  # skip header line
        for line in f:
            f1.write(line)

data = pd.read_csv("updated_test.csv")

# rename first column, drop unwanted rows
data.rename(columns={'Unnamed: 0': 'Community'}, inplace=True)
data.dropna(inplace=True)

data

and gives output:

Community   Year    Quarter Sales   Dollar Volume   Average Price   Median Price    New Listings    Active Listings Avg. SP/LP
1   Central 2019    Q4  44.0    $27,618,950 $627,703    $630,500    67.0    8.0 99%
2   Central East    2019    Q4  32.0    $21,177,488 $661,797    $627,450    34.0    3.0 99%
3   Central West    2019    Q4  57.0    $40,742,450 $714,780    $675,000    65.0    7.0 99%
4   Northeast Ajax  2019    Q4  70.0    $50,713,199 $724,474    $716,500    82.0    18.0    100%
5   Northwest Ajax  2019    Q4  49.0    $37,192,790 $759,037    $765,000    63.0    14.0    99%
6   South East  2019    Q4  105.0   $68,203,487 $649,557    $640,000    117.0   15.0    100%
7   South West  2019    Q4  34.0    $20,350,987 $598,558    $590,000    36.0    8.0 99%

The only issue is that the last column, "Avg. DOM", wasn't picked up by the convert_into command.

For my analysis this doesn't matter, but it could definitely be an issue for others trying to pull tables in a similar manner.
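One more thing worth noting for the analysis step: tabula delivers the currency and percentage columns as strings, so they will likely need converting to numbers before any aggregation. A small pandas sketch (the column names and values are taken from the output above):

```python
import pandas as pd

# Two rows copied from the extracted table, as tabula returns them
df = pd.DataFrame({
    "Community": ["Central East", "Northeast Ajax"],
    "Dollar Volume": ["$21,177,488", "$50,713,199"],
    "Avg. SP/LP": ["99%", "100%"],
})

# Strip "$" and thousands separators, then cast to integers
df["Dollar Volume"] = (df["Dollar Volume"]
                       .str.replace(r"[$,]", "", regex=True)
                       .astype("int64"))

# Turn "99%" into the fraction 0.99
df["Avg. SP/LP"] = df["Avg. SP/LP"].str.rstrip("%").astype(float) / 100
```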
