I'm using the tabula package (installed as tabula-py) in Python 3 to get data from tables in PDFs.
I am trying to import tables from multiple pdfs online (e.g. http://trreb.ca/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf), but I am having trouble even getting one table imported properly.
Here is the code that I have run:
! pip install -q tabula-py
! pip install pandas
import pandas as pd
import tabula
from tabula import read_pdf
pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
data = read_pdf(pdf, output_format='dataframe', pages="all")
data
which gives the following output:
[ Community Sales Dollar Volume ... Active Listings Avg. SP/LP Avg. DOM
0 Ajax 391 $265,999,351 ... 73 100% 21
1 Central East 32 $21,177,488 ... 3 99% 26
2 Northeast Ajax 70 $50,713,199 ... 18 100% 21
3 South East 105 $68,203,487 ... 15 100% 20
[4 rows x 9 columns]]
This seems to work, except that it has missed every other row after "Central East". Here is the actual table in question, from the PDF at the URL in the code above: Ajax Q4 2019
I have also tried fiddling with some of the options in the read_pdf function, with minimal results.
The end goal will be a script that loops through all these "Community Reports" (there are quite a few), pulling all such tables from the pdfs, and consolidating them into one dataframe in python for analysis.
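To make that end goal concrete, here is a rough sketch of the loop I have in mind. It assumes read_pdf returns a list of DataFrames per PDF (as in the output above). The names consolidate_reports, reader, and the "source" column are made up for illustration, and this does nothing to fix the missing rows.

```python
import pandas as pd


def consolidate_reports(urls, reader=None):
    """Read every table from each PDF URL and stack them into one DataFrame.

    `reader` is a hypothetical hook: any callable that takes a URL and
    returns a list of DataFrames. By default it wraps tabula's read_pdf.
    """
    if reader is None:
        # Imported lazily so the helper can be tried without tabula/Java installed.
        from tabula import read_pdf

        def reader(url):
            return read_pdf(url, pages="all", multiple_tables=True)

    frames = []
    for url in urls:
        for table in reader(url):
            table = table.copy()
            table["source"] = url  # remember which report each row came from
            frames.append(table)

    # pd.concat raises on an empty list, so return an empty frame instead.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

It would be called with a list of report URLs, for example consolidate_reports([pdf]) using the pdf variable from the code above.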
If the question isn't clear, or more info is needed, please let me know! I'm new to both Python and Stack Exchange, so apologies if I'm not framing things correctly.
And of course any help would be greatly appreciated!
Bryn