Scraping tables from various PDF-files

Asked Nov 16 '21 at 10:04

Active Nov 27 '21 at 18:54

Viewed 131 times

I am trying to loop over several multi-page PDF files and extract their tables cleanly into Excel files. However, camelot and tabula are unable to process the PDF files:

# pip install --upgrade camelot-py[cv] tabula-py excalibur-py

import tabula as tb
import camelot
import pandas as pd
import os

BASE_PATH = os.path.dirname((os.path.abspath(r"...")))

FOLDER_PATH = os.path.join(BASE_PATH, r"...")

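# os.listdir returns bare file names, so join each one with the folder it was listed from
# (os.path.abspath(x) would resolve them against the current working directory instead)
PDF_DIR = r"..."
pdfs = [os.path.join(PDF_DIR, x) for x in os.listdir(PDF_DIR) if x.endswith(".pdf")]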

#

listoflengths = []

def len_table(filepath):
    tables = camelot.read_pdf(filepath, flavor='stream', columns=['300'], split_text=True)
    tablelength = len(tables)
    listoflengths.append(tablelength)

#    

pdfs[0]

len_table(pdfs[1])

# print(listoflengths)

Is there any solution to this? I want to avoid manually copying tables from PDF files into Excel.

edited Nov 27 '21 at 18:54

DisappointedByUnaccountableMod


asked Nov 16 '21 at 10:04

seanb-latex

What is the problem you are facing? – Niko Föhr Nov 16 '21 at 10:05

Camelot raises an error saying that the file is not in the correct format. It looks like the table needs to be very clearly outlined in the PDF, without too much else going on on the page. – seanb-latex Nov 16 '21 at 11:00
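A format error like the one described can be narrowed down before camelot is involved at all, by checking that each path points to an existing file that actually starts with the PDF marker. The following is only a minimal standard-library sketch; the function names `looks_like_pdf` and `split_valid_pdfs` are hypothetical and not part of camelot or tabula.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is an existing file whose first bytes contain the PDF marker."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        # The "%PDF-" marker normally sits at the very start of the file;
        # reading 1024 bytes tolerates a little leading junk.
        return b"%PDF-" in fh.read(1024)


def split_valid_pdfs(paths):
    """Partition paths into (usable, rejected) lists based on looks_like_pdf."""
    usable, rejected = [], []
    for path in paths:
        (usable if looks_like_pdf(path) else rejected).append(path)
    return usable, rejected
```

Calling `split_valid_pdfs(pdfs)` on the list built in the question would show whether the paths resolve to real PDF files; anything in the rejected list is either missing or not a PDF, which is a different problem from camelot failing to detect a table.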
But can Python "scan" these PDFs and scrape the tables correctly into Excel? Or do these packages need more parameters specified? Some of the PDFs I am dealing with contain 200+ pages, and all I need is to get those tables into Excel for analysis. Thanks! – seanb-latex Nov 18 '21 at 08:33
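For text-based PDFs, a loop of the kind asked about could look roughly like the sketch below: it collects the PDFs in a folder, asks camelot for the tables on all pages, and writes each table to its own sheet in one workbook per PDF. The function names, the `table_N` sheet names and the one-workbook-per-PDF layout are assumptions made for illustration, and writing `.xlsx` through pandas assumes `openpyxl` is installed. Camelot only works on PDFs that have a text layer, so scanned, image-only PDFs would need OCR first.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return sorted full paths of the PDF files directly inside `folder`."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def excel_path_for(pdf_path, out_dir):
    """Map e.g. reports/a.pdf to <out_dir>/a.xlsx."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xlsx")


def export_tables(pdf_path, out_dir, flavor="stream"):
    """Write every table camelot finds in one PDF to one workbook, one sheet per table.

    Returns the number of tables written (0 means no workbook was created).
    """
    # Imported here so the path helpers above work without the heavy dependencies.
    import camelot
    import pandas as pd

    tables = camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor)
    if len(tables) == 0:
        return 0
    target = excel_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with pd.ExcelWriter(target) as writer:
        for i, table in enumerate(tables, start=1):
            # table.df is the pandas DataFrame camelot built for this table
            table.df.to_excel(writer, sheet_name=f"table_{i}",
                              index=False, header=False)
    return len(tables)


def export_folder(folder, out_dir):
    """Process each PDF independently so one bad file does not stop the batch."""
    counts = {}
    for pdf in find_pdfs(folder):
        try:
            counts[pdf.name] = export_tables(pdf, out_dir)
        except Exception as exc:  # broad on purpose: record the failure and continue
            counts[pdf.name] = f"failed: {exc}"
    return counts
```

`export_folder(FOLDER_PATH, "excel_out")` would return a dict mapping each file name to either a table count or a failure message, which makes it easier to see which of the 200+ page documents need a different `flavor` or extra parameters such as `columns`.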


0 Answers